From Data to Knowledge with Workflows & Provenance
Bertram Ludäscher
Graduate School of Library and Information Science (GSLIS)
Affiliate: National Center for Supercomputing Applications (NCSA), Department of Computer Science (CS @ Illinois)
NCSA Colloquium Sep 12, 2014 Data to Knowledge w/ Scientific Workflows & Provenance B. Ludäscher
Outline
• About Yours Truly
  – … where I’m coming from
  – … strange loops …
• From Data To Knowledge …
• … Scientific Workflows (CI “Upper-Ware”)
• … and Provenance (part of CI “Underware”)
• Other Research Interests & Projects
  – Reprise (… me not)
  – Sept. 19: CIRSS Seminar @ GSLIS (Reasoning about Taxonomies)
  – Sept. 23: (Oct 7) Yahoo!-DAIS Seminar @ CS (First-order Provenance Games)
Some Personal Provenance …
• Studies of Computer Science at Uni Karlsruhe (TH)
  – … my Alma Mater now defunct!?? :-(
  – … deus ex machina: K.I.T. (Karlsruhe Institute of Technology) :-)
  – Fridericiana Polytechnic (1825) ... TU Karlsruhe (1865) ... KIT (2009)
• Undergrad work: Task-Setup Service (TSS)
  – part of HECTOR (HEterogeneous Computers TOgetheR, IBM & U-KA), top layer above DACNOS (Distributed Academic Network Operating System)
  – early “upper-ware”!
• … (scientific) workflows!!
Sacred Scrolls … Prophesying the Grid (DACNOS) & workflows (TSS)
Foerster, Cora. "Controlling Distributed User Tasks in Heterogeneous Networks." In HECTOR: Heterogeneous Computers Together. A Joint Project of IBM and the University of Karlsruhe. Springer Berlin Heidelberg, 1988.
“All this has happened before, and all this will happen again”
… too much C hacking … on to AI & Logic!
• Workflows? Hacking?
  – Boring…
• Databases??
  – Boooring!!
• AI, Logic Programming?
  – Sounds good!
  – Non-monotonic reasoning
    • Well-founded semantics
    • Stable models (now ASP)
• MSc (Diplom)
  – First-order theorem prover (BDD variant)
“All this has happened before, and all this will happen again”
… and onto (logic) databases!
• PhD at University of Freiburg
“All this has happened before, and all this will happen again”
… fast forward to the present (back to the future!)
• Datalog becomes popular again:
  – Datalog 2.0 in Oxford and Vienna: The resurgence of Datalog in academia and industry
• Statelog is in demand again
  – The Declarative Imperative: Experiences and Conjectures in Distributed Logic. Joe Hellerstein. PODS Keynote, 2010.
• LogicBlox Inc. (Atlanta)
  – Re-invent how enterprise software is built
  – Under the hood: LogiQL
    • … a high-performance Datalog engine
Vision: Re-invent how enterprise software is built (Molham Aref, LogicBlox)
Unified runtime based on Datalog
Language: Datalog Plus
• Skolem functions
• Existentials in the head
• Meta-programming layer
• Integration with LP solvers
• Expressive constraints
• ...
Execution engine, cloud:
• Cost-based optimizer
• Versioned data structures
• Full serializability
Execution engine, browser:
• Compiled to JavaScript
… down by the sea!
• 1998-2004 SDSC & CSE Dept
  – NARA, digital libraries
    • w/ Reagan Moore
  – Data Integration research
  – Started Kepler
    • w/ Matt Jones, Ilkay Altintas, …
    • Head start: Ptolemy II (open source)
      – EECS @ Berkeley (E.A. Lee)
  – Naming things is fun!
    • Mediation of Information in XML (MIX)
    • Blended Browsing & Querying (BBQ)
    • Knowledge-based Information Integration of Neuroscience Data (KIND)
    • Ptolemy.. Copernicus … Kepler!
    • Neon… Geosciences … Network … GEON!
… from SoCal to NorCal … to the Midwest!
• 2004-2014 UC Davis
• Major projects (finished)
  – Kepler/CORE, pPOD, ChIP-chip, COMET, SDM, REAP
• Ongoing & new:
  – FilteredPush
  – Euler, Exploring Taxon Concepts
  – DataONE
  – Kurator
• Research themes (& names :-)
  – Scientific data mgmt, workflows, provenance, KR&R, data curation …
  – Kepler/COMAD, X-CSR, Euler …
UC DAVIS Department of Computer Science
The 4th Paradigm
• CI, e-Science
• bioinformatics
• ecoinformatics
• geoinformatics
• Big Data
• Data Science
• Information Science
• Digital Humanities …
Scientific Workflows: Cyberinfrastructure “Upperware”
Underware
Middleware
Upper Middleware
Upperware
NSF/SEEK ITR collaboration (2002-2008): SDSC, UCSB, UC Davis, UNM, UK, …
Problem: Stitching together Tools and Databases
• Tool Integration
  – local, remote, tools, services, databases, applications
  – BLAST on myPC?
  – My R script on the cluster?
• Data Handling
  – Where’s the data? Access methods?
  – A.out doesn’t fit B.in
  – Many runs, experiments
• Automate, optimize, scale, reuse, share wfs
• “Explain” results
“Integration Technologies” for Data, Tools, Models
• State of the art in tool integration often involves plumbing, stitching, and stapling …
Scientific Workflows: ASAP!
• Automation
  – wfs to automate computational aspects of science
  – batch processing, scripting
• Scaling (exploit and optimize machine cycles)
  – wfs should make use of parallel compute resources
    • dataflow-orientation avoids von Neumann bottleneck
    • use parallel MoCs when deploying on cluster, cloud
  – wfs should be able to handle large data
• Abstraction, Evolution, Reuse (human cycles)
  – wfs should be easy to change, evolve, share, reuse
• Provenance
  – wfs should capture processing history, data lineage
    ⇒ traceable data- and wf-evolution
    ⇒ Reproducible Science
WATERS: Workflow for Alignment, Taxonomy, Ecology of Ribosomal Sequences (Amber Hartman; Eisen Lab; UC Davis)
[Figure: WATERS pipeline. Boxes shown: Assembled contigs; Chimera check (Mallard); Profile alignment (STAP or Infernal); Find OTUs (OTUHunter); Assign Taxonomy (STAP); Build phylogenetic tree (RAxML or Quicktree); View tree: Dendroscope; UniFrac: tree & environment file; Diversity statistics (text: OTU list, Chao1, Shannon; graphs: rarefaction curves, rank-abundance curves); Visualization tools: Cytoscape networks & heat map. Some steps are marked “+/- cipres” or “+/- cluster”.]
Executable WATERS Workflow in Kepler
Example Bioinformatics Workflow: Motif-Catcher
Marc Facciotti et al. UC Davis Genome Center
Motif-Catcher workflow, implemented in Kepler
S Köhler et al. Improved Motif Detection in Large Sequence Sets with Random Sampling in a Kepler workflow, ICCS-WS, 2012
A Data-Streaming Workflow over Sensor Data
Kepler Workflows & Decision Making (Kruger Natl. Park, South Africa)
SANParks Matt Jones, NCEAS @ UC Santa Barbara
Scientific workflows: a(nother) silver bullet?
Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy.
—Alan Perlis
Scientific Workflow Design: Some Challenges
“And the graphical UI makes our scientific workflows so much easier to develop, understand and maintain!”
Human Cycles vs Machine Cycles
• Traditional Computer Science and HPC focus:
  – optimize algorithms, save machine cycles
  – massively parallelize execution
• The most expensive cycles:
  – Human cycles!
  – Big scalability issues …
    • cf. Bernie’s “Big Data” ~ big problems with data!
• Not either one or the other:
  – … better together! (cf. BSG)
Overview: My Scientific Workflow Research
Modeling & Design
Provenance
Parallel Execution
Fault-Tolerance, Crash Recovery
Programming in the large?
• Monitor and control supercomputer simulations
  – 50+ composite actors (subworkflows)
  – 4 levels of hierarchy
  – 1000+ atomic (Java) actors
[Figure labels for the subworkflows: 43 actors, 3 levels; 196 actors, 4 levels; 30 actors; 206 actors, 4 levels; 137 actors; 33 actors; 150; 123 actors; 66 actors; 12 actors; 243 actors, 4 levels]
Norbert Podhorszki, ORNL (then: UC Davis)
"Structured Plumbing" in Kepler
Cabellos et al. Computer Physics Communications 182, 2011
Modeling & Design: “The limits of my language mean the limits of my world”
• Vanilla Process Network
• Functional Programming Dataflow Network
• XML Transformation Network
• Collection-oriented Modeling & Design framework (COMAD)
– “Look Ma: No Shims!”
Problems with [too many] Shims and Wires
• Shims need to be placed and connected
  – Tedious, error-prone
• Distract from scientifically meaningful actors
  – Non-descriptive workflows – worth sharing?
• Data organization is encoded in workflow structure
  – Not robust to data changes
• Shims often lead to complex designs
  – Imagine all previous “design patterns” intertwined
  – GOTO-programming
COMAD/VDAL: Raising the level of abstraction
• Localized control-flow
• Data management not done via wires
• Actors are coupled not by wire but by data!
Pipelined Collection-Oriented Workflows
Collection-Oriented Modeling & Design (COMAD)
– fully embrace the assembly line metaphor
– data = tagged nested collections
– e.g. represented as flattened, pipelined (XML) token streams
Actors (like assembly line workers) pass on what they don’t work on
T McPhillips, S Bowers, D Zinn, B Ludäscher
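The assembly-line idea above can be illustrated with a minimal Python sketch. It is not COMAD or Kepler code: the token format, `make_actor`, and the collection tags `Project`, `Notes`, and `Seqs` are all hypothetical names chosen for illustration. Each actor reacts only to collections carrying its own tag, puts its result back into the stream, and passes every other token on unchanged.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical token format: (kind, value), where kind is "open", "close",
# or "data". For "open"/"close" tokens the value is the collection tag.
Token = Tuple[str, str]


def make_actor(scope: str, func: Callable[[List[str]], str]):
    """Return an actor that only works on collections tagged `scope`."""

    def actor(stream: Iterable[Token]) -> Iterator[Token]:
        inside = False
        items: List[str] = []
        for kind, value in stream:
            if kind == "open" and value == scope:
                inside, items = True, []
            elif kind == "data" and inside:
                items.append(value)
            elif kind == "close" and value == scope and inside:
                # Put the result back into the stream, just before the
                # collection is closed.
                yield ("data", func(items))
                inside = False
            # Every incoming token is passed on unchanged.
            yield (kind, value)

    return actor


stream = [
    ("open", "Project"),
    ("open", "Notes"),
    ("data", "check lab book"),
    ("close", "Notes"),
    ("open", "Seqs"),
    ("data", "ACGT"),
    ("data", "GGCA"),
    ("close", "Seqs"),
    ("close", "Project"),
]

count_seqs = make_actor("Seqs", lambda items: f"count={len(items)}")
shout = make_actor("Notes", lambda items: " ".join(items).upper())

# Actors are chained like stations on an assembly line; no shims or extra
# wires are needed to route the data each actor ignores.
result = list(shout(count_seqs(stream)))
for token in result:
    print(token)
```

Because the actors are generators, tokens flow through the chain one at a time, which loosely mirrors the pipelined execution the slide describes. The sketch ignores nested collections with the same tag.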
Layers in COMAD / VDAL Pipelines
WF Graph
Configurations (white-box)
Scientific Functions (black-boxes)
CipresRAxML In: DNASeq+
Thres: Float
Method: String
Out: (t:Tree, s:score)+
• Access data in XML stream
• Call Scientific Functions (Services)
• Put results back into stream
COMAD/VDAL Actor Execution Semantics
Two different workflow designs
• Hardwiring vs. configurable data/collection management
• brittle vs. change-resilient designs
• scientist can recognize napkin drawing/conceptual model
• Human cycles are expensive
ADIOS in Kepler
ADIOS in COMAD
Conceptual Pipeline w/ Scopes & Types
Daniel Zinn et al. ICDE’09
Optimizing Execution Schedules: Paral·lel
Paral·lel (Barcelona Metro)
X-CSR (“XML Scissor”): Cut-Ship-Reassemble
Daniel Zinn et al. ICDE’09
Workflow Execution Analysis and Optimization
[Figure: Actor A feeds Actor B through a queue; tokens d1, d2, d3 travel inside a <C> … </C> collection through the COMAD layer; schedules are compared for COMAD, Kepler PN, and the optimal schedule]
Analysis + Data mining
Sven Köhler
Dataflow Network (generic) and Views
Kahn Process Networks
Kahn, Gilles & David MacQueen. "Coroutines and networks of parallel processes." (1976).
Kahn, Gilles. "The semantics of a simple language for parallel programming." (1974)
Synchronous Dataflow (SDF)
Lee, Edward A., and David G. Messerschmitt. "Synchronous data flow." Proc. of the IEEE 75, no. 9 (1987): 1235-1245.
Workflow Recovery in SDF
Idea: “Rescue DAG” (cf. Condor/DAGMan)
Sven Köhler et al. Improving Workflow Fault Tolerance through Provenance-Based Recovery. SSDBM 2011
COMAD
VisTrails [Juliana Freire, et al]
Restflow (Tim McPhillips)
So many MoCs, so little time …
Outline
• About Yours Truly
  – … where I’m coming from
  – … strange loops …
• From Data To Knowledge …
• … Scientific Workflows (CI “Upper-Ware”)
• … and Provenance (part of CI “Underware”)
• Other Research Interests & Projects
  – Reprise (… me not)
  – Sept. 19: CIRSS Seminar @ GSLIS (Reasoning about Taxonomies)
  – Sept. 23: Yahoo!-DAIS Seminar @ CS (First-order Provenance Games)
From “Climate Gate” to Reproducible Science
Capturing provenance is crucial for transparency, interpretation, debugging, …
=> repeatable experiments
=> reproducible science
=> need workflow-system agnostic model
Data & Provenance Management: Model Chains
The Data Life Cycle
From Data Life-Cycle to Curation Life-Cycle
Uncanny Resemblance: Eye of Jupiter (“Vision Thing”?)
DCC Curation Lifecycle
Common Uses of Provenance Data in Science
• Audit trail: trace data generation and possible errors
• Attribution: determine ownership and responsibility for data and scientific results
• Data quality: from quality of input data, computations
• Discovery: enable searching of data, methodologies and experiments
• Replication: facilitate repeatable derivation of data to maintain currency
⇒ Reproducible Science
But: different MoCs imply different Observables (and “Knowables”) ⇒ different MoPs
The Executable Paper
Executable Paper Grand Challenge, International Conference on Computational Science, ICCS 2011
The Collage Authoring Environment, Piotr Nowakowski et al.
Motivation: Virtual Joint Experiments
• How do we ensure that Charlie gets a complete account of the history of WC’s outputs?
• How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob’s data v?
⇒ traces TA and TB will be critical
⇒ need to compose them to obtain TC
We can view the composition WC as a new, virtual workflow
[Figure: Alice (1) develops workflow WA and (2) runs it (run RA, trace TA); Bob (3) develops WB and (5) runs it (run RB, trace TB) after (4) data sharing; Charlie (6) inspects provenance and (7) understands and generates the composed workflow WC. Data items shown: x, z, u, v.]
Provenance Composition: the Data Tree of Life (DToL)
• We can formulate our questions in terms of provenance of the datasets produced by virtual workflow WC:
  – What is the complete provenance of v?
    • Answering the question requires tracing v’s derivation all the way to x
    • But, to achieve this, we need to ensure:
      – TA and TB are properly connected
      – Provenance queries run seamlessly over and across TA and TB
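A small Python sketch may help make the composition requirement concrete. It rests on assumptions that the slide does not spell out: each trace is modeled as a plain mapping from a data item to the items it was directly derived from, Alice's trace TA records that z came from x, Bob's trace TB records that v came from u, and the data-sharing step is modeled as an explicit link from u back to z. The names `compose` and `lineage` are hypothetical.

```python
from typing import Dict, Set

# Hypothetical trace model: data item -> items it was directly derived from.
Trace = Dict[str, Set[str]]


def compose(*traces: Trace) -> Trace:
    """Merge several traces into one by taking the union of their edges."""
    merged: Trace = {}
    for trace in traces:
        for item, parents in trace.items():
            merged.setdefault(item, set()).update(parents)
    return merged


def lineage(trace: Trace, item: str) -> Set[str]:
    """Return every item that `item` transitively depends on."""
    seen: Set[str] = set()
    stack = [item]
    while stack:
        for parent in trace.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


T_A: Trace = {"z": {"x"}}     # Alice's run: z derived from x
T_B: Trace = {"v": {"u"}}     # Bob's run: v derived from u
shared: Trace = {"u": {"z"}}  # assumed link for the data-sharing step

# Without the link, v's history stops at Bob's input.
print(sorted(lineage(compose(T_A, T_B), "v")))          # ['u']
# With the link, the query runs across both traces, back to x.
print(sorted(lineage(compose(T_A, T_B, shared), "v")))  # ['u', 'x', 'z']
```

The two printed results show why the traces must be "properly connected": the same query only reaches Alice's input x once the sharing step is part of the composed trace.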
Scientific Workflow Provenance in Action
[Figure labels: WF Engine, ProvExplorer, ReproZip, DataONE]
Data Quality & Curation Workflows
• Collections & occurrence data is all over the map
  – … literally (off the map!)
• Issues:
  – Lat/Long transposition, coordinate & projection issues
  – Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it)
• Filtered-Push Collaboration
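As an illustration of the kind of check a curation workflow might run for the lat/long transposition issue listed above, here is a minimal Python sketch. It is not FilteredPush or Kurator code; `check_coordinates` and its return labels are hypothetical. It only flags a record for review rather than silently repairing it, and it can only detect transpositions where the stated latitude falls outside the valid range.

```python
def is_valid(lat: float, lon: float) -> bool:
    """True if the pair lies within the usual latitude/longitude ranges."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def check_coordinates(lat: float, lon: float) -> str:
    """Classify a coordinate pair as ok, possibly transposed, or invalid."""
    if is_valid(lat, lon):
        return "ok"
    if is_valid(lon, lat):
        # The pair only makes sense when swapped: flag it for a curator.
        return "possibly transposed"
    return "invalid"


print(check_coordinates(38.54, -121.74))   # ok
print(check_coordinates(-121.74, 38.54))   # possibly transposed
print(check_coordinates(200.0, 300.0))     # invalid
```

A pair such as (38.54, 45.0) passes this test either way round, so a range check like this would need to be combined with other evidence, for example a locality description or a human expert in the loop.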
Filtered-Push: Kurator (Data Curation Workflows)
Tianhong Song
Lei Dou (former member)
Sven Köhler
Data Curation Pipeline (w/ your friends in the loop)
[SPHNC’2011]
Curation Workflow: Features
• Human-in-the-loop
  – You “wrapped” your buddies/experts into the workflow!
• Uses Open Authorization
• Certain changes captured in the data
  – ... by workflow developer/engineer
  – Highlighted in the spreadsheets (cf. “duplicate records”)
• Automatic capture of provenance information
  – data lineage and processing history
• Provenance information
  – can be visualized, browsed, and queried
[SPHNC’2011]
Koogle: Google Cloud + Kepler
[SPHNC’2011]
Koogle Kuration package: Kepler + Google cloud (esp. spreadsheet) services
actors     functions
importer   import data to a spreadsheet
exporter   export data from a spreadsheet
copy       copy a spreadsheet from a template
share      share the spreadsheet with another user
query      query data from the spreadsheet
auditor    allow human interaction during the execution of the workflow
[SPHNC’2011]
You’ve got Mail! (Two curation requests)
[SPHNC’2011]
Inspect, edit (if necessary), submit!
[SPHNC’2011]
… second request
[SPHNC’2011]
DONE! Summary message…
[SPHNC’2011]
[SPHNC’2011]
http://www.youtube.com/watch?v=DEkPbvLsud0
[SPHNC’2011]
FilteredPush Curation Provenance (Spreadsheet View)
… and then there is One More Thing …
An End-to-End Climate Workflow
[Figure: end-to-end climate workflow. Boxes shown: Data Repository; Search Data; Process Data; Model Inputs; Configure Climate Model; Build Climate Model; Run Climate Model; Model Outputs; Exploration, Visualization, & Analysis; Uncertainty Quantification; Diagnostics Generation; Exploratory Analysis; Model Benchmarking; Archive Data; Repository]
Src: Yaxing Wei, ORNL (EVA WG)
Model Benchmarking using UV-CDAT
Workflow
Result
Src: Yaxing Wei, ORNL (EVA WG)
DataONE Provenance & Semantics Use Case
The North American Carbon Program Multi-Scale Synthesis and Terrestrial Model Intercomparison Project
D. N. Huntzinger, C. Schwalm, A. M. Michalak, K. Schaefer, A. W. King, Y. Wei, A. Jacobson, S. Liu, R. B. Cook, W. M. Post, G. Berthier, D. Hayes, M. Huang, A. Ito, H. Lei, C. Lu, J. Mao, C. H. Peng, S. Peng, B. Poulter, D. Ricciuto, X. Shi, H. Tian, W. Wang, N. Zeng, F. Zhao, and Q. Zhu
Provenance
• Externally facing
• Internally facing
D-OPM: DataONE version of OPM for sci-wf
D-OPM (DataONE ProvWG)
OPM-W: Daniel Garijo, Yolanda Gil
Structural Integrity: Traces ⇒ Workflows
Structural integrity
Implied temporal constraints
Temporal constraint declaration
Logic / Rule-based Provenance Analyzer
Related: Prov-WG
Saumen Dey
From Models of Computation to Models of Provenance
M. Anand, S. Bowers, et al., SSDBM’09
Fine-grained, Data & MoC-aware MoP
M. Anand, S. Bowers, et al., SSDBM’09
Hamming Numbers (executable Kepler workflow)
Compute Hamming numbers H in order, where H = { 2^i · 3^j · 5^k : i, j, k ≥ 0 }, a.k.a. regular numbers or 5-smooth numbers (numbers whose prime divisors are less than or equal to 5).
Babylonian clay tablet with annotations. The diagonal displays an approximation of the square root of 2 in four sexagesimal figures, which is about six decimal figures. 1 + 24/60 + 51/60^2 + 10/60^3 = 1.41421296...
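For readers who want to see the sequence itself, the following Python sketch generates Hamming numbers in increasing order. It is not the Kepler workflow from the slide, which is built from dataflow actors; this version swaps in a priority queue as a compact stand-in, and the function name `hamming` is just an illustrative choice.

```python
import heapq
from itertools import islice
from typing import Iterator


def hamming() -> Iterator[int]:
    """Yield the numbers 2^i * 3^j * 5^k in increasing order."""
    heap = [1]
    seen = {1}
    while True:
        h = heapq.heappop(heap)  # smallest number not yet emitted
        yield h
        for p in (2, 3, 5):
            candidate = h * p
            if candidate not in seen:  # e.g. 6 is reachable as 2*3 and 3*2
                seen.add(candidate)
                heapq.heappush(heap, candidate)


first = list(islice(hamming(), 15))
print(first)  # [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24]
```

Each emitted number h spawns 2h, 3h, and 5h, which is the same multiply-and-merge idea that a dataflow version expresses with separate multiplier and merge actors.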
Two Hamming workflow variants: H1 vs. H3
It's Quiz-Time again!
[Figure: Hamming trace, with nodes labeled X2, X3, X5, S2, S3, S5, Q1 to Q8, M1, M2]
Does it match Hamming Workflow H1?
… or Hamming Workflow H3 ??
Hamming Traces – "Debugged"
[Figure: two trace graphs whose nodes are Hamming numbers from 1 to 1000: Trace of H1 ("Fish") and Trace of H3 ("Sail")]
Provenance & Privacy (ProPub: Provenance Publisher)
Saumen Dey, UC Davis
Meet Prof. Nico Franz: Curator of Insects @ ASU
From Tool Users to Tool Makers
Screen capture… back to the original definition
Conclusion: Better Together
• Human & Machine Cycles
  – Better information and workflow modeling (COMAD/VDAL)
  – and more scalable execution (X-CSR, tagged dataflow, …)
• Theory & Practice
  – Experimental theory (CS problems + ASP + Info Vis)
    • e.g. rediscovering Dedekind numbers via taxonomy debugging
      – D(N) = |monotone Boolean functions over N variables|
  – Information Science & Software-Carpentry
    • Support tool makers!
• Big Data, Data Science, and all the rest!
  – Excited to work at the intersection of GSLIS & NCSA & CS!
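The definition of D(N) given in the conclusion can be checked by brute force for small N. The Python sketch below is an illustration only and has nothing to do with the taxonomy-debugging route mentioned on the slide; the function name `dedekind` is hypothetical. It enumerates every Boolean function over N variables as a truth table and counts those that are monotone, which is only feasible for very small N because there are 2^(2^N) truth tables.

```python
from itertools import product


def dedekind(n: int) -> int:
    """Count the monotone Boolean functions over n variables by brute force."""
    points = list(product((0, 1), repeat=n))
    index = {p: i for i, p in enumerate(points)}
    # All pairs (a, b) of inputs with a <= b in every coordinate.
    ordered_pairs = [
        (index[a], index[b])
        for a in points
        for b in points
        if all(x <= y for x, y in zip(a, b))
    ]
    count = 0
    # A truth table assigns an output bit to each of the 2^n inputs.
    for table in product((0, 1), repeat=len(points)):
        # Monotone: raising any input bit never lowers the output.
        if all(table[i] <= table[j] for i, j in ordered_pairs):
            count += 1
    return count


values = [dedekind(n) for n in range(4)]
print(values)  # [2, 3, 6, 20]
```

The next value, D(4) = 168, is still within reach of this approach, but the doubly exponential growth in truth tables makes it impractical soon after.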