Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life


Page 1: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

UC DAVIS Department of Computer Science

The Kepler/pPOD Team

Shawn Bowers, Timothy McPhillips, Sean Riddle, Manish Anand, and Bertram Ludäscher

DAKS Lab, Genome Center, Univ. of California at Davis
Dept. of Computer Science, Univ. of California at Davis

Page 2: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Background

“The AToL initiative (Assembling the Tree of Life) is a large research effort sponsored by the National Science Foundation. Its goal is to reconstruct the evolutionary origins of all living things.” – http://atol.sdsc.edu

AToL projects
• Investigate relationships among specific groups of organisms
• Develop new computational techniques
• Expectation that projects will collaborate & share data

Technology barriers
• Exchanging data between collaborators & other projects
• Data “lives” in many different kinds of applications
• Similar analyses performed, but ad hoc (manually or scripts)
• Provenance of data and results

Page 3: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Project Overview

pPOD (processing phylodata)
• Develop core database technologies for the AToL community
• Data access, data integration, scientific analysis, provenance
• Collaboration among Univ. of Pennsylvania, Yale Univ., Univ. of Florida, and UC Davis

Kepler/pPOD @ UC Davis
• Scientific workflows for phylogenetic data analysis
• Workflow execution and data provenance

Page 4: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Basic architecture

Workflow Automation (Kepler/pPOD)
• Tools & analyses
• Integrate w/ data model
• Provenance recording within and across workflow runs

Data Integration & Exchange (Orchestra)
• Application schema mappings
• Curation (w/ provenance)
• Privacy and trust policies
• P2P support

Existing Applications
• Tolkin
• TreeBASE
• AToL LabDB
• mappings to core model (via Orchestra)

Core AToL Data Model
• Data types for sequences, trees, …
• Provenance relationships
• Expressive query language (OQL)
• Persistence tools

Page 5: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Kepler/pPOD workflows

Uses
• Sequence alignment, tree inference, post-tree analysis, …
• Track analyses run and data produced within projects
• Use, test, compare different computational techniques

Characteristics
• Exploratory (design, run, modify, commit, …)
• Intertwined with manual steps (e.g., edit alignment)
• Many formats, few data types (sequences, trees, matrices, …)
• Pipelined (e.g., multiple sets of sequences)

Kepler/pPOD Status
• “Preview release” of Kepler/pPOD: Kepler + pPOD extensions
• workflow design (via Comad)
• wrapped apps: Phylip, Clustal, MrBayes, RAxML, tree drawing, …
• provenance recording and browsing

Page 6: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Kepler/pPOD workflows

new director
• data types, collections
• assembly-line processing
• provenance enabled

actor library
• Cipres web services
• local applications
• format conversion

GUI components
• workspace extension
• access to workflows
• access to run “traces”

Page 7: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Kepler/pPOD workflows

integrated provenance browser
• data & process dependencies
• “forward” & “rewind” run
• multiple views

Page 8: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Comad: “Virtual Assembly Lines”

• Actors select parts of token stream, forward rest
• Special tokens denote collections, metadata, & parameters
• Actors insert tokens into and remove tokens from stream

Some advantages of Comad
• workflows with loops, branches, composition (subworkflows)
• concurrency, pipelining
• resilient to change (data nesting, add/remove actors)
• simpler workflow designs

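[Figure: a token stream passing through a “Compute Consensus” actor. The stream carries a Proj collection containing Seqs (S1 … S10), Aligns (A1, A2), and Trees (T1 … T5), delimited by tags such as <Proj> … </Proj>, <Seq> … </Seq>, <Aligns> … </Aligns>, and <Trees> … </Trees>. A token T6 appears in the stream after the actor.]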
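The “virtual assembly line” idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kepler/pPOD code: the `Open`, `Close`, and `Data` token classes and the `consensus_actor` function are hypothetical names, and the inserted “consensus” token is a placeholder string rather than a real consensus-tree computation.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Union


@dataclass(frozen=True)
class Open:
    """Special token that opens a collection, e.g. <Trees>."""
    name: str


@dataclass(frozen=True)
class Close:
    """Special token that closes a collection, e.g. </Trees>."""
    name: str


@dataclass(frozen=True)
class Data:
    """Ordinary data token (a sequence, alignment, tree, ...)."""
    value: str


Token = Union[Open, Close, Data]


def consensus_actor(stream: Iterable[Token]) -> Iterator[Token]:
    """Act only on Trees collections; forward every token it reads."""
    in_trees = False
    trees: List[str] = []
    for tok in stream:
        if isinstance(tok, Open) and tok.name == "Trees":
            in_trees, trees = True, []
        elif isinstance(tok, Close) and tok.name == "Trees" and in_trees:
            # Insert a new token just before the collection closes.
            # Placeholder only: a real actor would compute a consensus tree.
            yield Data("consensus(" + ",".join(trees) + ")")
            in_trees = False
        elif in_trees and isinstance(tok, Data):
            trees.append(tok.value)
        yield tok  # the token just read is passed downstream unchanged


stream = [
    Open("Proj"),
    Open("Seqs"), Data("S1"), Data("S2"), Close("Seqs"),
    Open("Trees"), Data("T1"), Data("T2"), Close("Trees"),
    Close("Proj"),
]
result = list(consensus_actor(stream))
```

Because the actor reacts only to the Trees collection, the Seqs tokens pass through untouched, and wrapping the project in a further collection would not require changing the actor. That is one way to read “resilient to change (data nesting, add/remove actors)”.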

Page 9: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

… but (efficiently) representing provenance?

Many approaches require storing all input and output for each actor invocation (transformers)
• can lead to significant redundancy in Comad

We use an “XML-diff” approach augmented with data provenance
• special provenance tokens …
• insertions, (marked) deletions, invocation dependencies
• exploit collections and apply inference rules
• only store final result containing input and provenance

[Figure: actor invocation A1 with input X and output Y, recorded two ways]
• “Conventional”: all of X and Y stored for A1
• “Comad”: store change and explicit dependencies for A1, marked as ins(A1) and del(A1)
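The contrast on this slide can be made concrete with a small Python sketch. It is a simplified illustration, not the Kepler/pPOD data structures: the `Item` record, the `invoke` and `visible` helpers, and the example invocations are all hypothetical. Each invocation is recorded as a diff (insertions, marked deletions, and the values it read) on a single result, instead of as full input and output copies.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Item:
    """One data token in the final result, annotated with provenance."""
    value: str
    inserted_by: Optional[str] = None  # None means original workflow input
    deleted_by: Optional[str] = None   # invocation that marked it deleted
    depends_on: Tuple[str, ...] = ()   # values the inserting invocation read


def invoke(items: List[Item], invocation: str, reads: Iterable[str],
           inserts: Iterable[str] = (), deletes: Iterable[str] = ()) -> None:
    """Record one invocation as a diff against the current result."""
    to_delete = set(deletes)
    for it in items:
        if it.value in to_delete and it.deleted_by is None:
            it.deleted_by = invocation  # marked, not physically removed
    for value in inserts:
        items.append(Item(value, inserted_by=invocation,
                          depends_on=tuple(reads)))


def visible(items: List[Item]) -> List[str]:
    """Values still present, i.e. not marked as deleted."""
    return [it.value for it in items if it.deleted_by is None]


# Hypothetical two-step run: A1 aligns the sequences, A2 infers a tree
# and removes the intermediate alignment.
items = [Item("S1"), Item("S2")]
invoke(items, "A1", reads=["S1", "S2"], inserts=["Align1"])
invoke(items, "A2", reads=["Align1"], inserts=["T1"], deletes=["Align1"])

final_output = visible(items)
original_input = [it.value for it in items if it.inserted_by is None]
```

The single `items` list holds the input, the output, and the dependencies, which is the sense in which only the final result needs to be stored. The collection-based inference rules mentioned on the slide are not modeled here.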

Page 10: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Kepler/pPOD Provenance Browser

• Reusable “widgets” for viewing different aspects of a trace
• Move “forward” and “backward” through execution
• Data dependencies, collection structure, actor invocations
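Moving “forward” and “backward” through an execution can be sketched as replaying a recorded trace up to a cursor position. The `Step` and `TraceCursor` names below are hypothetical and the trace format is an assumption made for illustration; the actual browser may work differently.

```python
from typing import List, NamedTuple, Tuple


class Step(NamedTuple):
    """One recorded actor invocation and the changes it made."""
    invocation: str
    inserted: Tuple[str, ...]
    deleted: Tuple[str, ...]


class TraceCursor:
    """Steps forward and backward through a recorded run."""

    def __init__(self, inputs: List[str], steps: List[Step]) -> None:
        self.inputs = list(inputs)
        self.steps = list(steps)
        self.position = 0  # number of steps applied so far

    def forward(self) -> None:
        if self.position < len(self.steps):
            self.position += 1

    def backward(self) -> None:
        if self.position > 0:
            self.position -= 1

    def state(self) -> List[str]:
        """Data visible after the first `position` steps."""
        data = list(self.inputs)
        for step in self.steps[: self.position]:
            data = [d for d in data if d not in step.deleted]
            data.extend(step.inserted)
        return data


cursor = TraceCursor(
    ["S1", "S2"],
    [Step("A1", ("Align1",), ()), Step("A2", ("T1",), ("Align1",))],
)
cursor.forward()
cursor.forward()
after_run = cursor.state()
cursor.backward()
after_first_step = cursor.state()
```

Recomputing the state from the start keeps the sketch short. A viewer for long traces would more likely apply and undo one step at a time.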

Page 11: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Kepler/pPOD Provenance Browser

• Collection and invocation view
• Incrementally step through execution history
• Actor invocation graph shows pipelining, implicit branches
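One way an invocation graph could be derived from recorded data dependencies is shown below. This is a sketch under assumptions: the `trace` dictionary, its invocation names, and the `invocation_graph` function are hypothetical and are not taken from the Kepler/pPOD implementation.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical trace: invocation -> (items read, items written)
trace: Dict[str, Tuple[Set[str], Set[str]]] = {
    "Align:1": ({"S1", "S2"}, {"A1"}),
    "Align:2": ({"S3", "S4"}, {"A2"}),
    "Infer:1": ({"A1"}, {"T1"}),
    "Infer:2": ({"A2"}, {"T2"}),
}


def invocation_graph(
    trace: Dict[str, Tuple[Set[str], Set[str]]]
) -> List[Tuple[str, str]]:
    """Return an edge (p, c) whenever invocation c read an item p wrote."""
    edges = []
    for producer, (_, written) in trace.items():
        for consumer, (read, _) in trace.items():
            if producer != consumer and written & read:
                edges.append((producer, consumer))
    return sorted(edges)


edges = invocation_graph(trace)
```

In this example the result is two disconnected chains, one per set of sequences. Independent chains of this kind are how pipelined processing and implicit branches would show up in such a graph.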

Page 12: Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Poster/Demo & Questions …

Please come to our poster/demo :-)

Preview release of Kepler/pPOD available
http://daks.ucdavis.edu/kepler-ppod

Ongoing and future work
• Adding more actors for phylogenetic analyses
• Extending with “project histories”
• Incremental query support
• Integrate with AToL Core Data Model