Post on 21-Dec-2015
1
Yolanda Gil, PhD
Information Sciences Institute and
Department of Computer Science
University of Southern California
gil@isi.edu
http://www.isi.edu/~gil
Scientific Reproducibility through Semantic Workflows and
Shared Provenance Representations
2
NSF Workshop on Challenges of Scientific Workflows [Gil et al IEEE Computer 2007]
Despite investments in CyberInfrastructure as an enabler of a significant paradigm change in science:
• Reproducibility, key to the scientific method, is threatened
• Exponential growth in compute, sensors, data storage, and network, BUT growth of science is not similarly exponential
What is missing:
• Perceived importance of capturing and sharing process in accelerating pace of scientific advances
• Process (method/protocol) is increasingly complex and highly distributed
Workflows are emerging as a paradigm for process-model driven science that captures the analysis itself
Workflows need to be first class citizens in science CyberInfrastructure
• Enable reproducibility
• Accelerate scientific progress by automating processes
Interdisciplinary and intradisciplinary research challenges
Report available at http://www.isi.edu/nsf-workflows06
3
Benefits of Workflow Systems [Taylor et al 07]
Managing execution
• Remote job submission
• Dependencies among steps
• Failure recovery
Managing distributed computation
• Move data when needed
Managing large data sets
• Efficiency, reliability
Security and access control
• Access to shared resources
Provenance recording
• Low-cost high-fidelity reproducibility
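The execution-management items above (dependencies among steps, failure recovery) can be illustrated with a minimal sketch. This is not how Pegasus or DAGMan is implemented; the function name `run_workflow`, the retry policy, and the step names are assumptions made for illustration only.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, deps, max_retries=2):
    """Run steps in dependency order, retrying a failed step a few times.

    steps: dict mapping step name -> zero-argument callable
    deps:  dict mapping step name -> set of step names it depends on
           (every step should appear as a key, possibly with an empty set)
    """
    # static_order() yields each step only after all of its dependencies.
    order = list(TopologicalSorter(deps).static_order())
    completed = []
    for name in order:
        for attempt in range(max_retries + 1):
            try:
                steps[name]()
                completed.append(name)
                break
            except RuntimeError:
                # Simple failure recovery: retry, then give up and re-raise.
                if attempt == max_retries:
                    raise
    return completed


# Hypothetical three-step workflow whose middle step fails once.
attempts = {"simulate": 0}


def flaky_simulate():
    attempts["simulate"] += 1
    if attempts["simulate"] == 1:
        raise RuntimeError("transient failure")


steps = {
    "stage_in": lambda: None,
    "simulate": flaky_simulate,
    "stage_out": lambda: None,
}
deps = {"stage_in": set(), "simulate": {"stage_in"}, "stage_out": {"simulate"}}
completed = run_workflow(steps, deps)
```

A real workflow system would also handle remote submission, data movement, and persistent state for restarts; the sketch only shows ordering and retry.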
4
Capabilities Available Today: Wings/Pegasus Workflows for Seismic Hazard Analysis [Gil et al 07] (see also [Maechlin et al 05] [Deelman et al 06])
Input data: a site and an earthquake forecast model
• thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed
• ~110,000 rupture variations to be simulated for that site
High-level template combines 11 application codes
8048 application nodes in the workflow instance generated by Wings
Provenance records kept for 100,000 workflow data products
• Generated more than 2M triples of metadata
24,135 nodes in the executable workflow generated by Pegasus, including:
• data stage-in jobs, data stage-out jobs, data registration jobs
Executed on the USC HPCC cluster (1820 nodes w/ dual processors, but only < 144 available)
• Including MPI jobs, each runs on hundreds of processors for 25-33 hours
• Runtime was 1.9 CPU years
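The provenance figures on this slide (100,000 data products, more than 2M triples) work out to roughly 20 triples per product. As a rough sketch of what recording provenance as triples can look like, the following builds subject-predicate-object tuples for one data product. The predicate names (`ex:generatedBy`, `ex:derivedFrom`) and the identifiers are hypothetical and are not the vocabulary used by Wings.

```python
def provenance_triples(product_id, generated_by, inputs):
    """Return (subject, predicate, object) triples describing one data product.

    The predicates are placeholders, not an actual Wings vocabulary.
    """
    triples = [(product_id, "ex:generatedBy", generated_by)]
    triples += [(product_id, "ex:derivedFrom", source) for source in inputs]
    return triples


# Hypothetical data product derived from two input files.
triples = provenance_triples(
    "hazard_curve_001",
    "ex:curveCalculator",
    ["rupture_variation_17", "site_model_A"],
)
```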
5
The Wings/Pegasus Workflow System [Gil et al 07; Deelman et al 03; Deelman et al 05; Kim et al 08; Gil et al forthcoming]
Grid services
condor.uwisc.edu
www.globus.org
Pegasus: Automated workflow refinement and execution
pegasus.isi.edu
WINGS: Semantic workflow environment
wings.isi.edu
• Knowledge-based reasoning on workflows and data (W3C’s OWL)
• Semantic workflow catalogs
• Automation and assistance
• Execution-independent workflows
• Optimize for performance, cost, reliability
• Assign execution resources
• Manage execution through DAGMan
• Daily operational use in many domains
• Secure and controlled sharing of distributed services, computing, data
• Scalable service-oriented architecture
• Commercial quality, open source
IBM
6
Semantic Workflows in WINGS [Gil et al IEEE IS 2010; Gil et al JETAI 2010; Gil et al eScience 2009; Kim et al JCCPE 2008; Gil et al 2007]
Semantic workflows:
• More than a dataflow graph
• Workflow variables: each constituent (node, link, component, dataset) has a corresponding variable
• Semantic constraints on workflow variables, both within and across variables
• Semantic descriptions of collections of data and components are concisely represented
[modelerInput_not_equal_to_classifierInput:
  (:modelerInput wflow:hasDataBinding ?ds1)
  (:classifierInput wflow:hasDataBinding ?ds2)
  equal(?ds1, ?ds2)
  (?t rdf:type wflow:WorkflowTemplate)
  -> (?t wflow:isInvalid "true"^^xsd:boolean)]
(TestData dcdom:isDiscrete false)
(TrainingData dcdom:isDiscrete false)
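The rule above marks a workflow template as invalid when the modeler input and the classifier input are bound to the same dataset. A minimal Python sketch of the same check follows; it is an illustration of the logic only, not the WINGS rule engine, and the function name and the dictionary representation of bindings are assumptions.

```python
def template_is_invalid(bindings):
    """Return True when the modeler and classifier inputs share a dataset.

    bindings: dict mapping a workflow variable name to its bound dataset id.
    An unbound variable (missing key) never triggers the constraint.
    """
    ds1 = bindings.get("modelerInput")
    ds2 = bindings.get("classifierInput")
    return ds1 is not None and ds1 == ds2


# Hypothetical bindings: training and testing on the same data is rejected.
same = template_is_invalid({"modelerInput": "ds_A", "classifierInput": "ds_A"})
different = template_is_invalid({"modelerInput": "ds_A", "classifierInput": "ds_B"})
```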
7
Workflow Portal for Genetic Studies of Mental Disorders (with E. Deelman and C. Mason)
Existing repository of genotypic and phenotypic information
Goal: develop workflows useful for data in the repository
8
Designing a Workflow Collection for Population Genomics
Designed workflows for common analysis types
• Association tests
• CNV detection
• Variant discovery
• Family-based association analysis (TDT)
Developed workflow components by encapsulating widely-used heterogeneous open software
• Plink (Purcell, Harvard)
• R (Chambers et al)
• PennCNV (Penn) -- Hidden Markov Models
• Gnosis (State, Yale) -- sliding windows
• Allegro (Decode, Iceland) -- Multiterminal Binary Decision Diagrams
• Structure (Pritchard, Chicago) -- structured association
• FastLink (Schaffer, NCBI)
• Burrows-Wheeler Aligner (BWA) (Li & Durbin)
• SAMTools
9
Wings Workflows for Genetic Studies of Mental Disorders [Gil et al, forthcoming]
CNV Detection
Variant Discovery from Resequencing
Transmission Disequilibrium Test (TDT)
Association Tests
10
Major Features
Workflow system manages set up and execution
• Wings – set up
• Pegasus – execution
Initial collection of workflows captures common genomic analyses
Users can upload their own datasets
• Including collections of datasets
User data is secure
• Not accessible by others
11
Wings Replication of Crohn’s Disease Association Study from [Duerr et al, Science 06]
12
Wings Replication of Early-Onset Parkinson’s Disease Study from [Bayrakli et al, Human Mutation 07]
13
Observations about Reproducibility with Workflows [Gil et al, forthcoming]
Effort involved in reproducing results is minor
• 30 seconds to set up a workflow
A catalog of carefully crafted workflows of select state-of-the-art methods will cover a wide range of genomic analyses
• Our workflows were independently developed and used “as is”
Semantic representations abstract the analysis method from the software that implements it
• Our workflows used different analytic tools than the original studies
• Many implementations of same algorithm, some proprietary
Semantic constraints can be added to workflows to avoid analysis errors
• E.g.: in association analysis workflow, added constraint to remove duplicate individuals initially to avoid problems downstream
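The duplicate-removal constraint mentioned in the last bullet can be sketched as a pre-processing step. The slides do not say how duplicates were identified, so this sketch assumes each record carries an `individual_id` field and keeps the first occurrence; both the field name and the policy are assumptions.

```python
def drop_duplicate_individuals(records, key="individual_id"):
    """Keep the first record seen for each individual, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


# Hypothetical genotype records with one repeated individual.
records = [
    {"individual_id": "I1", "genotype": "AA"},
    {"individual_id": "I2", "genotype": "AG"},
    {"individual_id": "I1", "genotype": "AA"},
]
unique_records = drop_duplicate_individuals(records)
```

Running the check at the start of the workflow means later steps can assume each individual appears only once.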
14
Benefits of Semantic Workflows [Gil JSP-09]
Execution management:
• Automation of workflow execution
• Managing distributed computation
• Managing large data sets
• Security and access control
• Provenance recording
• Low-cost high-fidelity reproducibility
Semantics and reasoning:
• User assistance to correctly explore analysis “design space”
• Validation of analyses
• Automated generation of metadata
• Workflow retrieval and discovery
• “Conceptual” reproducibility
15
W3C Provenance Group (Y. Gil, chair): Goals
Provide state-of-the-art understanding and develop a roadmap for development and possible standardization
Articulate requirements for accessing and reasoning about provenance information
• Develop use cases
Identify issues in provenance that are of direct concern to the Semantic Web
• Articulate relationships with other aspects of Web architecture
Report on state-of-the-art work on provenance
Report on a roadmap for provenance in the Semantic Web
• Identify starting points for provenance representations
• Identify elements of a provenance architecture that would benefit from standardization
16
W3C Provenance Group: Products of the Group to Date
Group formed in September 2009, open to new members
• All information is public: http://www.w3.org/2005/Incubator/prov/wiki/
Developed a set of key dimensions for provenance (11/09)
• Grouped into three major categories: content, management, use
Developed use cases for provenance (12/09)
• More than 30 use cases, including ~10 in science but others are relevant
Developed requirements for provenance from use cases (1/10)
• User requirements: what is the purpose of the provenance information
• Technical requirements: derived from the user requirements
Report on “Requirements for Provenance on the Web”
Currently developing state-of-the-art report (expected 6/10)
Started to develop recommendations (expected 9/10)
• Mappings across provenance vocabularies (e.g.: DC, OPM, SWAN,…)