Yolanda Gil ([email protected])
USC Information Sciences Institute
February 4, 2010
Metadata Meets Semantic Workflows
Yolanda Gil, PhD
Information Sciences Institute and Department of Computer Science
University of Southern California
http://www.isi.edu/~gil
With Ewa Deelman, Jihie Kim, Varun Ratanakar, Christian Fritz,
Paul Groth, Gonzalo Florez, Pedro Gonzalez, Joshua Moody
Outline
Brief introduction to computational workflows
Brief overview of semantic workflows
• The Wings/Pegasus workflow system
Five benefits of semantic workflows
• Reproducibility
• Validation
• Metadata generation
• Data discovery
• Workflow discovery
Scientific Data Analysis
Complex processes involving a variety of algorithms/software
NSF Workshop on Challenges of Scientific Workflows [Gil et al, IEEE Computer 2007]
Despite investments in CyberInfrastructure as an enabler of a significant paradigm change in science:
• Reproducibility, key to the scientific method, is threatened
• Exponential growth in compute, sensors, data storage, and networks, BUT the growth of science is not similarly exponential
What is missing:
• Perceived importance of capturing and sharing process in accelerating the pace of scientific advances
• Process (method/protocol) is increasingly complex and highly distributed
Workflows are emerging as a paradigm for process-model driven science that captures the analysis itself
Workflows need to be first-class citizens in science CyberInfrastructure
• Enable reproducibility
• Accelerate scientific progress by automating processes
Interdisciplinary and intradisciplinary research challenges
Report available at http://www.isi.edu/nsf-workflows06
Benefits of Workflow Systems [Taylor et al 07]
Managing execution
• Dependencies among steps
• Failure recovery
Managing distributed computation
• Move data when needed
Managing large data sets
• Efficiency, reliability
Security and access control
• Remote job submission
Provenance recording
• Low-cost high-fidelity reproducibility
Wings/Pegasus Workflows for Seismic Hazard Analysis [Gil et al 07] (see also [Maechlin et al 05] [Deelman et al 06])
Input data: a site and an earthquake forecast model
• Thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed
• ~110,000 rupture variations to be simulated for that site
High-level template combines 11 application codes
8048 application nodes in the workflow instance generated by Wings
Provenance records kept for 100,000 workflow data products
• Generated more than 2M triples of metadata
24,135 nodes in the executable workflow generated by Pegasus, including:
• Data stage-in jobs, data stage-out jobs, data registration jobs
Executed in USC HPCC cluster (1820 nodes w/ dual processors) but only < 144 available
• Including MPI jobs, each runs on hundreds of processors for 25-33 hours
• Runtime was 1.9 CPU years
Semantic Workflows in WINGS
Workflow templates
• Dataflow diagram
• Each constituent (node, link, component, dataset) has a corresponding variable
Semantic properties
• Constraints on workflow variables
(TestData dcdom:isDiscrete false)
(TrainingData dcdom:isDiscrete false)
Semantic Constraints as Metadata Properties
Constraints on reusable template (shown below)
Constraints on current user request (shown above)
[modelerInput_not_equal_to_classifierInput:
  (:modelerInput wflow:hasDataBinding ?ds1)
  (:classifierInput wflow:hasDataBinding ?ds2)
  equal(?ds1, ?ds2)
  (?t rdf:type wflow:WorkflowTemplate)
  -> (?t wflow:isInvalid "true"^^xsd:boolean)]
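Read procedurally, the rule above marks a workflow template as invalid when the modeler input and the classifier input are bound to the same dataset. The following is a minimal Python sketch of that check, not how Wings evaluates rules. The function name `template_is_invalid`, the dictionary layout, and the dataset file names are all assumptions made for illustration.

```python
def template_is_invalid(bindings):
    """Mirror the rule: invalid when modelerInput and classifierInput
    are bound to the same dataset.

    `bindings` maps a workflow variable name to the dataset bound to it
    (a hypothetical stand-in for wflow:hasDataBinding).
    """
    ds1 = bindings.get("modelerInput")
    ds2 = bindings.get("classifierInput")
    # The rule fires only when both variables are bound and the bindings are equal.
    return ds1 is not None and ds2 is not None and ds1 == ds2


if __name__ == "__main__":
    # Hypothetical dataset names, used only to exercise the check.
    print(template_is_invalid({"modelerInput": "a.data", "classifierInput": "a.data"}))
    print(template_is_invalid({"modelerInput": "a.data", "classifierInput": "b.data"}))
```

The first call reports an invalid template (same dataset used for training and testing), the second a valid one.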
Why Semantic Workflows: 1) Easily Replicate Previously Published Results
A catalog of carefully crafted workflows of select state-of-the-art methods to cover a wide range of common analyses
• Many implementations of same algorithm, some proprietary
• Same implementation but new versions and bug fixes
Semantic workflows abstract from software implementation
• Representing abstract classes of software components
– Instances are the implemented codes
– Workflow steps refer to component classes
• Representing abstract kinds of data (e.g., exclude format)
Semantic reasoning needed to specialize workflow
• To map the abstract workflow into an execution-ready workflow
• To insert lower level steps (e.g., data transformations)
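The specialization step described above can be sketched in Python under strong simplifying assumptions. The catalog contents, the implementation names (`proprietary_assoc`, `open_assoc`), the format names, and the `convert_X_to_Y` step naming are all hypothetical. The sketch also assumes a linear workflow in which a step does not change the data format. It illustrates the idea and does not reproduce the Wings algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implementation:
    name: str
    input_format: str
    available: bool = True


# Hypothetical component catalog: abstract component class -> implemented codes.
CATALOG = {
    "AssociationTest": [
        Implementation("proprietary_assoc", "ped", available=False),
        Implementation("open_assoc", "bed"),
    ],
}


def specialize(abstract_steps, data_format, catalog=CATALOG):
    """Map abstract steps to executable steps, inserting format conversions."""
    plan = []
    for step in abstract_steps:
        # Pick the first implementation that can actually be run.
        impl = next((i for i in catalog[step] if i.available), None)
        if impl is None:
            raise LookupError(f"no available implementation for {step}")
        # Insert a lower-level data transformation when formats differ.
        if impl.input_format != data_format:
            plan.append(f"convert_{data_format}_to_{impl.input_format}")
            data_format = impl.input_format
        plan.append(impl.name)
    return plan


if __name__ == "__main__":
    print(specialize(["AssociationTest"], "ped"))
    # ['convert_ped_to_bed', 'open_assoc']
```

Because the workflow step names only the component class, swapping the unavailable proprietary code for another implementation changes the executable plan but not the workflow itself.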
The Importance of Reproducibility
Difficulties in Replication
Some software is proprietary
Effort must be invested in data conversions
Software installation
Managing new versions
Wings Workflows for Genetic Studies of Mental Disorders [Gil et al, forthcoming]
Work with Christopher Mason from Cornell University
CNV Detection
Variant Discovery from Resequencing
Transmission Disequilibrium Test (TDT)
Association Tests
Wings Replication of Crohn’s Disease Association Study from [Duerr et al, Science 06]
[Figure labels: data sizes 10 MB, 2.4 GB, 152 MB, 32 MB]
Running time: 20.5 hrs
Wings Replication of Early-Onset Parkinson’s Disease Study from [Bayrakli et al, Human Mutation 07]
Observations [Gil et al, forthcoming]
Effort involved in reproducing results is minor
• 30 seconds to set up a workflow
A catalog of carefully crafted workflows of select state-of-the-art methods will cover a wide range of genomic analyses
• Our workflows were independently developed and used “as is”
Semantic representations abstract the analysis method from the software that implements it
• Our workflows used different analytic tools than the original studies
Semantic constraints can be added to workflows to avoid analysis errors
• Our workflow removes duplicate individuals that would cause problems in the association analysis
Why Semantic Workflows: 2) Ensure Correct Use of State-of-the-Art Methods
Analytic software and methods are well documented but all is text (papers, manuals, etc)
• Time consuming, hard to spot interdependencies, no validation
Semantic workflows can check constraints and guide users
• Representing requirements of software components
– Constraints on input data
– Constraints on parameter settings given properties of input data
• Representing metadata properties of datasets
Semantic reasoning needed:
• To check constraints of each workflow step
• To propagate constraints across the workflow
User’s Difficulties: Choosing Parameters
How do I set up the workflow parameters?
Association Test
Max individuals per cluster (“mc”) and merge distance p-value constraint (“ppc”)
Max Population
If Affymetrix data, set cutoff (“miss”) to 94%; if Illumina, 98%
Wings Workflow System Assists Users to Set Up Parameters Based on Characteristics of Datasets
PEDFile Data:
• genotype95.ped
• hapmap1.ped
• test.ped
Data Catalog
Component Catalog
[MissingnessPerIndividual1:
  (?c rdf:type pcdom:Create_Binary_PEDFile_Class)
  (?c pc:hasInput ?idv1)
  (?idv1 pc:hasArgumentID "PEDFile")
  (?c pc:hasInput ?idv2)
  (?idv2 pc:hasArgumentID "MissingnessPerIndividual")
  (?idv1 dcdom:hasGenotypingRate ?v1)
  equal(?v1, "0.95"^^xsd:float)
  -> (?idv2 pc:hasValue "0.06"^^xsd:float)]
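The effect of the MissingnessPerIndividual1 rule can be illustrated with a small Python sketch: when the PED file's genotyping rate is 0.95, the MissingnessPerIndividual parameter is set to 0.06. The dictionary-based data catalog and the function name `suggest_missingness` are assumptions. The genotyping rate recorded for `genotype95.ped` is also assumed for illustration, since the slide does not state it.

```python
# Hypothetical data catalog: file name -> metadata properties.
DATA_CATALOG = {
    "genotype95.ped": {"hasGenotypingRate": 0.95},  # assumed value, for illustration
    "test.ped": {},  # no genotyping rate recorded
}


def suggest_missingness(ped_file, catalog=DATA_CATALOG):
    """Return the parameter value the rule would set, or None if it does not fire."""
    rate = catalog.get(ped_file, {}).get("hasGenotypingRate")
    # Same condition and consequent as the rule: rate equal to 0.95 -> value 0.06.
    if rate == 0.95:
        return 0.06
    return None


if __name__ == "__main__":
    print(suggest_missingness("genotype95.ped"))  # 0.06
    print(suggest_missingness("test.ped"))        # None
```

When the metadata needed by the rule is missing, the sketch returns `None` rather than guessing a value, so the user still has to choose the parameter.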
Why Semantic Workflows: 3) Automatic Generation of Metadata
Metadata annotations are tedious and involved
• Often not done, an obstacle to sharing and to reuse
Semantic workflows can automate the generation of metadata for analysis data products
• Representing expected characteristics of output dataset for each software component given the input metadata
• Representing metadata properties of input datasets
Semantic reasoning needed:
• To propagate metadata for each workflow step
• To propagate metadata across the workflow
Wings Metadata Generation: An Example in a Seismic Hazard Workflow [Kim et al 06; Gil et al 07]
SeismogramGeneration
RVM
127_6.rvm
- source_id: 127
- rupture_id: 6
Rupture_variation
127_6.txt.variation-s0000-h0000
- source_id: 127
- rupture_id: 6
- slip_realization_#: 0
- hypo_center_#: 1
127_6.txt.variation-s0000-h0001
- source_id: 127
- rupture_id: 6
- slip_realization_#: 0
- hypo_center_#: 1
SGT
FD_SGT/PAS_1/A/SGT161
- site_name: PAS
- tensor_direction: 1
- time_period: A
- xyz_volumn_id: 161
Seismogram
Seismogram_PAS_127_6.grm
- site_name: PAS
- source_id: 127
- rupture_id: 6
… …
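The example above suggests how output metadata can be derived from input metadata: the seismogram carries the site name of the SGT input and the source and rupture identifiers of the rupture variation input. The following Python sketch makes that reading explicit. The propagation rule is inferred from the example values rather than taken from the Wings component catalog, and the function names are hypothetical.

```python
def seismogram_metadata(sgt_meta, variation_meta):
    """Derive seismogram metadata from the metadata of its two inputs."""
    return {
        "site_name": sgt_meta["site_name"],          # copied from the SGT input
        "source_id": variation_meta["source_id"],    # copied from the rupture variation
        "rupture_id": variation_meta["rupture_id"],  # copied from the rupture variation
    }


def seismogram_name(meta):
    """Build a file name from the generated metadata, following the example's pattern."""
    return "Seismogram_{site_name}_{source_id}_{rupture_id}.grm".format(**meta)


if __name__ == "__main__":
    sgt = {"site_name": "PAS", "tensor_direction": 1, "time_period": "A"}
    variation = {"source_id": 127, "rupture_id": 6}
    meta = seismogram_metadata(sgt, variation)
    print(meta)
    print(seismogram_name(meta))  # Seismogram_PAS_127_6.grm
```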
Wings Workflows for Accuracy/Quality Tradeoffs in Biomedical Image Analysis [Kumar et al 09]
PIQ: Pixel Intensity Quantification (from National Center for Microscopy and Imaging Research [Chow et al 06])
• Terabyte-sized out-of-core image data
• Need to minimize execution time while preserving highest output quality
• Some operations are parallelizable, others must operate on entire images
For efficiency, image decomposed (layers, tiles, and chunks) but quality is affected
From a workflow template, Wings can automatically generate descriptions of each individual piece of the image to manage the computations over each one
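As an illustration of generating a description for each piece of a decomposed image, the Python sketch below enumerates fixed-size tiles and records a small metadata record for each. It covers tiles only, not layers or chunks. The field names, the function name `describe_tiles`, and the tiling scheme are assumptions and are not taken from the PIQ workflow.

```python
def describe_tiles(image_id, width, height, tile_size):
    """Return one metadata record per tile of a width x height image."""
    tiles = []
    for row, y in enumerate(range(0, height, tile_size)):
        for col, x in enumerate(range(0, width, tile_size)):
            tiles.append({
                "image_id": image_id,
                "tile": (row, col),
                "x": x,
                "y": y,
                # Edge tiles may be smaller than tile_size.
                "width": min(tile_size, width - x),
                "height": min(tile_size, height - y),
            })
    return tiles


if __name__ == "__main__":
    for record in describe_tiles("img", 250, 100, 100):
        print(record)
```

Each record can then be attached to the computation that processes the corresponding tile.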
Why Semantic Workflows: 4) Discovery of Relevant Data
Need a dataset of updated common (known) loci to annotate findings, where can I find one?
Why Semantic Workflows: 5) Retrieval of Workflows
Hard to find workflows for the type of analysis a user wants
• Semantic information is not provided when creating the workflow
– e.g., when user adds a NaiveBayesModeler, he wouldn’t be expected to define that the output of this would be a NaiveBayesModel or a Bayes Model (superclass) or not human readable
• However, retrieval queries are often based on metadata properties of data
– e.g., “Find workflows that can normalize data which is continuous and has missing values [<- constraints on inputs] to create a decision tree model [<- constraint on intermediate data products]”
Semantic representations are needed
• For workflow constituents
– Metadata properties of input, intermediate and final data products
– Metadata properties of workflow and component function
• For user queries
– Express workflow sketches containing partial data descriptions (constraints)
Reasoning capabilities
• Automatic creation of metadata for expected workflow data products
• Workflow matching to queries (exact and partial)
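Exact and partial matching of a query against workflow descriptions can be sketched as follows. This is a simplified Python illustration, not the matching procedure used in Wings. The workflow names, property names, and the scoring scheme (fraction of query constraints satisfied) are all assumptions.

```python
# Hypothetical library: workflow name -> metadata properties of its data products.
WORKFLOW_LIBRARY = {
    "normalize_then_decision_tree": {
        "input": {"isContinuous": True, "hasMissingValues": True},
        "intermediate": {"type": "DecisionTreeModel"},
    },
    "discretize_then_naive_bayes": {
        "input": {"isContinuous": True, "hasMissingValues": False},
        "intermediate": {"type": "NaiveBayesModel"},
    },
}


def match_score(query, description):
    """Fraction of the query's constraints satisfied by a workflow description."""
    total = satisfied = 0
    for role, constraints in query.items():
        for prop, value in constraints.items():
            total += 1
            if description.get(role, {}).get(prop) == value:
                satisfied += 1
    return satisfied / total if total else 0.0


def retrieve(query, library=WORKFLOW_LIBRARY):
    """Return (score, name) pairs with a nonzero score, best match first."""
    scored = [(match_score(query, desc), name) for name, desc in library.items()]
    return sorted([(s, n) for s, n in scored if s > 0], reverse=True)


if __name__ == "__main__":
    query = {
        "input": {"isContinuous": True, "hasMissingValues": True},
        "intermediate": {"type": "DecisionTreeModel"},
    }
    for score, name in retrieve(query):
        print(f"{score:.2f} {name}")
```

A score of 1.0 corresponds to an exact match; lower scores are partial matches that satisfy only some of the query's constraints.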
User’s Difficulties: Choosing an Analysis
What type of analysis is appropriate for my data?
CNV Detection
Variant Discovery from Resequencing
Transmission Disequilibrium Test (TDT)
Association Test
TDT analysis requires no less than 100 families
Variant discovery is used for genomic data from the same individual
Association tests are best for large datasets that are not within a family
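The guidelines above can be written as machine-checkable conditions on dataset metadata. The Python sketch below is one possible encoding. The property names (`num_families`, `same_individual`, `is_large`, `within_family`) are hypothetical, and "large" is left as a flag because the slide gives no threshold.

```python
def suitable_analyses(dataset):
    """Return the analyses whose stated requirements the dataset metadata satisfies."""
    analyses = []
    # TDT analysis requires no less than 100 families.
    if dataset.get("num_families", 0) >= 100:
        analyses.append("Transmission Disequilibrium Test (TDT)")
    # Variant discovery is used for genomic data from the same individual.
    if dataset.get("same_individual", False):
        analyses.append("Variant Discovery from Resequencing")
    # Association tests are best for large datasets that are not within a family.
    if dataset.get("is_large", False) and not dataset.get("within_family", False):
        analyses.append("Association Test")
    return analyses


if __name__ == "__main__":
    print(suitable_analyses({"num_families": 120}))
    print(suitable_analyses({"is_large": True, "within_family": False}))
```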
User’s Difficulties: Choosing a Workflow
What workflow is appropriate for my goals?
Transmission Disequilibrium Test (TDT)
Association Test
Applies population stratification to remove outliers
Assumes outliers have been removed
Uses structured association
Uses a standard test
Incorporates parental phenotype information
Uses CMH association
An Algorithm for Semantic Enrichment of Workflow Templates [Gil et al K-CAP 09]
?Model5 dcdom:isDiscrete true
?Model6 dcdom:isDiscrete true
?Model7 dcdom:isDiscrete true
?TestData dcdom:isDiscrete true
?Dataset4 dcdom:isDiscrete true
?Dataset3 dcdom:isDiscrete true
?TrainingData dcdom:isDiscrete true
Model5 Model6 Model7
Problem Addressed: Semantic information is not provided when creating the workflow, but retrieval queries use it
Key idea: Constraints can be available in a component catalog and propagated through the workflow
Phase 1: Goal Regression
• Starting from final products, traverse workflow backwards
• For each node, query component catalog for metadata constraints on inputs
Phase 2: Forward Projection
• Starting from input datasets, traverse workflow forwards
• For each node, query component catalog for metadata constraints on outputs
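The two phases can be sketched in Python for a linear workflow in which each step has one input and one output. This is a simplified illustration, not the published algorithm. The catalog entries, the `preserves` field (properties a component passes through unchanged), and the variable names are assumptions. The sketch also treats a constraint as an established property once derived.

```python
# Hypothetical component catalog: constraints on inputs/outputs, plus the
# properties each component carries through from input to output.
CATALOG = {
    "Sampler": {"inputs": {}, "outputs": {}, "preserves": ["isDiscrete"]},
    "ID3Modeler": {"inputs": {"isDiscrete": True}, "outputs": {}, "preserves": []},
}

# Steps in dataflow order: (component, input variable, output variable).
WORKFLOW = [
    ("Sampler", "Dataset", "TrainingData"),
    ("ID3Modeler", "TrainingData", "Model"),
]


def enrich(workflow, catalog):
    """Attach metadata constraints to every workflow variable."""
    known = {}

    def note(var, props):
        known.setdefault(var, {}).update(props)

    def carried(entry, var):
        # Properties already known for `var` that the component preserves.
        return {p: v for p, v in known.get(var, {}).items() if p in entry["preserves"]}

    # Phase 1: goal regression, from final products backwards.
    for component, in_var, out_var in reversed(workflow):
        entry = catalog[component]
        note(in_var, entry["inputs"])
        note(in_var, carried(entry, out_var))

    # Phase 2: forward projection, from input datasets forwards.
    for component, in_var, out_var in workflow:
        entry = catalog[component]
        note(out_var, entry["outputs"])
        note(out_var, carried(entry, in_var))

    return known


if __name__ == "__main__":
    for variable, props in enrich(WORKFLOW, CATALOG).items():
        print(variable, props)
```

In this example the modeler's requirement for discrete input is regressed through the sampler, so both `TrainingData` and `Dataset` end up annotated with `isDiscrete: True` even though the workflow author never stated it.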
Conclusions: Benefits of Semantic Workflows [Gil JSP-09]
Execution management: Automation of workflow execution
Managing distributed computation
Managing large data sets
Security and access control
Provenance recording: low-cost, high-fidelity reproducibility
Semantics and reasoning:
“Conceptual” reproducibility
User assistance to explore analysis “design space”
Validation of analyses
Automated generation of metadata
Workflow retrieval and discovery