
Towards Semantic Typing Support for Scientific Workflows

Bertram Ludäscher

Knowledge-Based Information Systems Lab
San Diego Supercomputer Center
University of California, San Diego

http://seek.ecoinformatics.org http://www.geongrid.org

B. Ludäscher – Scientific Data Management 2

Outline

1. Motivation: Traditional vs Scientific Data Integration

2. Semantic (a.k.a. Model-Based) Mediation

3. Scientific Workflows (a.k.a. Analysis Pipelines)

4. DB Theory Appetizer: Web Service Composition Through Declarative Queries

B. Ludäscher – Scientific Data Management 3

Information Integration Challenges

• System aspects: “Grid” Middleware
  – distributed data & computing
  – Web Services, WSDL/SOAP, OGSA, …
  – sources = functions, files, data sets, …

• Syntax & Structure: (XML-Based) Data Mediators
  – wrapping, restructuring
  – (XML) queries and views
  – sources = (XML) databases

• Semantics: Model-Based/Semantic Mediators
  – conceptual models and declarative views
  – knowledge representation: ontologies, description logics (RDF(S), OWL, …)
  – sources = knowledge bases (DB + CMs + ICs)

[Figure: reconciling the S4 heterogeneities – system aspects, syntax, structure, semantics; “gluing” together resources; bridging information and knowledge gaps computationally]

B. Ludäscher – Scientific Data Management 4

Information Integration from a DB Perspective

• Information Integration Problem
  – Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
  – Find: the answers to Q1, ..., Qn

• The Database Perspective: source = “database”
  – Si has a schema (relational, XML, OO, ...)
  – Si can be queried
  – define a virtual (or materialized) integrated/global view G over S1, ..., Sk using database query languages (SQL, XQuery, ...)
  – questions become queries Qi against G(S1, ..., Sk)
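As a toy illustration of this “database perspective”, here is a minimal sketch in the Datalog/Prolog notation used later in these slides. All predicate and source names are invented for illustration: a global view G is defined over two sources, and the user question is posed against G only.

    % Hypothetical sources (stand-ins for wrapped databases).
    % Source S1 (one state survey): unit id, geologic age, rock type.
    s1_unit(az01, paleozoic, granite).
    % Source S2 (another survey, different attribute order).
    s2_polygon(basalt, nv07, cenozoic).

    % Integrated global view G over S1 and S2.
    g_unit(Id, Age, Rock) :- s1_unit(Id, Age, Rock).
    g_unit(Id, Age, Rock) :- s2_polygon(Rock, Id, Age).

    % A user question becomes a query against G only, e.g.
    % ?- g_unit(Id, paleozoic, Rock).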

B. Ludäscher – Scientific Data Management 5

Standard (XML-Based) Mediator Architecture

[Figure: USER/Client on top; MEDIATOR holding an integrated global (XML) view G with an integrated view definition G(..) ← S1(..), …, Sk(..); sources S1, S2, …, Sk, each behind a wrapper exporting an (XML) view; web services as wrapper APIs]

Query flow:
1. Query Q ( G(S1, ..., Sk) )
3. Q1, Q2, Q3 (sent to the wrappers)
4. {answers(Q1)}, {answers(Q2)}, {answers(Q3)}
6. {answers(Q)}

B. Ludäscher – Scientific Data Management 6

Query Planning for Mediators

• Given:
  – user query Q: answer(…) ← … G …
  – … & { G ← … S … } global-as-view (GAV)
  – … & { S ← … G … } local-as-view (LAV)
  – … & { false ← … S … G … } integrity constraints (ICs)
• Find:
  – an equivalent (or minimally containing, maximally contained) query plan Q’: answer(…) ← … S …
• Results:
  – a variety of results/algorithms, depending on the classes of queries, views, and ICs: P, NP, …, undecidable
  – many variants still open
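A minimal sketch of the easy (GAV) case, reusing the hypothetical source relations from the earlier sketch: a query over the global view G is rewritten into a plan over the sources simply by unfolding the view definition. LAV rewriting (sources described as views over G) and reasoning with ICs need genuine rewriting algorithms, which is where the complexity results above come from.

    % GAV view definitions (global relation defined over the sources).
    g_unit(Id, Age, Rock) :- s1_unit(Id, Age, Rock).
    g_unit(Id, Age, Rock) :- s2_polygon(Rock, Id, Age).

    % User query Q over the global view ...
    q(Id) :- g_unit(Id, paleozoic, _Rock).

    % ... and the unfolded plan Q' that mentions only source relations.
    q_plan(Id) :- s1_unit(Id, paleozoic, _Rock).
    q_plan(Id) :- s2_polygon(_Rock, Id, paleozoic).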

B. Ludäscher – Scientific Data Management 7

From Scientific Data Integration to Process & Application Integration (and back…)

• Data Integration
  – database mediation + knowledge-based extension
  – query rewriting w/ GAV, LAV, ICs, access patterns
• “Process/Application” Integration
  – scientific models (ocean, atmosphere, ecology, …), assimilation models (e.g., real-time data feeds), …
  – data sets
  – legacy tools
  – components = web services
  – applications = composite components (“workflows”)
  – need for semantic type extensions

B. Ludäscher – Scientific Data Management 8

Geologic Map Integration

• Given:
  – geologic maps from different state geological surveys (shapefiles w/ different data schemas)
  – different ontologies:
    • geologic age ontology
    • rock type ontologies:
      – multiple hierarchies (chemical, fabric, texture, genesis) from the Geological Survey of Canada (GSC)
      – single hierarchy from the British Geological Survey (BGS)
• Problem:
  – support uniform queries against the multiple geologic maps using different ontologies
  – support registration w/ ontology A, querying w/ ontology B

B. Ludäscher – Scientific Data Management 9

Ontology Mappings: Motivation

• Establish correspondences between ontologies
  – integrate data sets that are registered to different ontologies
  – query data sets through different ontologies

[Figure: data set 1 and data set 2 are registered to ontology A and ontology B, respectively; ontology mappings between A and B enable queries across both]
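A tiny Prolog sketch of this idea, with all dataset and concept names invented: two datasets registered to terms of different ontologies, and one mapping that lets a query in either vocabulary reach both.

    % Registration of datasets to ontology terms (hypothetical).
    registered(map_canada, gsc_mafic).        % dataset 1 -> ontology A term
    registered(map_uk,     bgs_basic_rock).   % dataset 2 -> ontology B term

    % Ontology mapping between A and B.
    maps_to(bgs_basic_rock, gsc_mafic).

    % Find datasets for a concept, following the mapping in both directions.
    find(Concept, Dataset) :- registered(Dataset, Concept).
    find(Concept, Dataset) :- maps_to(Concept, C), registered(Dataset, C).
    find(Concept, Dataset) :- maps_to(C, Concept), registered(Dataset, C).

    % ?- find(gsc_mafic, D).        % D = map_canada ; D = map_uk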

B. Ludäscher – Scientific Data Management 10

A Multi-Hierarchical Rock Classification Ontology (GSC)

[Figure: the four GSC classification hierarchies – Composition, Genesis, Fabric, Texture]

B. Ludäscher – Scientific Data Management 11

Some enabling operations on “ontology data”

[Figure: Composition hierarchy]

Concept expansion:
• what else to look for when asking for ‘Mafic’

B. Ludäscher – Scientific Data Management 12

Some enabling operations on “ontology data”

[Figure: Composition hierarchy]

Generalization:
• finding data that is “like” X and Y
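Both operations boil down to reachability in the subclass hierarchy. A minimal sketch over an invented fragment of a rock-type hierarchy:

    % Hypothetical fragment of a rock-type hierarchy.
    subclass_of(basalt,  mafic).
    subclass_of(gabbro,  mafic).
    subclass_of(mafic,   igneous).
    subclass_of(granite, felsic).
    subclass_of(felsic,  igneous).

    % Reflexive-transitive closure of the subclass relation.
    is_a(C, C).
    is_a(C, D) :- subclass_of(C, E), is_a(E, D).

    % Concept expansion: everything to also look for when asking for 'mafic'.
    expand(Concept, Subs) :- findall(X, is_a(X, Concept), Subs).
    % ?- expand(mafic, Subs).                 % Subs = [mafic, basalt, gabbro]

    % Generalization: a common ancestor that is "like" both X and Y.
    % ?- is_a(basalt, Z), is_a(granite, Z).   % Z = igneous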

B. Ludäscher – Scientific Data Management 13

Implementation in OWL: Not only “for the machine” …

Geologic Map Integration (Nevada)

[Figure: domain knowledge, knowledge representation, ontologies!?]

GEON Metamorphism Equation:
Geoscientists + Computer Scientists ± Energy → Igneous Geoinformaticists (± a few hundred million years)

B. Ludäscher – Scientific Data Management 16

Geology Workbench: Registering Data to an Ontology
Step 1: Choose Classes

Click on Submission Data set name

Select a shapefile

Choose an ontology class

B. Ludäscher – Scientific Data Management 17

Geology Workbench: Data Registration
Step 2: Choose Columns for Selected Classes

Shapefile attribute columns: AREA, PERIMETER, AZ_1000, AZ_1000_ID, GEO, PERIOD, ABBREV, DESCR, D_SYMBOL, P_SYMBOL

Callout: “It contains information about geologic age”

B. Ludäscher – Scientific Data Management 18

Geology Workbench: Data Registration
Step 3: Resolve Mismatches

Two terms are not matched to any ontology terms

Manually mapping ‘algonkian’ into the ontology

B. Ludäscher – Scientific Data Management 19

Geology Workbench: Ontology-enabled Map Integrator

Click on the name; choose interesting classes

All areas with the age Paleozoic

B. Ludäscher – Scientific Data Management 20

Geology Workbench: Change Ontology

Submit a mapping: an ontology mapping between the British Rock Classification and the Canadian Rock Classification

Switch from the Canadian Rock Classification to the British Rock Classification

Run it → new query interface

B. Ludäscher – Scientific Data Management 22

Ontologies and Data Management

[Figure: data and metadata, as well as design artifacts such as schemas and conceptual models, use concepts from an ontology (explicitly or implicitly)]

• How to define and refine an ontology?
• How to register a dataset to an ontology?

B. Ludäscher – Scientific Data Management 23

Biomedical Informatics Research Network – http://nbirn.net

Refining an Ontology – the logic way – enables “Source Contextualization”

B. Ludäscher – Scientific Data Management 24

Connecting Datasets to Ontologies: “Semantic Registration”

Dataset:

  Date        Site  Transect  SP_Code  Count
  2000-09-08  CARP  1         CRGI     0
  2000-09-08  CARP  4         LOCH     0
  2000-09-08  CARP  7         MUCA     1
  2000-09-22  NAPL  7         LOCH     1
  2000-09-18  NAPL  1         PAPA     5
  2000-09-28  BULL  1         CYOS     57

Ontology (snippet):

  DataCollectionEvent      ⊑ ∃contains.Measurement
  Measurement              ⊑ ∃measureOf.MeasurableItem ⊓ ∃hasContext.MeasurementContext
  MeasurementContext       ⊑ ∃hasTime.DateTime ⊓ ∃hasLocation.Location
  MeasurableItem           ⊑ ∃hasUnit.Unit ⊓ ∃hasValue.UnitValue
  SpeciesCount             ⊑ MeasurableItem ⊓ ∃hasSpecies.Species ⊓ ∃hasUnit.RatioUnit
  SpeciesAbundance         ⊑ Measurement ⊓ ∃measureOf.SpeciesCount
  AbundanceCollectionEvent ⊑ DataCollectionEvent ⊓ ∃contains.SpeciesAbundance
  Location                 ⊑ ∃position.Coordinate
  LTERSite                 ⊑ Location
  SBLTERSite               ⊑ LTERSite ⊓ ∃position.SBLTERCoordinate
  {naples, …}              ⊑ SBLTERSite

How can we “register” the dataset to concepts in the ontology?

B. Ludäscher – Scientific Data Management 25

Purpose of Semantic Registration

Expose “hidden” information:
– What do attributes represent?
– What do specific values represent?
– What conceptual “objects” are in the dataset?

Capture connections between the dataset and ontology to:
– find existing datasets (or parts of datasets) via ontological concepts (discovery)
– enable integration of datasets (mediation)
– generate metadata for new data products (in a pipeline)

B. Ludäscher – Scientific Data Management 26

Semantic Registration Framework

Step 1: The data provider selects relevant ontological concepts (for the dataset)

Step 2: The semantic registration system creates a structural representation based on the chosen concepts (the data provider refines it if needed)

Step 3: The data provider maps the dataset information to the generated structural representation
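A hedged illustration of what these three steps might produce for the first row of the species-count dataset from the previous slide, written as Prolog facts over the ontology concepts. All predicate names are invented for illustration:

    % Row: 2000-09-08  CARP  1  CRGI  0
    abundance_collection_event(ev1).
    contains(ev1, m1).
    species_abundance(m1).
    measure_of(m1, c1).
    species_count(c1).
    has_species(c1, 'CRGI').
    has_unit(c1, ratio_unit).
    has_value(c1, 0).
    has_time(ev1, date(2000, 9, 8)).
    has_location(ev1, carp).      % CARP resolves to an SBLTERSite instance
    sblter_site(carp).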

B. Ludäscher – Scientific Data Management 27

Step 1: Selecting Relevant Concepts

Dataset:

  Date        Site  Transect  SP_Code  Count
  2000-09-08  CARP  1         CRGI     0
  2000-09-08  CARP  4         LOCH     0
  2000-09-08  CARP  7         MUCA     1
  2000-09-22  NAPL  7         LOCH     1
  2000-09-18  NAPL  1         PAPA     5
  2000-09-28  BULL  1         CYOS     57

Concepts from an Ontology:
• DataCollectionEvent
  • AbundanceCollectionEvent
• Measurement
  • Abundance
    • SpeciesAbundance
• MeasurableItem
  • SpeciesCount
• Location
  • LTERSite
    • SBLTERSite
      • naples
• Species
  • …
• MeasurementContext
  • …

B. Ludäscher – Scientific Data Management 28

Step 1: Selecting Relevant Concepts (continued) – same dataset and concept list as on the previous slide.

B. Ludäscher – Scientific Data Management 29

Step 2: Generate Object Model

Generated object model (over the chosen concepts):
  AbundanceCollectionEvent  -contains->  SpeciesAbundance
  SpeciesAbundance  -measureOf->  SpeciesCount
  SpeciesCount:  -hasSpecies->  Species,  -hasUnit->  RatioUnit,  -hasValue->  RatioValue
  AbundanceCollectionEvent:  -hasTime->  DateTime,  -hasLoc->  SBLTERSite

(Concept list as on the previous slides.)

B. Ludäscher – Scientific Data Management 30

B. Ludäscher – Scientific Data Management 31

B. Ludäscher – Scientific Data Management 32

Scientific Workflows

B. Ludäscher – Scientific Data Management 34

Promoter Identification Workflow (PIW)

Source: Matt Coleman (LLNL)

B. Ludäscher – Scientific Data Management 35

Source: NIH BIRN (Jeffrey Grethe, UCSD)

B. Ludäscher – Scientific Data Management 36

Ecology: GARP Analysis Pipeline for Invasive Species Prediction

[GARP analysis pipeline diagram – recoverable labels:
  Inputs (from registered EcoGrid databases): species presence & absence points, native range (a) and invasion area (a); environmental layers, native range (b) and invasion area (b)
  Steps: EcoGrid Query, Layer Integration, Sample Data (+A1, +A2, +A3), Data Calculation, Map Generation, Validation, User, Generate Metadata, Archive To EcoGrid
  Data products: integrated layers, native range and invasion area (c); training sample (d); test sample (d); GARP rule set (e); native range prediction map (f); invasion area prediction map (f); model quality parameter (g); selected prediction maps (h)]

Source: NSF SEEK (Deana Pennington et al., UNM)

B. Ludäscher – Scientific Data Management 37

Scientific Workflows: Some Findings

• More dataflow than (business) workflow
• Need for “programming extensions”
  – iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for data transformations
• Need for rich user interaction & workflow steering:
  – pause / revise / resume
  – select & branch; e.g., web-browser capability at specific steps as part of a coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products ⇒ data provenance (“virtual data” concept)

Our Starting Point: Dataflow Process Networks and Ptolemy II

see! try! read!

Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/

B. Ludäscher – Scientific Data Management 39

Kepler Team, Projects, Sponsors

• Ilkay Altintas (SDM)
• Chad Berkley (SEEK)
• Shawn Bowers (SEEK)
• Jeffrey Grethe (BIRN)
• Christopher H. Brooks (Ptolemy II)
• Zhengang Cheng (SDM)
• Efrat Jaeger (GEON)
• Matt Jones (SEEK)
• Edward A. Lee (Ptolemy II)
• Kai Lin (GEON)
• Ashraf Memon (GEON)
• Bertram Ludaescher (BIRN, GEON, SDM, SEEK)
• Steve Mock (NMI)
• Steve Neuendorffer (Ptolemy II)
• Mladen Vouk (SDM)
• Yang Zhao (Ptolemy II)
• …

Ptolemy II

B. Ludäscher – Scientific Data Management 40

Commercial Workflow/Dataflow Systems

B. Ludäscher – Scientific Data Management 41

SCIRun: Problem Solving Environments for Large-Scale Scientific Computing

• SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations

• Component model, based on generalized dataflow programming

Source: Steve Parker (cs.utah.edu)

B. Ludäscher – Scientific Data Management 42

E-Science and Link-Up Buddies

• … <UPDATE ME> …
  – Taverna, Scufl, Freefluo, …
  – DiscoveryNet
  – Triana
  – ICENI
  – …

B. Ludäscher – Scientific Data Management 43

Dataflow Process Networks:Putting Computation Models first!

• Synchronous Dataflow Network (SDF)
  – statically schedulable single-threaded dataflow
  – can execute multi-threaded, but the firing sequence is known in advance
  – maximally well-behaved, but also limited expressiveness
• Process Network (PN)
  – multi-threaded, dynamically scheduled dataflow
  – more expressive than SDF (dynamic token rates prevent static scheduling)
  – natural streaming model
• Other execution models (“domains”)
  – implemented through different “Directors”

[Figure: two actors with typed i/o ports connected by a FIFO; advanced push/pull communication]

B. Ludäscher – Scientific Data Management 44

Promoter Identification Workflow (PIW)

Source: Matt Coleman (LLNL)

B. Ludäscher – Scientific Data Management 45

Promoter Identification Workflow in Ptolemy II [SSDBM’03]

Execution Semantics

B. Ludäscher – Scientific Data Management 46

[Screenshot callouts:]
• hand-crafted control solution; also: forces sequential execution!
• designed to fit
• hand-crafted web-service actor
• complex backward control-flow
• no data transformations available

B. Ludäscher – Scientific Data Management 47

Simplified Process Network PIW

• Back to a purely functional dataflow process network (= a data streaming model!)
• Re-introducing map(f) to Ptolemy II (was there in PT Classic)
  – no control-flow spaghetti
  – data-intensive apps
  – free concurrent execution
  – free type checking
  – automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]

• map(f)-style iterators
• powerful type checking
• generic, declarative “programming” constructs
• generic data transformation actors
• forward-only, abstractable sub-workflow piw(GeneId)
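A minimal sketch of the “lift the sub-workflow with map” idea, using Prolog's maplist/3 in place of the functional map; piw/2 is only a placeholder for the real promoter-identification steps:

    % Stand-in for the per-gene sub-workflow piw(GeneId).
    piw(GeneId, result(GeneId)) :- atom(GeneId).

    % PIW := map(piw) over [GeneId]
    piw_all(GeneIds, Results) :- maplist(piw, GeneIds, Results).

    % ?- piw_all([gene1, gene2, gene3], Rs).
    %    Rs = [result(gene1), result(gene2), result(gene3)]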

B. Ludäscher – Scientific Data Management 48

Optimization by Declarative Rewriting

• PIW as a declarative, referentially transparent functional process
  – optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)
• Details:
  – technical report & PIW specification in Haskell

Callouts: map(f o g) instead of map(f) o map(g); combination of map and zip

http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
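The rewrite itself, sketched with Prolog's maplist/3 standing in for map (the actual PIW specification is in Haskell, per the report linked above); the two invented step predicates are for illustration only:

    double(X, Y) :- Y is 2 * X.
    incr(X, Y)   :- Y is X + 1.

    % map(g) then map(f): two passes over the list/stream.
    two_passes(Xs, Zs) :- maplist(incr, Xs, Ys), maplist(double, Ys, Zs).

    % Fused map(f o g): one pass, same result.
    incr_then_double(X, Z) :- incr(X, Y), double(Y, Z).
    one_pass(Xs, Zs) :- maplist(incr_then_double, Xs, Zs).

    % ?- two_passes([1,2,3], Zs).   % Zs = [4,6,8]
    % ?- one_pass([1,2,3],  Zs).    % Zs = [4,6,8]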

B. Ludäscher – Scientific Data Management 49

Web Services & Scientific Workflows in Kepler

• Web services = individual components (“actors”)
• “Minute-Made” Application Integration:
  – plugging in and harvesting web service components is easy and fast
• Rich SWF modeling semantics (“directors” and more):
  – different and precise dataflow models of computation
  – clear and composable component interaction semantics ⇒ web service composition and application integration tool
• Coming soon:
  – shrink-wrapped, pre-packaged “Kepler-to-Go” (v0.8)
  – SWFs with structural and semantic data types (better design support)
  – grid-enabled web services (for big data, big computations, …)
  – different deployment models (SWF ⇒ WS, web site, applet, …)

B. Ludäscher – Scientific Data Management 50

KEPLER Core Capabilities (1/2)

• Designing scientific workflows
  – composition of actors (tasks) to perform a scientific WF
  – actor prototyping
• Accessing heterogeneous data
  – data access wizard to search and retrieve Grid-based resources
  – relational DB access and query
  – ability to link to EML data sources

B. Ludäscher – Scientific Data Management 51

KEPLER Core Capabilities (2/2)

• Data transformation actors to link heterogeneous data
• Executing scientific workflows
  – distributed and/or local computation
  – various models for computational semantics and scheduling
  – SDF and PN: most common for scientific workflows
• External computing environments:
  – C++, Python, C (… Perl planned …)
• Deploying scientific tasks and workflows as web services themselves (… planned …)

B. Ludäscher – Scientific Data Management 52

The KEPLER GUI (Vergil)

Drag and drop utilities, director and actor libraries.

B. Ludäscher – Scientific Data Management 53

Running the workflow

B. Ludäscher – Scientific Data Management 54

Distributed SWFs in KEPLER

• Web and Grid Service plug-ins
  – WSDL, and whatever comes after GWSDL
  – ProxyInit, GlobusGridJob, GridFTP, DataAccessWizard
• WS Harvester
  – imports all the operations of a specific WS (or of all the WSs in a UDDI repository) as Kepler actors
• WS-deployment interface (… ongoing work …)
• XSLT and XQuery transformers to link non-fitting services together

B. Ludäscher – Scientific Data Management 55

A Generic Web Service Actor

Given a WSDL and the name of an operation of a web service, dynamically customizes itself to implement and execute that method.

Configure - select service operation

B. Ludäscher – Scientific Data Management 56

Set Parameters and Commit

Set parameters and commit

B. Ludäscher – Scientific Data Management 57

WS Actor after Instantiation

B. Ludäscher – Scientific Data Management 58

Web Service Harvester

• Imports the web services in a repository into the actor library.
• Has the capability to search for web services based on a keyword.

B. Ludäscher – Scientific Data Management 59

Composing 3rd-Party WSs

Output of previous web service

User interaction & transformations

Input of next web service

B. Ludäscher – Scientific Data Management 60

Classifying with Kepler

B. Ludäscher – Scientific Data Management 61

Classifying with Kepler

B. Ludäscher – Scientific Data Management 62

B. Ludäscher – Scientific Data Management 63

SWF Designed in Kepler

B. Ludäscher – Scientific Data Management 64

Result launched via the BrowserUI actor

Querying Example

B. Ludäscher – Scientific Data Management 66

KEPLER and YOU

• Kepler …
  – is a community-based, cross-project, open source collaboration
  – uses web services as basic building blocks
  – has a joint CVS repository, mailing lists, web site, …
  – is gaining momentum thanks to contributors and contributions
• BSD-style license allows commercial spin-offs
  – a pre-packaged, shrink-wrapped version (“Kepler-to-Go”) coming soon to a place near you …

Now back to the “Semantics Stuff”

B. Ludäscher – Scientific Data Management 68

Semantic Types for Scientific Workflows

B. Ludäscher – Scientific Data Management 69

From Semantic to Structural Mappings

B. Ludäscher – Scientific Data Management 70

Structural and Semantic Mappings

B. Ludäscher – Scientific Data Management 71

Summary I: Putting it all together for the Science Environment for Ecological Knowledge (SEEK)

• Large collaborative NSF/ITR project: UNM, UCSB, UCSD, UKansas, …
• Goals: global access to ecologically relevant data; rapidly locate and utilize distributed computation; (semi-)automate and streamline the analysis process – “Knowledge Discovery Workflows”

[SEEK architecture figure – recoverable labels:
  AM: Analysis & Modeling System (KEPLER) – analytical pipeline (AP) with analysis steps ASx, ASy, ASz, ASr and transformation steps TS1, TS2; example of “AP0”; library of analysis steps, pipelines & results (e.g., invasive species over time); execution environment (SAS, MATLAB, FORTRAN, etc.)
  SMS: Semantic Mediation System – semantic mediation engine, data binding, query processing; logic rules (ECO2-CL); parameter ontologies (ECO2, TaxOn); parameters w/ semantics
  EcoGrid – provides unified access to distributed data stores (SRB, KNB, MC, Species, WrpDar, …), parameter ontologies, & stored analyses, and runtime capabilities via the execution environment; raw data sets wrapped for integration w/ EML, etc.; accessed via WSDL/UDDI]

The Semantic Mediation System & Analysis and Modeling System use WSDL/UDDI to access services within the EcoGrid, enabling analytically driven data discovery and integration.

SEEK is the combination of EcoGrid data resources and information services, coupled with advanced semantic and modeling capabilities.

B. Ludäscher – Scientific Data Management 72

Outline

1. Motivation: Traditional vs Scientific Data Integration

2. Semantic (a.k.a. Model-Based) Mediation

3. Scientific Workflows (a.k.a. Analysis Pipelines)

4. DB Theory Appetizer: Web Service Composition Through Declarative Queries

B. Ludäscher – Scientific Data Management 73

Planning with Limited Access Patterns (back to GAV mediation …)

• User query Q:
    answer(ISBN, Author, Title) ←
        book(ISBN, Author, Title), catalog(ISBN, Author), not library(ISBN).

• Limited (web service) APIs (access patterns):
  – Src1.books:   in: ISBN    out: Author, Title
  – Src1.books:   in: Author  out: ISBN, Title
  – Src2.catalog: in: {}      out: ISBN, Author
  – Src3.library: in: {}      out: ISBN

• Note: Q is not executable, but feasible (equivalent to an executable Q’: catalog ; book ; not library)
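A hedged, executable sketch of the feasible plan Q’ with stub sources; the access patterns are only indicated in comments, since Prolog itself does not enforce them:

    src2_catalog(isbn1, austen).              % in: {}     out: ISBN, Author
    src3_library(isbn2).                      % in: {}     out: ISBN
    src1_book(isbn1, austen, 'Emma').         % in: ISBN   out: Author, Title
                                              % (callable only with ISBN bound)

    answer(ISBN, Author, Title) :-
        src2_catalog(ISBN, Author),       % binds ISBN and Author
        src1_book(ISBN, Author, Title),   % access pattern respected: ISBN bound
        \+ src3_library(ISBN).            % safe negation: ISBN is bound here

    % ?- answer(I, A, T).   % I = isbn1, A = austen, T = 'Emma'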

B. Ludäscher – Scientific Data Management 74

Query Feasibility is as Hard as Containment

• Theorem [EDBT’04]: For UCQ¬ queries Q: Q is feasible iff ans(Q) ⊑ Q
• The answerable part ans(Q) can be computed in quadratic time.
  – Idea: scan Q for answerable literals, rescan, repeat until ans(Q) is reached
• Checking query containment Q1 ⊑ Q2 is hard:
  – already NP-complete for CQ (conjunctive queries)
  – undecidable for FO (first-order logic queries)
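A simplified sketch of the “scan and rescan” idea behind ans(Q) (invented data structures, not the paper's algorithm): repeatedly pick a literal all of whose required inputs are already bound by earlier picks.

    :- use_module(library(lists)).

    % Access-pattern view of Q's literals: literal(Name, NeededBound, Produces).
    % The negated library literal is listed as needing ISBN bound (safe negation).
    literals([ literal(catalog, [],     [isbn, author]),
               literal(book,    [isbn], [author, title]),
               literal(library, [isbn], []) ]).

    answerable_part(Ls, Ordered) :- grow(Ls, [], [], Ordered).

    % Repeatedly select a literal whose needed arguments are already bound.
    grow(Ls, Bound, Acc, Ordered) :-
        select(literal(N, In, Out), Ls, Rest),
        subset(In, Bound), !,
        union(Bound, Out, Bound1),
        grow(Rest, Bound1, [N|Acc], Ordered).
    grow(_, _, Acc, Ordered) :- reverse(Acc, Ordered).

    % ?- literals(Ls), answerable_part(Ls, Plan).   % Plan = [catalog, book, library]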

B. Ludäscher – Scientific Data Management 75

Conjunctive Query Containment

• Given: conjunctive queries Q1, Q2 (a.k.a. Select-Project-Join queries)
• Problem: Is answers(D, Q1) ⊆ answers(D, Q2) for all databases D?
• If yes, we say that “Q1 is contained in Q2”; short: Q1 ⊑ Q2
• Examples:
    Q1: answer(X) ← student(X, cs)
    Q2: answer(X) ← student(X, Dept), advisor(X, Y), dept(Y, cs)
    Q3: answer(X) ← student(X, Dept)
• Quiz:
  – Q1 ⊑ Q2?
    No: not every student X necessarily has an advisor Y who is in the cs department!
  – Q1 ⊑ Q3?
    Yes: every cs student is a student in some department (crux of the “proof”: Dept = cs)
  – Homework: What about Q1 ⊑ Q2 if we know that every student must have an advisor from the same department?

B. Ludäscher – Scientific Data Management 76

The World’s Shortest Conjunctive Query Containment Checker (an NP-complete problem): 7 lines in Prolog …

Quiz:
1. Find the bug in the 7 lines of code
2. Fix the bug (hint: add one more line of code)

Moral: Short programs can be buggy too
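The seven lines themselves are only an image on the original slide and are not reproduced here. Below is an independent, minimal sketch of the standard test (freeze Q1 into a canonical database and evaluate Q2 over it), with queries written as q(HeadArgs, BodyAtoms); it is not the program from the slide.

    % Q1 is contained in Q2 iff Q2, evaluated over the frozen body of Q1,
    % returns the frozen head of Q1 (canonical-database / homomorphism test).
    contained(Q1, Q2) :-
        copy_term(Q1, q(FH, FB)),
        numbervars(q(FH, FB), 0, _),   % freeze Q1: variables become constants
        copy_term(Q2, q(H2, B2)),      % fresh copy of Q2
        H2 = FH,                       % Q2 must return Q1's frozen head ...
        all_in(B2, FB).                % ... using only atoms of Q1's frozen body

    all_in([], _).
    all_in([A|As], Db) :- member(A, Db), all_in(As, Db).

    % ?- contained(q([X], [student(X, cs)]), q([Y], [student(Y, _D)])).   % true
    % ?- contained(q([X], [student(X, cs)]),
    %              q([Y], [student(Y, D), advisor(Y, Z), dept(Z, cs)])).  % false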

B. Ludäscher – Scientific Data Management 77

Summary II: Got milk/eggs/meat/wool? Or: “Die eierlegende Wollmilchsau …” (the proverbial egg-laying, wool-and-milk-giving pig: one system that does it all)

• Data Integration
  – query rewriting under GAV/LAV
  – w/ binding pattern constraints
  – distributed query processing
• Semantic Mediation
  – semantic integrity constraints, reasoning w/ plans, automated deduction
  – deductive database / logic programming technology, AI “stuff” …
  – Semantic Web technology
• Scientific Workflow Management
  – more procedural than database mediation (the scientist is the “query planner”)
  – deployment using web services

B. Ludäscher – Scientific Data Management 78

Science Environment for Ecological Knowledge (SEEK)

• Large collaborative NSF/ITR project: UNM, UCSB, UCSD, UKansas, …
• Goals: global access to ecologically relevant data; rapidly locate and utilize distributed computation; (semi-)automate and streamline the analysis process – “Knowledge Discovery Workflows”

[Same SEEK architecture figure as in Summary I above: EcoGrid data stores, Semantic Mediation System (SMS), and Analysis & Modeling System (AM, KEPLER), connected via WSDL/UDDI.]

B. Ludäscher – Scientific Data Management 79

Building the EcoGrid

[Map figure – node types: Metacat node, SRB node, DiGIR node, VegBank node, Xanthoria node, legacy system; example LTER sites: AND, LUQ, HBR, NTL, VCR]

Participating networks:
• LTER Network (24)
• Natural History Collections (>> 100)
• Organization of Biological Field Stations (180)
• UC Natural Reserve System (36)
• Partnership for Interdisciplinary Studies of Coastal Oceans (4)
• Multi-agency Rocky Intertidal Network (60)

Source: Matthew Jones (UCSB)

B. Ludäscher – Scientific Data Management 80

Heterogeneous Data integration

• Requires advanced metadata and processing:
  – attributes must be semantically typed
  – collection protocols must be known
  – units and measurement scale must be known
  – measurement relationships must be known, e.g., that ArealDensity = Count / Area
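For instance, a known measurement relationship can be used directly as an integration rule. A sketch with invented predicates and values:

    observed_count(plot1, 12).          % dataset A: raw counts per plot
    plot_area_m2(plot1, 4.0).
    observed_density(plot2, 2.5).       % dataset B: already counts per m^2

    % Bring both datasets onto a common scale via ArealDensity = Count / Area.
    areal_density(Plot, D) :- observed_density(Plot, D).
    areal_density(Plot, D) :-
        observed_count(Plot, N),
        plot_area_m2(Plot, A),
        D is N / A.

    % ?- areal_density(P, D).   % plot2 at 2.5 and plot1 at 3.0 (12 / 4.0)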

B. Ludäscher – Scientific Data Management 81

Ecological ontologies

• What was measured (e.g., biomass)
• Type of measurement (e.g., energy)
• Context of measurement (e.g., Psychotria limonensis)
• How it was measured (e.g., dry weight)

• SEEK intends to enable community-created ecological ontologies using OWL
  – represents a controlled vocabulary for ecological metadata
• More about this in Bertram’s talk

B. Ludäscher – Scientific Data Management 82

Semantic Mediation

• Label data with semantic types (e.g., concept expressions in OWL)
• Label inputs and outputs of analytical components with semantic types
• Use reasoning engines to generate transformation steps
  – observe analytical constraints
• Use reasoning engines to discover relevant components

[Figure: Data – Ontology – Workflow Components]
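A minimal sketch of the idea behind semantically typed component ports; all actor, port, and concept names are invented. An output may be wired to an input if the output's semantic type is subsumed by the input's.

    % Concept hierarchy fragment (from an ontology such as the SEEK one).
    subclass_of(species_count, measurable_item).
    subclass_of(species_abundance, measurement).

    is_a(C, C).
    is_a(C, D) :- subclass_of(C, E), is_a(E, D).

    % Semantic types of actor ports: port(Actor, PortName, Direction, Concept).
    port(sampler,   counts_out, out, species_count).
    port(estimator, items_in,   in,  measurable_item).

    % An output port may feed an input port if its concept is subsumed
    % by the input port's concept.
    compatible(A1, P1, A2, P2) :-
        port(A1, P1, out, C1),
        port(A2, P2, in,  C2),
        is_a(C1, C2).

    % ?- compatible(sampler, counts_out, estimator, items_in).   % true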