
Page 1: Managing Scientific Data: From Data Integration to Scientific Workflows (users.sdsc.edu/~ludaesch/Paper/gsa-sms.pdf)

Managing Scientific Data:

From Data Integration to Scientific Workflows∗

Bertram Ludäscher† Kai Lin† Shawn Bowers† Efrat Jaeger-Frank†

Boyan Brodaric‡ Chaitan Baru†

Abstract

Scientists are confronted with significant data-management problems due to the large volume and high complexity of scientific data. In particular, the latter makes data integration a difficult technical challenge. In this paper, we describe our work on semantic mediation and scientific workflows, and discuss how these technologies address integration challenges in scientific data management. We first give an overview of the main data-integration problems that arise from heterogeneity in the syntax, structure, and semantics of data. Starting from a traditional mediator approach, we show how semantic extensions can facilitate data integration in complex, multiple-worlds scenarios, where data sources cover different but related scientific domains. Such scenarios are not amenable to conventional schema-integration approaches. The core idea of semantic mediation is to augment database mediators and query evaluation algorithms with appropriate knowledge-representation techniques to exploit information from shared ontologies. Semantic mediation relies on semantic data registration, which associates existing data with semantic information from an ontology. The Kepler scientific workflow system addresses the problem of synthesizing, from existing tools and applications, reusable workflow components and analytical pipelines to automate scientific analyses. After presenting core features and example workflows in Kepler, we present a framework for adding semantic information to scientific workflows. The resulting system is aware of semantically plausible connections between workflow components as well as between data sources and workflow components. This information can be used by the scientist during workflow design, and by the workflow engineer for creating data transformation steps between semantically compatible but structurally incompatible analytical steps.

∗Work supported by NSF/ITR 0225673 (GEON), NSF/ITR 0225676 (SEEK), NIH/NCRR 1R24 RR019701-01 Biomedical Informatics Research Network (BIRN-CC), and DOE SciDAC DE-FC02-01ER25486 (SDM)

†San Diego Supercomputer Center, University of California, San Diego, {ludaesch,lin,bowers,efrat,baru}@sdsc.edu

‡Natural Resources of Canada, [email protected]

Contents

1 Introduction 2

2 Integration Challenges 3

3 Integration Examples 4
3.1 Geologic-Map Data Integration . . . . 4
3.2 Mineral Classification Workflow . . . . 6

4 Data Integration 7
4.1 Mediator Approach . . . . . . . . . . . 7
4.2 Semantic Mediation . . . . . . . . . . 8
4.3 Semantic Data Registration . . . . . . 11

5 Scientific Workflows 15
5.1 Scientific Workflows in Kepler . . . . 15
5.2 Gravity Modeling Workflow . . . . . . 15
5.3 Semantic Workflow Extensions . . . . 17

6 Conclusions 19


1 Introduction

Information technology is revolutionizing the way many sciences are conducted, as witnessed by new techniques, results, and discoveries in multi-disciplinary and information-driven fields such as bioinformatics, ecoinformatics, and geoinformatics. The opportunities provided by these new information-driven and often data-intensive sciences also bring with them equally large challenges for scientific data management. For example, to answer a specific scientific question or to test a certain hypothesis, a scientist today not only needs profound domain knowledge, but also may require access to data and information provided by others via community databases or analytical tools developed by community members.

A problem for the scientist is how to easily make use of the increasing number of databases, analytical tools, and computational services that are available. Besides making these items generally accessible to scientists, leveraging these resources requires techniques for data integration and system interoperability. Traditionally, research by the database community in this area has focused on problems of heterogeneous systems, data models, and schemas [She98]. However, the integration scenarios considered differ significantly from those encountered in scientific data integration today. In particular, the former usually involve "one-world" scenarios, where the goal is to provide an integration schema or integrated view over multiple sources having a single, conceptual domain. Online comparison shopping for the cheapest books is an example of a one-world scenario: the different database schemas or web services to be integrated all deal with the same book attributes (e.g., title, authors, publishers, and price).

Compare this situation to the scientific information-integration scenario depicted in Figure 1. A scientist (here, an igneous petrologist) is interested in the distribution of a certain rock type (say A-type plutons) within a specific region. He also wants to know the 3D geometry of those plutons and understand their relation to the host rock structures. Through databases and analytical tools, our scientist can gather valuable information towards answering this scientific question. For example, geologic maps of the region, geophysical databases with gravity contours, foliation maps, and geochemical databases all provide pieces of information that need to be brought together in an appropriate form to be useful to the scientist (see Figure 1).

We call this integration example a complex multiple-worlds scenario [LGM03]. In particular, it is not possible or meaningful to integrate the schemas of the data sources into a single common schema because the various data sources contain disjoint information. However, there are often latent links and connections between these data sources that are made by scientists. Through these implicit knowledge structures, the various pieces of information can be "glued" together to help answer scientific questions. Making these knowledge structures, i.e., the "glue," explicit is therefore a prerequisite for connecting the underlying data. In scientific domains, ontologies can be seen as the formal representation of such knowledge, to be used for data and knowledge integration purposes. The problem in the geoscience domain, however, is that most of this knowledge is either implicit or represented in narrative form in textbooks, and thus not readily available for use in mediation systems as required.

Fig. 1: Complex “multiple-worlds” integration.

As a simple example, consider a geologic map and a geochemical database (see the bottom left of Figure 1). We can link these sources thematically by establishing associations (i) between the formations in the geologic map and their constituent rock types, as indicated in the map legend and associated reports, and likewise (ii) between rock types and their mineral compositions, using, e.g., the spatial overlap of formations and geochemical samples. By making these associations explicit, and recording them in domain ontologies, we can establish interesting linkages in complex multiple-worlds scenarios [LGM01, LGM03].
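As a rough sketch of these two association steps, the following Python fragment links map formations to mineral compositions via an intermediate rock-type association. All names and values here are hypothetical, invented purely for illustration:

```python
# "Glue" associations linking two otherwise disjoint sources.
# (i) formation -> rock type, as found in map legends/reports (assumed values)
formation_rock_type = {
    "Livingston": "andesitic sandstone",
}

# (ii) rock type -> mineral composition, e.g., from spatially overlapping
# geochemical samples (assumed values)
rock_type_minerals = {
    "andesitic sandstone": ["quartz", "plagioclase"],
}

def minerals_for_formation(formation):
    """Follow both association steps to connect map data to geochemistry."""
    rock = formation_rock_type.get(formation)
    return rock_type_minerals.get(rock, [])
```

Once such associations are recorded explicitly (e.g., in a domain ontology rather than Python dictionaries), a mediator can traverse them automatically instead of relying on the scientist's implicit knowledge.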

In this paper, we focus on two important aspects of scientific data management: data integration and scientific workflows. For the former, we present a framework that extends the traditional mediator-based approach to data integration with ontologies to yield a semantic-mediation framework for scientific data integration. In this framework, ontologies are


used to bridge the gap between "raw" data stored in databases and the knowledge level at which domain scientists work, trying to answer the given scientific questions (see the left side of Figure 1).

Data integration and knowledge-based extensions such as semantic mediation deal with modeling and querying database systems, as opposed to the interoperation of analytical tools or the assembly of data sources and computational services into larger scientific workflows. As we will illustrate below, however, scientific workflows can also benefit from knowledge-based extensions. In fact, an important goal of building cyberinfrastructure for scientific knowledge discovery is to devise integrated problem-solving environments that bring to the scientist's desktop the combined power of remote data sources, services, and computational resources from the Grid [FK99, Fos03, BFH03].

The rest of this paper is organized as follows. In Section 2 we provide a high-level overview of information integration and interoperability challenges in scientific data management. We then present two integration examples in Section 3. The first example illustrates the use of ontologies in scientific data integration using geologic maps (Section 3.1). Additional knowledge represented in a geologic-age or rock-type ontology is used to create different conceptual views of geologic maps and allows the user to query the information in novel ways. Section 3.2 illustrates scientific process integration and combines data-access, data-analysis, and visualization steps into a scientific workflow.

In Section 4 we give an overview of data-integration approaches and discuss technical issues surrounding semantic mediation. Section 5 gives an introduction to scientific workflows and a brief overview of the Kepler system. In Section 5.3 we show that semantic extensions not only benefit data integration, but can also provide new opportunities and challenges for scientific workflows. We conclude with a brief summary of our work and findings in Section 6.

2 Integration Challenges

In this section we provide a high-level overview of the data-integration and system-interoperability challenges that often await a scientist who wants to employ IT infrastructure for scientific knowledge discovery. Many of these challenges arise from heterogeneities that occur across the systems, data sources, and services that make up a scientific data management infrastructure.

[Figure 2 depicts the interoperability levels as a layered stack: System Integration, Syntax, Structure, Semantics, and Synthesis.]

Fig. 2: Interoperability challenges.

Data heterogeneity has traditionally been divided into syntax, structure, and semantic differences [She98, VSSV02]. Scientific data analysis and information-integration scenarios like the one depicted in Figure 1 often involve additional levels, which include low-level system integration issues as well as higher-level synthesis issues. We briefly discuss these various levels of heterogeneities and interoperability challenges below (see Figure 2).

System aspects. By system aspects we mean differences in low-level issues relating to, e.g., network protocols1 (e.g., http, ftp, GridFTP, SOAP), platforms (operating systems), remote execution methods (e.g., web services, RMI, CORBA), and authorization and authentication mechanisms (e.g., Kerberos, GSI). Many efforts for cyberinfrastructure aim at facilitating system interoperability by providing a common Grid middleware infrastructure [FK99, Fos03, BFH03].

System entities requiring management include certificates, file handles, and resource identifiers.

Syntax. Data that is not stored in databases, or that is exchanged between applications, can come in different representations (e.g., raster or vector) and file formats (e.g., netCDF, HDF, and ESRI shapefile).2 The use of XML as a uniform exchange syntax can help resolve syntax differences, but many specialized formats prevail for practical reasons (e.g., because of tool support and efficiency).

Syntactic entities requiring management include individual files in various formats and format conversion tools.

Structure. Structure differences can arise when similar data is represented using different schemas.

1These are organized into a layered stack of communication protocols themselves, e.g., the TCP/IP stack and the ISO/OSI layered architecture.

2There are also different database representations, or data models, but most scientific applications today use relational database systems.


The problem of schema integration is a heavily studied area in databases [She98]. There are several difficulties to overcome, e.g., how to derive an integration schema, how to find and define the right mappings between source schemas and the integration schema, how to efficiently evaluate queries against virtual schemas (i.e., schemas that are not materialized), and how to deal with incomplete and conflicting information. In Section 4.1 we give a brief overview of the mediator approach, which addresses some of these issues.

Structure entities requiring management include database schemas and queries (e.g., in SQL or XQuery).

Semantics. Storing scientific data in a database system provides solutions to a number of technical challenges (e.g., multi-user access, transaction control) and simplifies others (e.g., query execution time can be improved, and certain structural heterogeneities can be resolved by defining queries that map between schemas, called views). However, languages for defining database schemas are not expressive enough to adequately capture most semantic aspects of data. For example, information about the kinds of objects being stored, and how those objects relate to each other or to general concepts of the domain, cannot be easily or adequately expressed. Some conceptual-level information can be captured during formal database design (e.g., via an ER conceptual model [Che76]). But this information is rarely connected explicitly to the database, and thus it cannot be used to query for data and is not otherwise usable by a database system. Traditional metadata can provide a limited form of data semantics and can help a scientist understand the origin, scope, and context of a dataset.

However, to assess the applicability of different datasets and integrate them for a certain study, many possible semantic heterogeneities among the datasets have to be considered: What were the measurement or experimentation parameters? What protocols were used? What is known about the accuracy of the data? And most fundamentally, what concepts and associations are encoded by the data? The usability of data can be improved significantly by making explicit what is known about a dataset (i.e., by describing its semantics), and doing so in a form that facilitates automated processing.

Semantic entities requiring management include concepts and relationships from one or more ontologies (e.g., using a standard language such as OWL [DOS04, Usc98, OWL03] or through controlled vocabularies) and data annotations to these ontologies.

Synthesis. By synthesis we mean the problem of putting together databases, including semantic extensions, queries and transformations, and other computational services into a scientific workflow. The problem of synthesis of such workflows encompasses all previous challenges. For example, if a scientist wants to put together two processing steps A and B into the following simple analysis pipeline

x → A --d--> B → y

many questions arise: In what format does A expect its input x? Does the output d of A directly "fit" the format of B, or is a data transformation necessary? In addition to these syntactic and structural heterogeneities, system and semantic issues exist as well. For example, what mechanism should be used to invoke processes, and how should data be shipped from A to B? Is it meaningful and valid to connect A and B in this way?

The main challenges for synthesis include process composition and the modeling, design, and execution of reusable process components and scientific workflows.
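The data-transformation question raised above can be sketched in a few lines of Python. The steps `step_a` and `step_b` and the transformation `to_pairs` are hypothetical stand-ins for real workflow components, chosen only to make the structural mismatch concrete:

```python
# Sketch of the pipeline x -> A --d--> B -> y: A emits a dict, B expects a
# list of (key, value) pairs, so a small transformation step ("shim")
# bridges the structural mismatch between the two.

def step_a(x):
    """Hypothetical step A: produces its output d as a dict."""
    return {"value": x, "unit": "mGal"}

def to_pairs(d):
    """Shim: convert A's dict output into the pair list that B expects."""
    return sorted(d.items())

def step_b(pairs):
    """Hypothetical step B: consumes a list of (key, value) pairs."""
    return [f"{k}={v}" for k, v in pairs]

def pipeline(x):
    return step_b(to_pairs(step_a(x)))
```

The shim resolves only the structural mismatch; whether connecting A and B is semantically meaningful is a separate question, taken up in Section 5.3.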

3 Integration Examples

In this section we provide two different examples, illustrating new capabilities of data integration using semantic extensions (Section 3.1), and of process integration via scientific workflows (Section 3.2). The underlying technologies are discussed in Section 4 and Section 5, respectively.

3.1 Geologic-Map Data Integration

Geologic maps present information about the history and character of rock bodies, their intersection with the Earth's surface, and their formative processes. Geologic maps are an important data product for many geoscientific investigations. In the following, we describe the Ontology-enabled Map Integrator (OMI) system prototype [LL03, LLB+03, LLBB03] that was built as one of the first scientific data integration systems in the GEON project [GEO]. The goal of the system is to integrate information from a number of geologic maps from different state geological surveys and to provide a uniform interface to query the integrated information. Queries are "conceptual," i.e., they are expressed using terms from a shared ontology. For example, it is possible to query the integrated data using concepts such as geologic age and rock type, as opposed to the terms used to describe these attributes in the underlying data. For the OMI


prototype, data from nine different states were integrated. The state geologic maps were initially given as ESRI shapefiles [ESR98] (thus, there were no syntactic heterogeneities). However, each shapefile came with a different relational schema, i.e., the data had structural heterogeneity. A shapefile can be understood as a relational table having one column that holds the spatial object (e.g., a polygon), and a number of data columns holding attribute values that describe the spatial object. For example, the relational schema of the Arizona shapefile is:

Arizona(AREA, PERIMETER, AZ_1000, AZ_1000_ID, GEO, PERIOD, ABBREV, DESCR, D_SYMBOL, P_SYMBOL)

Here, AREA is the spatial column, while PERIOD and ABBREV are data columns with information on the geologic age and formation of the represented region, respectively. The maps from the other states are each structured slightly differently. For example, the Idaho shapefile has 38 columns and contains detailed lithology information, describing rocks by their color, mineralogic composition, and grain size.

[Figure 3 shows the nine state schemas (Arizona, Colorado, Utah, Nevada, Wyoming, New Mexico, Montana E., Montana West, Idaho) arranged around the integration schema. Local formation attributes (ABBREV, NAME, TYPE, FMATN, FORMATION) and age attributes (PERIOD, TIME_UNIT, AGE) are mapped to the global Formation and Age attributes; Idaho and Montana West additionally contribute LITHOLOGY information, mapped to Texture, Fabric, and Composition, e.g., "andesitic sandstone" for the "Livingston formation" of "Tertiary-Cretaceous" age.]

Fig. 3: Schema integration in the OMI prototype: Some elements of the local schemas (outside) are mapped to the integration schema (inside).

Figure 3 shows the association between the columns of the different source schemas and those of the integration schema. Note that the Formation attribute (column) in the integrated schema (in the center of Figure 3) is derived from different attributes of the source schemas, i.e., Arizona.ABBREV, Colorado.NAME, Utah.TYPE, and so on. The Idaho and West Montana sources provide additional detailed information on lithology (e.g., andesitic sandstone). The latter can be used to derive further information on the rock type associated with a region, such as mineral and chemical composition, fabric, and texture. This assumes that a corresponding ontology of rock types is given. Using such an ontology we can infer that regions of the map marked as andesitic sandstone are also regions with sedimentary rock (because sandstone is a sedimentary rock) and that their modal mineral composition is within a certain range, e.g., Q/(Q+A+P) < 20% or F/(F+A+P) < 10%, P/(A+P) > 90% and M < 35 [Joh02].
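The inference above amounts to following is-a edges in the rock-type ontology. A minimal Python sketch, over an assumed two-edge fragment of such an ontology (the edges shown are taken from the example in the text; a real ontology would be far larger):

```python
# Tiny assumed fragment of a rock-type ontology: direct is-a edges.
is_a = {
    "andesitic sandstone": "sandstone",
    "sandstone": "sedimentary rock",
}

def superclasses(rock_type):
    """All ancestors of rock_type under the is-a relation, nearest first."""
    result = []
    while rock_type in is_a:
        rock_type = is_a[rock_type]
        result.append(rock_type)
    return result

def is_kind_of(rock_type, concept):
    """True if rock_type is (a subtype of) concept."""
    return rock_type == concept or concept in superclasses(rock_type)
```

With this, a query for "sedimentary rock" regions also returns regions labeled only "andesitic sandstone", just as described for the map data above.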

Fig. 4: Results of a query for regions with "Paleozoic" age: without ontology (left), and with ontology (right) [LL03].

Concept-based ("Semantic") Queries. The results of a simple conceptual-level query, asking for all regions with geologic age Paleozoic, are shown in Figure 4. Recall that all nine maps have geologic-age information; nevertheless, few results are found when doing a simple lookup for Paleozoic in the Age column (see the left side of Figure 4). This occurs because by only looking for Paleozoic, we have not taken into account the domain knowledge that other geologic ages such as Cambrium and Devon also fall within the Paleozoic. By using a corresponding geologic-age ontology, the system can rewrite the original user query into one that looks for Paleozoic and all its "sub-ages". The result of this ontology-enabled query is shown on the right in Figure 4, where a more complete set of regions is returned.

An important prerequisite for such knowledge-based extensions of data integration systems is semantic data registration [BLL04a]. In a nutshell, data objects (here, the polygons making up the spatial regions) must be associated with concepts from a previously registered ontology. In the system, we use several such ontologies, for geologic age (derived from [HAC+89]) and for rock-type classification (derived from [SQDO02] and [GSR+99]). In Section 4.3, we discuss further details of data registration and revisit the geologic-map integration example.


Fig. 5: Mineral Classification workflow (left) and generated interactive result displays (right).

3.2 Mineral Classification Workflow

The previous integration example was data-centric and made use of domain knowledge by using an ontology to answer a conceptual-level query. The second integration example is process-centric and illustrates the use of a scientific workflow system for automating an otherwise manual data-analysis procedure, or alternatively, for reengineering an existing data-analysis tool in a more generic and extensible environment.

The upper left window in Figure 5 shows the top-level workflow, where data points are selected from a database of mineral compositions of igneous rock samples. This data, together with a set of classification diagrams, is fed into a Classifier subworkflow (bottom left). The manual process of classifying samples involves determining the position of the sample values in a series of diagrams such as the ones shown on the right in Figure 5. If the location of a sample point in a non-terminal diagram of order n has been determined (e.g., diorite gabbro anorthosite, bottom right), the corresponding diagram of order n+1 is consulted and the point is located therein. This process is iterated until the terminal level of diagrams is reached. The result is shown in the upper right of Figure 5, where the classification result is anorthosite.
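The iterative diagram lookup can be sketched as a walk over nested diagrams: each non-terminal diagram sends the sample point to a diagram of the next order, until a terminal diagram yields a rock name. The region tests and threshold values below are hypothetical placeholders, not the actual classification diagrams:

```python
# Sketch of the iterative classification: a diagram maps a sample point
# either to the next diagram to consult (non-terminal) or to a rock name
# (terminal).

def classify(point, diagram):
    """Walk diagrams of order n, n+1, ... until a terminal name is reached."""
    outcome = diagram(point)
    while callable(outcome):      # non-terminal: outcome is the next diagram
        outcome = outcome(point)
    return outcome                # terminal: outcome is a rock name

# Hypothetical two-level diagram series with made-up region tests.
def order2(point):
    return "anorthosite" if point["plagioclase"] > 90 else "gabbro"

def order1(point):
    return order2 if point["mafic"] < 35 else "ultramafic rock"
```

In the Kepler workflow this walk is performed by the Classifier subworkflow, with the diagrams supplied as data rather than hard-coded functions.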

This traditionally manual process has been automated in specialized commercial tools. Here, we show how the open-source Kepler workflow system [KEP, LAB+04] can be used to implement this workflow in a more open and generic way (Figure 5). The workflow is shown in graphical form using the Vergil user interface [BLL+04b].3 Note that in Vergil, workflows can be annotated with user comments. Workflows can be arbitrarily nested, and subworkflows (e.g., shown in the bottom-left of the figure) become visible by "looking inside" a composite actor.4 The box labeled Classifier is a composite actor. Vergil also features simple VCR-like control buttons to play, pause, resume, and stop workflow execution.

Kepler-specific features of this workflow include a searchable library of actors and data sources (Actor and Data tabs close to the upper-left) with numerous reusable Kepler actors. For example, the Browser

3Kepler is an extension of the Ptolemy ii system and inherits many of its features, including the Vergil GUI.

4Following Ptolemy ii terminology, a workflow component, whether atomic or composite, is called an actor in Kepler.


actor (used in the bottom-right of the Classifier subworkflow) launches the user's default browser and can be used as a powerful generic input/output device in any workflow. In this example, the classification diagrams are generated on the client side as interactive SVG displays in the browser (windows on the right in Figure 5). Moving the mouse over the diagram highlights the specific region and displays the rock-name classification(s) for that particular region. The Browser actor has proven to be very useful in many other workflows as well, e.g., as a device to display results of a previous step and as a selection tool that passes user-requested data to subsequent steps using well-known HTML forms, check-boxes, etc.

4 Data Integration

In this section we first introduce the mediator approach to data integration and then show how it can be extended to a semantic mediation approach by incorporating ontologies as a knowledge representation formalism. We also present formal foundations of semantic data registration, which is an important prerequisite for semantic mediation.

4.1 Mediator Approach

Data integration traditionally deals with structural heterogeneities due to different database schemas and with the problem of how to provide uniform user access to the information from different databases [She98] (see Section 2). The standard approach, depicted in Figure 6, is to use a database mediator system [Wie92, GMPQ+97, Lev00].

In a mediator system, instead of interacting directly with a number of local data sources S1, . . . , Sk, each one having its own database schema Vi (also called the exported view of Si), the user or client application queries an integrated global view G. This integrated view is given by an integrated view definition (IVD), i.e., a query expression in a database query language (e.g., SQL or Datalog for relational databases, or XQuery for XML databases).5 Here, we consider the case that the integrated global view G is defined in terms of the local views V1, . . . , Vk exported by the sources. This approach is called global-as-view (GAV).6 For example, our geologic-map integration

5In complex multiple-worlds scenarios, coming up with a suitable IVD is a major problem, as one usually needs domain knowledge from "glue ontologies" to link between the sources.

6Sometimes it can be beneficial to provide the IVD in a local-as-view (LAV) manner, where the local source information is defined in terms of the global schema [Hal01]. Even mixed GAV/LAV and more general approaches exist, but are beyond the scope of this paper; see [Koc01, Len02, DT03].

[Figure 6 shows the mediator architecture: a USER/Client issues (1) a user query Q against the integrated (global) view G(V1, . . . , Vk) held by the MEDIATOR. Using the integrated view definition G(..) ← V1(..) . . . Vk(..), the mediator performs (2) query rewriting, sends (3) subqueries Q1, Q2, . . . , Qk to wrappers over the sources S1, S2, . . . , Sk with local views V1, V2, . . . , Vk, receives (4) the local results ans(Q1), ans(Q2), . . . , ans(Qk), performs (5) post-processing of the local results ans(Qi), and returns (6) the integrated result ans(Q).]

Fig. 6: Database mediator architecture and query evaluation phases (1)–(6).

prototype (Section 3.1) integrates nine local source views into a single integrated global view (Figure 3). The latter is defined by query expressions like the following:

G("Arizona", AZ.Aid, Formation, Age, . . .) ←
    Arizona(Aid, . . . , ABBREV, . . . , PERIOD, . . .),
    Formation = ABBREV, Age = PERIOD.

G("Nevada", NV.Aid, Formation, Age, . . .) ←
    Nevada(Aid, . . . , FMATN, . . . , TIME_UNIT, . . .),
    Formation = FMATN, Age = TIME_UNIT.

The first rule states that the global view G is "filled" with information from the Arizona source by mapping the local ABBREV and PERIOD attributes to the global Formation and Age attributes, respectively. Here, spatial regions from the AREA column are identified via an Aid key attribute. To make the Aid attribute values unique across all sources, a source-specific prefix is used (AZ.Aid, NV.Aid, etc.) to rename any potentially conflicting values.
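GAV rules of this kind can be sketched in Python as column renamings followed by a union. The source rows and attribute values below are hypothetical; only the mapping structure (local column to global column, plus Aid prefixing) mirrors the rules above:

```python
# Sketch of GAV integration: each local source is mapped into the global
# schema (Aid, Formation, Age) by renaming its columns, then unioned.

arizona = [{"Aid": "17", "ABBREV": "Kv", "PERIOD": "Cretaceous"}]   # assumed rows
nevada  = [{"Aid": "17", "FMATN": "Tb", "TIME_UNIT": "Tertiary"}]   # assumed rows

# source prefix -> (rows, local Formation column, local Age column)
mappings = {
    "AZ": (arizona, "ABBREV", "PERIOD"),
    "NV": (nevada, "FMATN", "TIME_UNIT"),
}

def global_view():
    rows = []
    for prefix, (source, formation_col, age_col) in mappings.items():
        for r in source:
            rows.append({
                "Aid": f"{prefix}.{r['Aid']}",   # prefix keeps keys unique
                "Formation": r[formation_col],
                "Age": r[age_col],
            })
    return rows
```

Note that both sources use Aid value "17"; without the prefix the two regions would collide in the global view.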

Query Evaluation. Query processing in a database mediator system involves a number of steps (see Figure 6): Assuming a global view G has been defined (which is normally the task of a data integration expert), the user or an application programmer can define a query Q against the integrated view G (1). The database mediator takes Q and the integrated view definitions G(...) ← ...Vi... and rewrites them into a query plan with a number of subqueries Q1, . . . , Qk for the different sources (2). These subqueries Qi are then sent to the local sources, where they are evaluated (3). The local answers ans(Qi) are then sent

beyond the scope of this paper; see [Koc01, Len02, DT03].



back to the mediator (4), where they are further post-processed (5). Finally, the integrated result ans(Q) is returned to the user (6).
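The evaluation phases (1)–(6) can be sketched for the simplest case, where the global query is a selection that can be pushed down unchanged to every source. This is an assumption made for illustration; real mediators rewrite against the view definitions and must handle limited source capabilities.

```python
# Sketch of mediator query evaluation phases (1)-(6) from Figure 6,
# for simple selection queries. Illustrative only.

def mediate(selection, sources):
    # (2) rewrite the global query into one subquery per source
    #     (trivially here: the same selection for every source)
    subqueries = {name: selection for name in sources}
    # (3) send subqueries to the sources and (4) collect local answers
    local_answers = {name: [row for row in rows if subqueries[name](row)]
                     for name, rows in sources.items()}
    # (5) post-process (here: a simple union, tagged by source)
    # (6) return the integrated result ans(Q)
    return [dict(row, source=name)
            for name, rows in local_answers.items()
            for row in rows]

SOURCES = {
    "S1": [{"Age": "Devon"}, {"Age": "Jurassic"}],
    "S2": [{"Age": "Cambrium"}],
}
ans = mediate(lambda r: r["Age"] in {"Devon", "Cambrium"}, SOURCES)
```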

There are many technical challenges in developing database mediators. For example, the complexity of the query rewriting algorithm in step (2) depends on the expressiveness of the query languages for the user query Q and for the allowed source queries Qi. The problem of rewriting queries against sources with limited query capabilities is solved (or solvable) only for restricted languages; see, e.g., [VP00, Hal01, NL04b] for details.

4.2 Semantic Mediation

Consider again the geologic-map integration example from Section 3.1. Above we sketched how the structural integration problem can be solved by adopting a mediator approach (see Figures 3 and 6). However, the traditional mediator approach alone does not allow us to handle concept-based queries adequately. For example, in Figure 4, the conventional query for regions with Paleozoic rocks yields too few results. In contrast, after “ontology-enabling” the system with a geologic-age ontology, the user query can be rewritten such that it takes into account the domain knowledge from the ontology (see the right side of Figure 4).

The crux of semantic mediation is to augment the mediator approach with an explicit representation of domain knowledge in the form of one or more ontologies (also called domain maps in [LGM01, LGM03]). An ontology is often viewed as an explicit specification of a conceptualization [Gru93]. In particular, ontologies are used to capture some shared domain knowledge, such as the main concepts of a domain and important relationships between them. For example, the geologic-age ontology used in our OMI prototype can be viewed as a set of concepts (one for each geologic age) that are organized as a hierarchy, i.e., a tree in which children concepts (e.g., Devon, Cambrium) are considered to be a subset of (i.e., a restricted set of ages of) the age described by the parent concept (e.g., Paleozoic). An “ontology-enabled” mediator can use the information from the geologic-age ontology to retrieve not only data that directly matches the search concept Paleozoic, but also all data that matches any subconcept of Paleozoic, such as Devon or Cambrium. Formally, subconcepts are written using the ⊑ symbol, e.g., Devon ⊑ Paleozoic and Cambrium ⊑ Paleozoic in our example.
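The subconcept expansion described above can be sketched as follows: given explicit child-⊑-parent edges, a search concept is expanded to the set of all its (transitive) subconcepts before the query is evaluated. The tiny geologic-age fragment below is illustrative, not the actual OMI ontology.

```python
# Sketch: "ontology-enabling" a query by expanding a search concept
# to all of its transitive subconcepts.

SUBCONCEPT = {            # child "is-subconcept-of" parent edges
    "Paleozoic": "Phanerozoic",
    "Devon": "Paleozoic",
    "Cambrium": "Paleozoic",
    "Jurassic": "Mesozoic",
}

def expand(concept):
    """Return concept together with all of its transitive subconcepts."""
    result = {concept}
    changed = True
    while changed:               # fixpoint over the child-parent edges
        changed = False
        for child, parent in SUBCONCEPT.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

# A concept-based query for Paleozoic now also matches Devon and Cambrium:
paleozoic_terms = expand("Paleozoic")
```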

Concept-based Querying Revisited. A small fragment of a more complex concept hierarchy, taken from [SQDO02], is depicted in Figure 7. This “Canadian system” defines several (preliminary) taxonomies (classification hierarchies) of rock types, i.e., for composition (Figure 7 shows a small fragment dealing with Silicate rock types); texture (e.g., Crystalline vs. Granular); fabric (e.g., Planar types such as Gneiss vs. Nonplanar ones such as Hornfels); and genesis (e.g., Igneous vs. Sedimentary rock). In Figure 7, subconcepts w.r.t. composition are displayed as children to the right of the parent concept. For example, a Virginite is a special Listwanite; moreover, we learn that Listwanite rocks belong to the Ultramafic kind of Silicate rocks. Using this rock-type ontology, a number of new semantic queries can be formulated and executed with the prototype. Here a semantic query means a query that is formulated in terms of the concepts in the ontologies. Specifically, the lithology information provided by two of the nine state geologic maps (Idaho and Montana West) can be linked to concepts in the composition, fabric, and texture hierarchies, as sketched on the right in Figure 8. On the left of Figure 8, the result of a semantic query for Sedimentary rocks is displayed. Similarly, queries for composition (e.g., Silicate), fabric (e.g., Planar), and texture (e.g., Crystalline), or any combination7

thereof can be executed.

To enable semantic mediation, the standard architecture in Figure 6 has to be extended to include some form of expert knowledge for linking between otherwise hard-to-relate sources. An important prerequisite is semantic data registration, i.e., the association of data objects in the sources with concepts defined in a previously registered ontology. Before going into the technical details of data registration, we first consider informally some ontology variants and alternatives for knowledge representation, ranging from “napkin drawings” to formal description-logic ontologies expressed in OWL [OWL03].

Ontology Variants and Alternatives. One of the reasons to use ontologies in scientific data integration is to capture some shared understanding of concepts of interest to a scientific community. These concepts can then be used as a semantic reference system for annotating data, making it easier to find datasets of interest and facilitating their reuse.

Alternatively, conventional metadata consists of attribute/value pairs holding information about the dataset being annotated (e.g., the creator, creation date, owner, etc.). A metadata standard prescribes the set of attributes that must (or can) be used in the metadata description. Metadata standards are often defined as XML Schemas [XML01]. A sophisticated example is the Ecological Metadata Language [NCE], a community standard developed by ecologists that addresses several of the heterogeneity challenges discussed above. An EML description of a dataset can provide information on how to parse the data file (syntax), what schema to use for queries against it (structure), and even indicate some semantic information, e.g., on unit types and measurements.

7Domain experts will probably only ask for meaningful combinations. Data mining techniques can be applied to this rock-classification ontology to extract only those composition, fabric, and texture combinations that are consistent with the ontology. These are of course only an approximation of the combinations that actually exist.

Fig. 7: Rock type classification hierarchy (fragment) based on composition [SQDO02].

Controlled vocabularies are often part of a metadata standard and are used to constrain the values of particular attributes to come from a fixed, agreed-upon set. Thus, instead of allowing, e.g., arbitrary rock names in a geologic map shapefile, it is preferable to only use those from a controlled vocabulary. In this way, searches can be guided to only use these terms. Since the terms and definitions are (ideally) agreed upon by the community, they can also become the starting point of a more formal ontology.

Some controlled vocabularies provide relationships between the allowed terms. Similar to the relationships “narrower term”/“broader term” in a thesaurus, the concepts may be organized into a hierarchy or taxonomy (for the purpose of classification), where a child concept C is linked to a parent D if C and D stand in a specific relationship, such as C is-a D or C part-of D. For example, the geologic-age ontology can be seen as a taxonomy with the child-parent relation “is-temporally-part-of”; the chemical composition ontology (Figure 7) is a taxonomy with the child-parent relation “has-composition-like”.

Fig. 8: Different ontology-enabled views on the same geologic maps: the Canadian system [SQDO02] supports queries along several hierarchies (genesis, composition, fabric, texture); the British system [GSR+99] provides a single hierarchy (a separate geologic-age ontology is used in both views). Via an ontology articulation, data registered to one ontology can be retrieved through the conceptual view of the other ontology.

Finally, (formal) ontologies not only fix a set of concept names, but also define properties of concepts and their relationships. Description logics [BCM+03] are decidable fragments of first-order predicate logic, and are commonly used for specifying formal ontologies. Whereas taxonomies often result from explicit statements of is-a relationships, the concept hierarchy in a description-logic ontology is implicit in the axiomatic concept definitions. In description logics one defines concepts via their properties and relationships; the concept hierarchy is then a consequence of those definitions and can be made explicit by a description-logic reasoning system. A description-logic ontology consists of logic formulas (axioms) that constrain the interpretation of concepts by interrelating them, e.g., via binary relationships called roles. For example, the description-logic axiom

Virginite ⊑ Rock ⊓ Listwanite ⊓ ∃foundAt.Virginia

states that (i) instances of Virginite are instances of Rock, (ii) they are also instances of Listwanite, and (iii) they are found at some place in Virginia. It is not stated that the converse is true, i.e., that every Listwanite rock found in Virginia must be a Virginite (this could be stated by using “≡” instead of “⊑”). Here, the lowercase symbol foundAt denotes a role (standing for a binary relationship, in this case between rock types and locations); all other, uppercase symbols denote concepts, each one denoting a set of concept instances, e.g., all Listwanites.

Description logics and reasoning systems for checking concept subsumption, consistency, etc. have been studied extensively over many years [BCM+03]. However, until recently there was no widely accepted syntax for description-logic ontologies. With the increased interest in using description-logic ontologies, e.g., for Semantic Web applications [BLHL01], the need for a standard web ontology language has led

to the W3C OWL standard [OWL03].8 OWL, being an XML standard, also supports namespaces, a URI-based reference system. In OWL, namespaces are used to help express inter-ontology articulations, i.e., formulas that relate concepts and relationships from different OWL ontologies.

8OWL can be used not only for description logics (OWL-DL) but also for a simpler fragment (OWL-lite) or a more expressive version (OWL-full). In the OMI prototype, we have used simple concept hierarchies expressible in OWL-lite.

Fig. 9: Informal concept diagram (“napkin drawing”), relating raw data (red ovals), derived data (blue diamonds), models (yellow squares), and scientific problems (green ovals) in aqueous geochemistry.

Before defining a formal ontology for use in scientific data integration systems, a community-based effort may go through several intermediate steps, from informal to more formal representations. As a rule of thumb, the more formal the knowledge representation, the easier it is for a system to make good use of it, but also the harder it usually is to develop such a formal representation. A common starting point is an informal concept diagram or “napkin drawing” initially created by members of a community to give an overview of important items or concepts in the domain. Figure 9 shows such an informal concept diagram that resulted from a workshop with aqueous geochemistry experts. The diagram relates specific raw and derived data products, models, and scientific problems to one another. While such a diagram is useful as a communication means between domain experts, a data integration system cannot directly make use of it. One possible elaboration is the definition of metadata standards for data exchange between different community tools and applications, addressing syntactic and structural issues in data integration. Another possible subsequent step is the definition of one or more concept hierarchies (taxonomies) like in the Canadian system (Figure 7), thus enabling simple semantic mediation. A slightly more general form, labeled directed graphs, uses nodes for concepts and edge labels to denote arbitrary relationships between concepts. This model can be used by a data integration system to find concepts and relationships of interest via generalized path expressions (e.g., see [LHL+98]).
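A minimal sketch of such a labeled, directed concept graph and a crude stand-in for a generalized path expression (here: reachability via a given set of edge labels) is shown below. The nodes and labels are invented, loosely in the spirit of Figure 9, and are not from the actual workshop diagram.

```python
# Sketch: a labeled, directed concept graph queried by following only
# edges whose labels come from a given set. Illustrative only.

EDGES = [  # (from, label, to)
    ("RawGravityData", "input-to", "GravityModel"),
    ("GravityModel", "addresses", "CrustalStructure"),
    ("WaterSample", "input-to", "GeochemModel"),
]

def reachable(start, labels):
    """Nodes reachable from start via edges whose label is in labels."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for src, lab, dst in EDGES:
            if src == node and lab in labels and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen
```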

Finally, logic-based formalisms such as OWL ontologies can be used not only to declare concept names and their relationships, but also to intensionally define new concepts relative to existing ones, and to let a reasoning system establish the logical consistency of the system and the concept hierarchy.

4.3 Semantic Data Registration

In this section, we take a closer look at the technical details of semantic mediation; in particular, we present an approach called semantic data registration that can facilitate data integration in such a system. Figure 10 shows an overview of the architecture of our proposed framework. The main components include services for reasoning with ontologies, database mediation, registration, and data access. We assume that a federation registry (not shown) stores core registry information, including the database schemas of the underlying data sources, service descriptions, ontologies and ontology articulations, external semantic constraints (e.g., unit conversion rules and other scientific formulas), and registration mapping rules. We begin by formalizing the components of our framework and then discuss resource registration in more detail in the next section.

Fig. 10: Overview of the registration architecture.

First-Order Logic. We use first-order logic as a standard, underlying formalism. The syntax of first-order logic is defined as usual, i.e., we consider signatures Σ with predicate symbols ΣP and function symbols ΣF. By ΣP.n (ΣF.n) we denote the subsets of n-ary predicate (function) symbols; ΣC = ΣF.0 are constants. Semantics: A first-order structure I interprets predicate and function symbols as relations and functions, respectively; constants are interpreted as domain elements. Given I and a set of formulas Φ over Σ, we say that I is a model of Φ, denoted I |= Φ, if I |= ϕ for all ϕ ∈ Φ, i.e., all formulas in Φ are satisfied by I. We can implement constraint checking by evaluating the query {x | I |= ϕ(x)}.
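Over a finite structure, the query {x | I |= ϕ(x)} can be evaluated by plain enumeration of the domain. The sketch below (with an invented two-element interpretation) uses ϕ to describe constraint violations, so the query result is exactly the set of violating elements.

```python
# Sketch: constraint checking by query evaluation over a finite
# first-order structure I. We compute {x | I |= phi(x)} for a unary
# formula phi. Domain and predicate interpretation are illustrative.

DOMAIN = {"a", "b", "c"}
HAS_AGE = {("a", "Devon"), ("b", "Cambrium")}  # interpretation of hasAge

def satisfying_set(domain, phi):
    """Evaluate the query {x | I |= phi(x)} by enumerating the domain."""
    return {x for x in domain if phi(x)}

# Constraint: every element must have a recorded age.
# phi(x) holds exactly when x violates the constraint:
missing_age = satisfying_set(
    DOMAIN, lambda x: not any(e == x for e, _ in HAS_AGE))
```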

Ontologies (Revisited). Given the above preliminaries, we can now consider an ontology as a certain set of first-order formulas: An ontology O is a set of logic axioms ΦO over a signature Σ = C ∪ R ∪ I comprising unary predicates C ⊆ ΣP.1 (concepts), binary predicates R ⊆ ΣP.2 (roles, properties), and constants I ⊆ ΣF.0 (individuals). ΦO is usually from a decidable first-order fragment, most notably description logics [BCM+03]. A structure I is called a model of an ontology ΦO if I |= ΦO.

We can view controlled vocabularies and metadata specifications as limited, special cases of ontologies. A controlled vocabulary can be viewed, e.g., as a fixed set I ⊆ ΣF.0 of individuals (constants); a set of named concepts C; or even a full ontology signature Σ (if it contains relationships between terms of the controlled vocabulary). In either case, there are no axioms and hence no defined logical semantics. A metadata specification can be seen as an instance of an ontology having only binary predicates R denoting the metadata properties (e.g., title, author, date; see Dublin Core). Again, the absence of axioms means that no logical semantics is defined.

Namespaces. In the federation registry, we avoid name clashes between vocabularies from different scientific resources (datasets, services, etc.) by assuming each resource has a globally unique identifier i (e.g., implemented as a URI). We then rename symbols accordingly: Every symbol in Σi is prefixed with its resource-id i to obtain a unique vocabulary Σ′i := {i.s | s ∈ Σi}, allowing new resources to join the federation without introducing identifier conflicts. For example, in the view definitions in Section 4.1, we disambiguated object identifiers by using a state prefix as a resource-id: AZ.Aid, NV.Aid, etc. A resource-id is also commonly referred to as a namespace. Below, by id(s) we denote the globally unique resource identifier of a symbol s.
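The renaming Σ′i = {i.s | s ∈ Σi} translates directly into a one-line set comprehension; the symbols below are taken from the Arizona example above.

```python
# Sketch: qualifying a resource's vocabulary with its resource-id,
# i.e., Sigma'_i = {i.s | s in Sigma_i}.

def qualify(resource_id, symbols):
    return {f"{resource_id}.{s}" for s in symbols}

az_symbols = qualify("AZ", {"Aid", "ABBREV", "PERIOD"})
```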

Registering Ontologies and Articulations. An ontology O is registered by storing its logic axioms ΦO and its signature ΣO in the federation registry.9

An articulation ontology A links two ontologies Oi and Oj and is given as a set of axioms ΦA over ΣA = ΣOi ∪ ΣOj, thereby logically specifying inter-ontology correspondences. For example, i.C ≡ j.(D ⊓ ∃R.E) is an articulation axiom ϕ ∈ ΦA and states that the concept C in Oi is equivalent—in terms of Oj—to those D having at least one R-related E. This is expressed equivalently as follows (using first-order logic syntax):

∀x : i.C(x) ↔ j.D(x) ∧ ∃y : j.R(x, y) ∧ j.E(y)    (ϕ)

Note that ϕ is an intensional definition: we have not said how we can access instance objects (implicitly referred to via variables x and y), i.e., how to populate C, D, etc., as classes of objects. Finally, expressing inter-ontology articulations as ontologies achieves closure within the framework: There is no need to manage a new type of artifact, and we can reuse the given storage, querying, and reasoning techniques.

9We use OWL as the concrete syntax for ontologies.

Structural Data Registration. When registering a database, schema-level information and query capabilities should be included to facilitate queries by the end user or a mediator system. Specifically, the database registration information contains:
• The database schema ΣD. In the case of a relational database D, we have ΣD = {V1, . . . , Vn}, where each Vi is the schema of an exported relational view (see Figure 6).
• A set of local integrity constraints ΦD. We can distinguish different types of constraints, e.g., structural constraints (such as foreign-key constraints) and semantic constraints.
• A query capability specification ΠD. For example, ΠD may be a set of access patterns [NL04a], prescribing the input/output constraints for each exported relation. More generally, ΠD may be given as a set of view definitions (possibly with access patterns) supported by the source D. If D provides full SQL capabilities, a reserved word can be used: ΠD = SQL.
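The three parts of a registration entry (ΣD, ΦD, ΠD) could be bundled as a simple record; the field names below are assumptions for illustration, not the actual registry schema.

```python
# Sketch: a database registration entry in the federation registry,
# holding schema, constraints, and query capabilities. Hypothetical
# field names; not the actual registry format.

from dataclasses import dataclass, field

@dataclass
class DatabaseRegistration:
    resource_id: str                    # globally unique id (namespace)
    schema: list                        # Sigma_D: exported views V1..Vn
    constraints: list = field(default_factory=list)  # Phi_D
    capability: str = "SQL"             # Pi_D: "SQL" or access patterns

reg = DatabaseRegistration(
    resource_id="AZ",
    schema=[("Arizona", ["Aid", "AREA", "ABBREV", "PERIOD"])],
    constraints=["Aid is a key"],
)
```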

To register the structural definition of the data source, a data-access handler or wrapper (shown in both Figures 6 and 10) must also be provided. The wrapper provides basic services for executing underlying queries and converting data to a common format for use by the registration and mediation services.

Semantic Data Registration. A semantic data registration registers the association between data objects in a database D and a target ontology O. Let k = id(Dk) and j = id(Oj) be the unique resource identifiers of Dk and Oj, respectively.

The semantic data registration of Dk to Oj is given by a set of constraints Ψkj, where each ψ ∈ Ψkj is a constraint formula over ΣD ∪ ΣO. For example, the semantic data registration formula ψ =

∀x∀y : j.D(x) ∧ j.R(x, y) ← ∃z : k.P(x, y, z) ∧ k.Q(y, z)

is a constraint saying that ontology O's concept D and its role R can be “populated” using certain tuples P(x, y, z) from D. When the semantic data registration constraint ψ and the above articulation ϕ are combined into ψ ∧ ϕ, we see that data objects from Dk can be linked to concepts like i.C(x) in the ontology Oi, despite the fact that Dk was registered to Oj and not to Oi. The reason is that an indirect link exists to Oi via the articulation axiom ϕ:10

Dk ⇝ψ Oj ←→ϕ Oi

For a concrete example, assume the database Dk represents a geologic map, and the ontologies Oj and Oi represent the Canadian rock classification system [SQDO02] and the British one [GSR+99], respectively. We can register Dk to the Canadian system

10The actual links via ϕ are given by the valuations that make the formula true.

(encoded in OWL) using mapping rules corresponding to formulas like ψ above. This allows a semantic mediation system like OMI to provide concept-based query capabilities: On the left in Figure 8, the resulting query interface with fields for composition, texture, and fabric is shown.

An important application of articulation axioms like ϕ relating Oj and Oi is that they can be used to query and view the data from Dk through the conceptual view provided by Oi. In our geologic map example, we can query the geologic maps using the British rock classification view (a single, large hierarchy; see Figure 8, right), despite the fact that the geologic map database was originally registered to the Canadian system. Figure 8 shows the query interfaces and results of querying the same geologic map databases using different ontology views. Note that the highlighted differences in the results shown in the figure could have different causes. For example, they might result from asking slightly different questions, from a different relative completeness of one ontology over the other, or from the use of only partial mapping information in the ontology articulation.11
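The indirect link Dk ⇝ψ Oj ←→ϕ Oi can be sketched operationally: a registration mapping (for ψ) classifies source tuples under Oj concepts, and an articulation table (a deliberately simplified, one-to-one stand-in for axioms like ϕ) re-labels them in Oi's vocabulary. Concept names loosely follow the rock-type example; the code is illustrative, not the OMI implementation.

```python
# Sketch: chaining a registration mapping (psi) with an articulation
# (phi) to view D_k's data through ontology O_i. Illustrative only.

ROWS = [{"name": "sample1", "rock": "Virginite"},
        {"name": "sample2", "rock": "Granite"}]

def psi(row):
    """Registration of D_k to the 'Canadian' ontology O_j (hypothetical)."""
    if row["rock"] == "Virginite":
        return ("j.Listwanite", row["name"])
    return None

# phi, simplified to a concept-to-concept table:
ARTICULATION = {"j.Listwanite": "i.UltramaficRock"}

def instances_in_Oi(rows):
    """Data objects of D_k, re-labeled with O_i concepts via phi."""
    out = set()
    for row in rows:
        reg = psi(row)
        if reg and reg[0] in ARTICULATION:
            out.add((ARTICULATION[reg[0]], reg[1]))
    return out
```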

Datasets as Partial Models

This section further defines the steps involved in semantic data registration. In particular, we clarify the result of semantically registering a dataset as a partial model and motivate the need for additional data-object identification steps (see Figure 11).

Registering Partial Models. A dataset D that is registered to an ontology O contributes to the extent of the federation relative to O. Registered datasets are not materialized according to the ontology; instead, the registration mappings are used to access the underlying sources when needed, similar to the way queries against an integrated view are evaluated on demand (see Figure 6). A dataset D can be registered consistently to an ontology O if it can be interpreted as a partial model I of O, denoted I |=p ΨO, which implies I ∪ I′ |= ΨO for some unknown I′.

A partial model differs from a true model of the ontology in that some required information may be missing. We denote by ID the interpretation induced by applying a semantic data registration ΨD of database D to an ontology O. If the latter is a partial model, ID |=p ΨO, then the model ID ∪ I′D |= ΨO contains an unknown or hidden part I′D. As more sources are registered, more of I′D may become known.

11In our OMI prototype, we only used a rough approximation of the data registration and articulation mappings. The mappings were based on (partial) syntactic matches and did not include a systematic review by a domain scientist.

Fig. 11: Result of semantic data registration.

When an interpretation induced by a semantic data registration is not a partial model of an ontology, we say that the interpretation is inconsistent. An inconsistent interpretation often violates a datatype, cardinality, or disjointness constraint in the ontology. When possible, we wish to automatically verify that a semantic data registration is consistent, e.g., by ensuring that the dataset satisfies the axioms of the ontology, or can be extended to do so.
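One such check, verifying a disjointness axiom, can be sketched directly: if a registration asserts the same individual under two concepts declared disjoint, the induced interpretation cannot be (extended to) a model. Concept names and data are invented for illustration.

```python
# Sketch: detecting a disjointness violation in a registered dataset,
# e.g., for an axiom stating that Igneous and Sedimentary are disjoint.
# Illustrative only.

IGNEOUS = {"g1", "g2"}
SEDIMENTARY = {"g2", "g3"}   # g2 asserted under both -- a conflict

def disjointness_violations(concept_a, concept_b):
    """Individuals asserted to belong to two disjoint concepts."""
    return concept_a & concept_b

conflicts = disjointness_violations(IGNEOUS, SEDIMENTARY)
# A non-empty result means the induced interpretation is inconsistent.
```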

Identification Declaration. Semantic data registration allows a dataset to be interpreted as a partial model of an ontology, but does not necessarily provide enough information to identify the same domain elements (individuals) of the integration space across multiple datasets, which is essential for data integration. The ability to identify equivalent data objects across different datasets is needed in practice because each dataset may only provide a portion of the information concerning a particular object. As shown in Figure 11, we consider an additional registration step called identification declaration that allows data providers to state how data objects should be identified across data sources.

Such an object identity can be defined in a number of ways. First, a semantic data registration can be augmented with mapping tables, which map individual data items to recognized individuals in I, i.e., individuals that are established instances within an ontology and come from an authoritative registry.

For example, given an ontology of rock types, a mapping table can associate rock names in a geologic dataset with the unique rock types from the ontology. Second, external rules may be used for determining identity, similar to keys in a relational database.12 For example, we may have a rule that ISBN or DOI codes uniquely identify publications; thus, registering to such a code uniquely identifies the data object. Finally, a data provider may give data-object correspondences between registered datasets. Thus, a data object is explicitly given as equivalent to another data object (although the specific identifiers of the objects may not be authoritative).

Data Integration via Semantic Registration

We identify four classes of semantic data registration expressiveness (in terms of data integration) as follows.
• Concept-as-keyword registration. We can consider metadata annotations using keywords from a controlled vocabulary as (a weak form of) registration mappings. For example, we can assign a concept such as geologic-age to the dataset as a whole.13 Such a mapping states that the dataset contains data objects, and that those data objects refer to individual geological ages. However, we cannot consider or obtain each such separate (geologic-age) data object in the dataset. Clearly, such a registration cannot be used for integration; however, it can be used for dataset discovery: We do not have access to the individual objects, so the best we can do is find the dataset that contains such objects.
• Local data-object identification. Local data-object identification is the typical result of a registration mapping, where local identifiers are used to identify logical data items within a dataset. In this case the identities of the individuals are local to the source and thus cannot be used to combine data objects from multiple sources.
• Global data-object identification. The result of globally identifying data objects is that it becomes possible for a mediator to recognize identical individuals in multiple datasets. The result of global data-object identification is the ability to perform object fusion [PAGM96] at the mediator.
• Property identification. If within a given dataset we relate two globally identified data objects with an ontological relation in R, it becomes possible to join information across datasets (assuming at least one data object occurs in at least one other relation in another source). This situation represents a stronger form of integration compared to simple object fusion, and is required for wide-scale data integration.

12Some description logics also support keys [LAHS03].

13E.g., by registering the dataset's resource-id with the concept, or registering all rows of the dataset with the concept.
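Object fusion under global data-object identification can be sketched as merging partial property sets keyed by a global identifier; identifiers and properties below are invented for illustration.

```python
# Sketch: object fusion at the mediator once data objects carry global
# identifiers -- partial descriptions of the same individual from
# different datasets are merged. Illustrative only.

def fuse(*datasets):
    fused = {}
    for dataset in datasets:
        for obj in dataset:
            gid = obj["gid"]                       # global identifier
            fused.setdefault(gid, {}).update(obj)  # merge properties
    return fused

d1 = [{"gid": "rock:42", "composition": "Silicate"}]
d2 = [{"gid": "rock:42", "texture": "Crystalline"}]
objects = fuse(d1, d2)
```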

5 Scientific Workflows

In this section we return to the final integration challenge, synthesis (see Section 2), and illustrate how scientific workflows can be used to create new tools and applications from existing ones.

Scientific workflows are typically used as “data analysis pipelines” or for comparing observed and predicted data, and can include a wide range of components, e.g., for querying databases, for data transformation and data mining steps, for execution of simulation codes on high-performance computers, etc. Ideally, a scientist should be able to (1) plug almost any scientific data resource and computational service into a scientific workflow, (2) inspect and visualize data on-the-fly as it is computed, (3) make parameter changes when necessary and re-run only the affected “downstream” components, and (4) capture sufficient metadata in the final products. For each run of a scientific workflow, when considered as a computational experiment, the metadata produced should be comprehensive enough to help explain the results of the run and make the results reproducible by the scientist and others. Thus, a scientific workflow system becomes a scientific problem-solving environment, tuned to an increasingly distributed and service-oriented Grid infrastructure.

However, before this vision can become reality, a number of technical problems have to be solved. For example, current Grid software is still too complex to use for the average scientist, and fast-changing versions and evolving standards require that these details be hidden from the user by the scientific workflow system. Web services provide a simple basis for loosely coupled, distributed systems, but core web-service standards such as WSDL [WSD03] only provide simple solutions to simple problems,14 while harder problems such as web-service orchestration remain the subject of emerging or future standards. As part of an open source activity, members from various application-oriented research projects [GEO, SEE, SDM, BIR, ROA] are developing the Kepler scientific workflow system [KEP, LAB+04], which aims at developing generic solutions to the process- and application-integration challenges of scientific workflows.

14E.g., WSDL mainly provides an XML notation for function signatures, i.e., the types of inputs and outputs of web services.

5.1 Scientific Workflows in Kepler

In Section 3.2 we have already presented a Ke-pler workflow for mineral classification (Figure 5).Kepler extends the underlying Ptolemy ii system[BCD+02] by introducing a number of new compo-nents called actors, e.g., for querying databases, fordata transformations via XSLT, for executing localapplications from a command line, web services viatheir WSDL interface, or remote jobs on the Grid,etc. In Kepler, the user designs a scientific workflowby selecting appropriate actors from the actor library(or by dynamically “harvesting” new ones via theKepler web service harvester) and putting them onthe design canvas, after which they can be “wired” toform the desired larger workflow. For workflow com-ponents that are not yet implemented (i.e., neither asa native actor nor as a web service or command-linetool), a special design actor can be used. Like regu-lar actors, the design actor has input ports and out-put ports that provide the communication interface toother actors. The number, names, and data types ofthe ports of the design actor can be easily changed toreflect the intended use of the actor. When designinga workflow, the user connects actors via their ports tocreate the desired overall dataflow.15 A unique fea-ture of Ptolemy ii and thus of Kepler is that theoverall execution and component interaction seman-tics of a workflow is not buried inside the componentsthemselves, but rather factored out into a separatecomponent called a director. 
For example, the PN (Process Network) director used in the workflow in Figure 5 (green box) models a workflow as a process network [KM77, LP95] of concurrent processes that communicate through unidirectional channels. The SDF (Synchronous Data-Flow) director in Figure 12 is a special case of the PN director that can be used when all actors statically declare the number of tokens they consume and produce per invocation (called an actor "firing").16 The SDF director uses this information to statically analyze the workflow, e.g., to detect deadlocks or to determine the required buffer sizes between connected actors.
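To make the static SDF analysis concrete, the "balance equations" behind such scheduling can be sketched as follows. The actor names and token rates below are hypothetical, and the sketch only handles linear chains of actors; a real SDF scheduler, such as the one in Ptolemy II, handles arbitrary graphs and also detects inconsistent rate declarations and deadlocks.

```python
from fractions import Fraction
from math import gcd, lcm

def sdf_repetitions(edges):
    """Solve the SDF balance equations for a chain of actors.

    edges: list of (producer, tokens_out, consumer, tokens_in).
    Returns the smallest integer firing counts so that, per schedule
    iteration, the tokens produced on every channel equal the tokens
    consumed (so buffers neither grow without bound nor underflow).
    """
    rates = {}  # actor -> relative firing rate (as a Fraction)
    for prod, out_rate, cons, in_rate in edges:
        rates.setdefault(prod, Fraction(1))
        # balance equation: rates[prod] * out_rate == rates[cons] * in_rate
        rates[cons] = rates[prod] * Fraction(out_rate, in_rate)
    # scale the rational rates to the smallest integer firing counts
    denom = lcm(*(r.denominator for r in rates.values()))
    counts = {a: int(r * denom) for a, r in rates.items()}
    g = gcd(*counts.values())
    return {a: c // g for a, c in counts.items()}

# Hypothetical rates: a grid source emits 6 tokens per firing, the modeler
# consumes 2 and emits 1, and a sink consumes 3 of the modeler's tokens.
print(sdf_repetitions([("Source", 6, "Modeler", 2),
                       ("Modeler", 1, "Residual", 3)]))
# -> {'Source': 1, 'Modeler': 3, 'Residual': 1}
```

Given the firing counts, a scheduler can then bound the buffer needed on each channel by simulating one iteration of the schedule.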

5.2 Gravity Modeling Workflow

Figure 12 shows a gravity modeling design workflow. Unlike the mineral classification workflow discussed

15 Control-flow elements such as branching and loops are also supported; see [BLL+04b, LAB+04].

16 A token represents the unit of data flowing through a channel between actors. The workflow designer can choose to use, e.g., a single data value, a row from a database table, an XML element, or a whole table or file as a single token.

Page 16: Managing Scientific Data: From Data Integration to ...users.sdsc.edu/~ludaesch/Paper/gsa-sms.pdf · Managing Scientific Data: From Data Integration to Scientific Workflows∗

5 SCIENTIFIC WORKFLOWS 16

Fig. 12: Gravity Modeling Design Workflow: The main workflow (top) combines three different gravity sources into a single model: an observed model (expanded below) is compared to a synthetic model, which itself results from an integrated gravity model combining a Moho and a Sediment web service (top window, center). Outlined olive boxes are design actors. Unlike other actors, these are not (yet) implemented and ultimately need to be replaced by executable versions before the complete workflow can execute.

in Section 3.2, this workflow involves components that are not yet implemented. This feature allows the user of the Kepler system to move seamlessly from a conceptual design workflow to an executable version by replacing design components with implemented ones as they become available. Another benefit is that executable subworkflows (i.e., ones that do not include design actors) can be unit-tested and debugged early on, while other parts of the workflow are still in their design phase.

The workflow depicted in Figure 12 takes a spatial envelope token (given via latitude and longitude parameters in the lower-left corner of the main window) and feeds it into two independent web services that extract gridded Moho and sedimentary depths, respectively, for the enveloped region and feed them to a GravityModeler actor. The resulting synthetic gravity model is then compared to the observed model

and the result is fed into a ResidualMap web service actor that creates the result map, which is shown to the user via a standard browser. Note that a subworkflow is associated with the ObservedGravityModel actor, i.e., the latter is a composite actor. In Figure 12, the subworkflow of this composite actor is shown in the lower window. It involves two components: one to translate between coordinate systems (implemented) and one to access values from the UTEP gravity database (designed). While this data access is not yet implemented, the signature of the design actor reveals that it must take an envelope token (in degrees) and produce from the database a gravity grid of observed data.


5.3 Semantic Workflow Extensions

An important aspect of developing scientific workflows concerns the structural and semantic types of the actors and services they comprise. Here, by structural type we mean the conventional data type associated with the input and output ports of an actor. Kepler inherits from the Ptolemy II system a very flexible structural type system in which simple and complex types are organized into an extensible, polymorphic type structure. For example, the AddSubtract actor in Figure 12, which takes the synthetic data output from the GravityModeler and compares it to the ObservedGravityModel output, is polymorphic and can operate on integers, floats, vectors, and matrices. In the case of the gravity workflow, after connecting AddSubtract with the corresponding upstream and downstream actors, the system can determine that a matrix operation is needed. If actor ports are connected that have incompatible types, the system will report a type error.
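The kind of type resolution described above can be illustrated with a toy type lattice. The ordering below (int ≤ float ≤ matrix, with string unrelated) is a drastic simplification of Ptolemy II's actual type lattice, chosen only to show how a polymorphic actor's operating type can be inferred from its connections:

```python
# Hypothetical mini-lattice of structural types (illustration only).
SUPERTYPES = {"int": ["float"], "float": ["matrix"],
              "matrix": [], "string": []}

def ancestors(t):
    """The type itself plus all of its supertypes in the lattice."""
    seen, stack = {t}, [t]
    while stack:
        for s in SUPERTYPES[stack.pop()]:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return seen

def resolve(port_a, port_b):
    """Most specific common supertype of two connected ports,
    or a type error if none exists."""
    common = ancestors(port_a) & ancestors(port_b)
    if not common:
        raise TypeError(f"incompatible ports: {port_a} / {port_b}")
    # the most specific type is the one with the most supertypes above it
    return max(common, key=lambda t: len(ancestors(t)))

# An AddSubtract-like actor fed an int and a matrix must operate on matrices:
print(resolve("int", "matrix"))   # -> matrix
```

Connecting, say, a string port to a float port would raise a type error here, mirroring the static check Kepler performs on the workflow graph.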

While a structural type check can guarantee that actor connections have compatible data types, it provides no means to check whether the connections are even potentially meaningful. For example, a connection between an actor outputting a velocity vector and one that takes as input a force vector may be structurally valid, but not semantically. For physical units, a unit type system has been added to Ptolemy II to handle such situations [BLL+04b].

The idea behind semantic types in Kepler is to further extend the type system to allow the user to associate a formal concept expression with a port. Thus semantic types in scientific workflows are related to the idea of semantic data registration (Section 4.3), which associates a data object having a structural type with a concept from an ontology. Our semantic type system17 will detect, e.g., that a port of type geologic-age cannot be connected to one of type texture, although both ports might have the same structural type string. Similarly, vector(force) and vector(velocity) can be detected as semantically incompatible, although their structural type vector(float) coincides.
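A semantic connectibility check of this kind reduces to concept subsumption over the ontology. The following sketch uses a hypothetical ontology fragment with invented concept names (the real system checks subsumption over OWL concept expressions, typically via a description-logic reasoner):

```python
# Toy is-a hierarchy standing in for an OWL ontology (hypothetical names).
IS_A = {"GeologicAge": "Property",
        "Texture": "Property",
        "Force": "VectorQuantity",
        "Velocity": "VectorQuantity"}

def subsumed_by(sub, sup):
    """True iff concept `sub` equals `sup` or lies below it in the hierarchy."""
    while sub is not None:
        if sub == sup:
            return True
        sub = IS_A.get(sub)
    return False

def semantically_connectible(src_concept, dst_concept):
    """A source port may feed a target port iff its concept is
    subsumed by the target port's concept."""
    return subsumed_by(src_concept, dst_concept)

# Both ports may carry plain strings structurally, yet the connection
# geologic-age -> texture is rejected at the semantic level:
print(semantically_connectible("GeologicAge", "Texture"))   # False
# Likewise vector(force) vs. vector(velocity), despite identical
# structural type vector(float):
print(semantically_connectible("Force", "Velocity"))        # False
```

The same subsumption test supports concept-based discovery: searching an actor library for components whose output concept is subsumed by a desired input concept.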

There are several applications of a semantic type system in scientific workflows. As was the case for semantic mediation and data registration, some domain knowledge above the structural level of database schemas and conventional data types can be captured in this way. Thus semantic type information can be used for concept-based discovery of actors

17 The semantic type system is currently under development and being added to Kepler. More details can be found in [BL04].

and services from actor libraries and service repositories, similar to the concept-based data queries in Sections 3.1 and 4.2. Based on the semantic type signature of an actor, the system can also search for compatible upstream actors that can generate the desired data, or for downstream actors that consume the produced data. In this way, a workflow system with semantic types can support workflow design by only considering connections that are valid w.r.t. their semantic type.

Ontology-Driven Data Transformations. We now briefly illustrate a further application of semantic types, i.e., how they can be used to aid the construction of structural transformations in scientific workflows. Indeed, a common situation in scientific workflows is that independently developed services are structurally incompatible (e.g., they use differently structured complex types, in particular different XML schemas), although they might be semantically compatible. For example, we might know that the ports Ps (source) and Pt (target) of two actors produce and consume, respectively, a matrix of gravity values. Thus, these ports should be connectible in terms of their semantic type. However, the actors may invoke independently designed web services that have different XML representations of gravity matrices, making the desired connection structurally invalid. The conventional approach to resolving this structural heterogeneity is to manually create and insert a data transformation step that maps data from the XML schema used by Ps to data that conforms to the schema of Pt.
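A minimal hand-written version of such a transformation step might look as follows. The element names here are invented for illustration; in Kepler this role is often played by an XSLT transformation actor inserted between the two services:

```python
import xml.etree.ElementTree as ET

def transform(src_xml: str) -> str:
    """Hand-written shim mapping one hypothetical gravity-matrix
    schema (<matrix><row>...) onto another (<gravityGrid><r>...)."""
    src = ET.fromstring(src_xml)
    tgt = ET.Element("gravityGrid")
    for row in src.findall("row"):
        ET.SubElement(tgt, "r").text = row.text
    return ET.tostring(tgt, encoding="unicode")

print(transform("<matrix><row>1 2</row><row>3 4</row></matrix>"))
# -> <gravityGrid><r>1 2</r><r>3 4</r></gravityGrid>
```

Writing and maintaining such shims by hand for every pair of heterogeneous services is exactly the burden that the ontology-driven approach described next aims to reduce.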


Fig. 13: Ontology-driven generation of data transformations [BL04].

The idea of ontology-driven data transformations [BL04] is to employ the information provided by the semantic type of a port to guide the generation of a data transformation from the structural type of the source port to that of the target port. Note that


the semantic type alone, when considered in isolation from the structural type, cannot be used for this purpose. However, via a semantic data registration mapping (see Section 4.3), an association between structural type and semantic type is created that can be exploited for the generation of the desired transformation. Figure 13 gives an overview of the approach: The source port Ps and target port Pt are assumed to have incompatible structural types (denoted Ps ⋢ Pt) but compatible semantic types (denoted Ps ⊑ Pt). In other words, the semantic type of Ps is the same as or a subtype of that of Pt, but this is not the case for the structural types. The goal is thus to generate a data transformation δ that maps the structural type of Ps to one that is compatible with the structural type of Pt. We call a semantically valid connection structurally feasible if there exists such a data transformation δ with δ(Ps) ⊑ Pt. The core idea is that the information from the semantic registration mappings induces correspondences at the structural level that can help in the generation of the desired data transformation δ(Ps).

Correspondence Mappings. More precisely, let Ms be the source registration mapping that links the structural type and the semantic type of Ps; similarly, let Mt denote the target registration mapping (see Figure 13). Ms and Mt can be seen as constraint formulas Ψs and Ψt as described in Section 4.3, relating data schemas to expressions of the ontology to which the data is registered. Ms can be given, e.g., as a set of rules of the form qs ⇝ Es, where qs is a path expression or more general query that "marks" parts of interest in the data structure of Ps, and Es is a concept expression over the ontology O to which Ps is semantically registered. The correspondence mapping Mst := Ms ⋈O Mt is now given as the semantic join of Ms and Mt w.r.t. the ontology O: Given a rule qs ⇝ Es ∈ Ms and a rule qt ⇝ Et ∈ Mt, the rule qs ⇝ qt is in the correspondence mapping Mst if and only if the subsumption Es ⊑O Et holds, i.e., if the concept expression Es yields a semantic subtype of Et in the ontology O.

For example, assume a scientist wants to connect various ports of two actors that deal with geologic map information. Assume that the port Ps of the source actor produces XML tokens with the following structure:

<rinfo>
  <age>...</age>
  <ccomp>...</ccomp>
  <text>...</text>
  <fab>...</fab>
</rinfo>

and that the port Pt of the target actor consumes XML tokens that are structured as follows:

<properties>
  <lithology>...</lithology>
  <geoage>...</geoage>
</properties>

Clearly Ps ⋢ Pt, i.e., the structural types of Ps and Pt are incompatible. Now consider the source registration mapping Ms =

/rinfo/age   ⇝ O.geologic age
/rinfo/ccomp ⇝ O.lithology.composition
/rinfo/text  ⇝ O.lithology.texture
/rinfo/fab   ⇝ O.lithology.fabric

and the target registration mapping Mt =

/properties/lithology ⇝ O.lithology
/properties/geoage    ⇝ O.geologic age

where O is the ontology to which the structures are registered. Then the correspondence mapping Mst contains the rule

/rinfo/age ⇝ /properties/geoage

indicating how to transform the geologic-age subelement of Ps into the one in Pt. Now assume further that lithology.composition ⊑O lithology, i.e., that in the ontology, composition is considered a special kind of lithology information.18 Then Mst also includes the correspondence rule

/rinfo/ccomp ⇝ /properties/lithology

In this simple example, the correspondence mapping Mst may not provide detailed enough information to determine a complete transformation from Ps to Pt.19

In particular, we do not know in this example whether the various elements involved in the mappings are complex (i.e., contain nested elements). If so, further information would be required to automatically generate the transformation.
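The semantic join that produces Mst can be sketched directly from the example above. Concept names are written with underscores here, and the subsumption check is a drastic simplification of the ⊑O relation of [BL04] (a single is-a edge rather than reasoning over OWL concept expressions):

```python
# Ontology fragment: composition is a special kind of lithology information.
SUPER = {"lithology.composition": "lithology"}

def subsumes(e_t, e_s):
    """Does E_s ⊑_O E_t hold, i.e., is E_s the same as or below E_t?"""
    while e_s is not None:
        if e_s == e_t:
            return True
        e_s = SUPER.get(e_s)
    return False

def semantic_join(Ms, Mt):
    """Mst := Ms ⋈_O Mt — pair source and target paths whose registered
    concepts stand in the subsumption relation."""
    return [(qs, qt) for qs, Es in Ms for qt, Et in Mt if subsumes(Et, Es)]

Ms = [("/rinfo/age",   "geologic_age"),
      ("/rinfo/ccomp", "lithology.composition"),
      ("/rinfo/text",  "lithology.texture"),
      ("/rinfo/fab",   "lithology.fabric")]
Mt = [("/properties/lithology", "lithology"),
      ("/properties/geoage",    "geologic_age")]

for qs, qt in semantic_join(Ms, Mt):
    print(qs, "->", qt)
# /rinfo/age -> /properties/geoage
# /rinfo/ccomp -> /properties/lithology
```

Note that texture and fabric produce no correspondences here, since this toy ontology only declares composition as a subtype of lithology; with richer subsumptions, further rules would be derived by the same join.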

However, even if Mst does not provide enough information to automatically generate the data transformation δ : Ps → Pt (see Figure 13), the obtained correspondences are still valuable. For example, a workflow engineer who needs to develop customized data-transformation actors for a scientific workflow can use these correspondences as a "semantic" starting point for defining the needed transformation. Also, correspondences can be exploited by various database schema-matching tools [RB01] and used to infer additional structural mappings.

18 For a detailed description of the ⊑O relation see [BL04].

19 Note that if age, ccomp, lithology, and geoage are "simple" elements, i.e., only contain PCDATA, the transformation can be generated from the correspondences alone.


6 Conclusions

To answer specific scientific questions, a scientist often repeatedly performs similar tasks and data management steps. For example, scientists select appropriate analysis methods to address the given question, search for relevant data or create new data, determine whether existing data can be used for the desired analyses, pre-process and integrate data as needed, and so on. A number of significant data- and tool-integration challenges need to be overcome to enable more of the increasingly data-driven "e-science", and to allow the scientist to spend less time on labor-intensive, error-prone manual data management.

In this paper we have given an overview of the different kinds of data integration and interoperability challenges in scientific data management. After reviewing the traditional mediator approach to data integration, we have presented a knowledge-based extension called semantic mediation that facilitates (i) the linking of hard-to-relate data sources by "going through" shared ontologies, and (ii) new types of concept-based queries against the data. Semantic mediation requires a "deeper modeling" of data, which can be achieved by semantically registering existing data sources to one or more ontologies. We have presented some of the technical details of semantic data registration and illustrated semantic mediation using examples from the GEON project [GEO].

While semantic mediation addresses the problem of data integration, it does not provide a mechanism to integrate other applications and tools into data analysis "pipelines". For this kind of process integration problem, we have proposed the use of scientific workflow systems like Kepler to provide an open and extensible problem-solving environment for designing and executing workflows. Kepler is built on top of the Ptolemy II system and inherits from it many useful features, including an actor-oriented design methodology for creating more reusable workflow components, an extensible type system, and an intuitive graphical user interface. Kepler-specific extensions include database querying and data transformation actors, and actors for executing web services, command-line tools, and remote jobs on the Grid [LAB+04]. A significant remaining problem is the generation of the data transformations that are often necessary to connect two independently designed actors or web services. To this end, we have proposed an ontology-driven data transformation approach that exploits semantic data registration information to generate correspondence mappings, which in turn aid the generation of the desired data transformations.

References

[BCD+02] S. S. Bhattacharyya, E. Cheong, J. Davis II, M. Goel, C. Hylands, B. Kienhuis, E. A. Lee, J. Liu, X. Liu, L. Muliadi, S. Neuendorffer, J. Reekie, N. Smyth, J. Tsay, B. Vogel, W. Williams, Y. Xiong, and H. Zheng. Heterogeneous Concurrent Modeling and Design in Java. Technical Report Memorandum UCB/ERL M02/23, EECS, University of California, Berkeley, August 2002.

[BCM+03] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003.

[BFH03] F. Berman, G. Fox, and A. Hey, editors. Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, 2003.

[BIR] Biomedical Informatics Research Network Coordinating Center (BIRN-CC), University of California, San Diego. http://nbirn.net/.

[BL04] S. Bowers and B. Ludäscher. An ontology-driven framework for data transformation in scientific workflows. In Proc. of the 1st Intl. Workshop on Data Integration in the Life Sciences (DILS), volume 2994 of LNCS, pp. 1–16, 2004.

[BLHL01] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001.

[BLL04a] S. Bowers, K. Lin, and B. Ludäscher. On integrating scientific resources through semantic registration. In Proceedings of the 16th International Conference on Scientific and Statistical Databases (SSDBM), 2004.

[BLL+04b] C. Brooks, E. A. Lee, X. Liu, S. Neuendorffer, Y. Zhao, and H. Zheng. Heterogeneous Concurrent Modeling and Design in Java (Volumes 1–3). Technical report, Dept. of EECS, University of California, Berkeley, 2004. Technical Memoranda UCB/ERL M04/27, M04/16, M04/17.

[Che76] P. Chen. The Entity-Relationship Model: Towards a Unified View of Data. ACM Transactions on Database Systems (TODS), 1(1):9–36, 1976.

[DOS04] M. C. Daconta, L. J. Obrst, and K. T. Smith. Chapter 8: Understanding Ontologies. In The Semantic Web: A Guide to the Future of XML, Web Services and Knowledge Management, pp. 181–238. Wiley, 2004.

[DT03] A. Deutsch and V. Tannen. MARS: A System for Publishing XML from Mixed and


Redundant Storage. In Intl. Conference on Very Large Data Bases (VLDB), 2003.

[ESR98] ESRI. ESRI Shapefile Technical Description. http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf, 1998.

[FK99] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.

[Fos03] I. Foster. The Grid: Computing without Bounds. Scientific American, April 2003.

[GEO] NSF/ITR: GEON: A Research Project to Create Cyberinfrastructure for the Geosciences. www.geongrid.org.

[GMPQ+97] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, V. Vassalos, and J. Widom. The TSIMMIS Approach to Mediation: Data Models and Languages. Journal of Intelligent Information Systems, 8(2), 1997.

[Gru93] T. R. Gruber. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5:199–220, 1993.

[GSR+99] M. R. Gillespie, M. Styles, S. Robertson, C. R. Hallsworth, R. W. O. Knox, C. D. R. Evans, A. A. M. Irving, J. W. Merritt, A. N. Morigi, and K. J. Northmore. BGS Rock Classification Scheme, Volumes 1–4. Technical report, 1999. http://www.bgs.ac.uk/bgsrcs/home.html.

[HAC+89] W. B. Harland, R. Armstrong, A. Cox, C. Lorraine, A. Smith, and D. Smith. A Geologic Time Scale 1989. Cambridge University Press, 1989.

[Hal01] A. Halevy. Answering Queries Using Views: A Survey. VLDB Journal, 10(4):270–294, 2001.

[Joh02] B. R. Johnson. Geologic Map Unit Classification, ver. 6.1, Draft, 2002. USGS.

[KEP] Kepler: A System for Scientific Workflows. http://kepler-project.org.

[KM77] G. Kahn and D. B. MacQueen. Coroutines and Networks of Parallel Processes. In B. Gilchrist, editor, Proc. of the IFIP Congress 77, pp. 993–998, 1977.

[Koc01] C. Koch. Data Integration against Multiple Evolving Autonomous Schemata. PhD thesis, Technische Universität Wien, 2001.

[LAB+04] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao. Scientific Workflow Management and the Kepler System. Distributed and Parallel Systems, 2004. (submitted for publication).

[LAHS03] C. Lutz, C. Areces, I. Horrocks, and U. Sattler. Keys, Nominals, and Concrete Domains. In Proc. of the 18th Intl. Joint Conf. on Artificial Intelligence (IJCAI), 2003.

[Len02] M. Lenzerini. Data Integration: A Theoretical Perspective. Tutorial at PODS, 2002.

[Lev00] A. Levy. Logic-Based Techniques in Data Integration. In J. Minker, editor, Logic Based Artificial Intelligence, pp. 575–595. Kluwer, 2000.

[LGM01] B. Ludäscher, A. Gupta, and M. E. Martone. Model-Based Mediation with Domain Maps. In 17th Intl. Conf. on Data Engineering (ICDE), Heidelberg, Germany, 2001. IEEE Computer Society.

[LGM03] B. Ludäscher, A. Gupta, and M. E. Martone. A Model-Based Mediator System for Scientific Data Management. In Z. Lacroix and T. Critchlow, editors, Bioinformatics: Managing Scientific Data. Morgan Kaufmann, 2003.

[LHL+98] B. Ludäscher, R. Himmeröder, G. Lausen, W. May, and C. Schlepphorst. Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective. Information Systems, 23(8):589–613, 1998.

[LL03] K. Lin and B. Ludäscher. A System for Semantic Integration of Geologic Maps via Ontologies. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.

[LLB+03] K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.

[LLBB03] B. Ludäscher, K. Lin, B. Brodaric, and C. Baru. GEON: Toward a Cyberinfrastructure for the Geosciences – A Prototype for Geological Map Interoperability via Domain Ontologies. In Workshop on Digital Mapping Techniques. AASG and U.S. Geological Survey, June 2003.

[LP95] E. A. Lee and T. Parks. Dataflow Process Networks. Proceedings of the IEEE, 83(5):773–799, May 1995. http://citeseer.nj.nec.com/455847.html.

[NCE] NCEAS. Ecological Metadata Language (EML). http://knb.ecoinformatics.org/software/eml/.

[NL04a] A. Nash and B. Ludäscher. Processing First-Order Queries with Limited Access Patterns. In ACM Symposium on Principles of Database Systems (PODS), Paris, France, June 2004.


[NL04b] A. Nash and B. Ludäscher. Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns. In 9th Intl. Conf. on Extending Database Technology (EDBT), LNCS 2992, pp. 422–440, Heraklion, Crete, Greece, 2004.

[OWL03] OWL Web Ontology Language Reference, W3C Proposed Recommendation, December 2003. www.w3.org/TR/owl-ref/.

[PAGM96] Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object Fusion in Mediator Systems. In Proc. of the 22nd Intl. Conf. on Very Large Data Bases (VLDB), pp. 413–424, 1996.

[RB01] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4):334–350, 2001.

[ROA] ROADNet: Real-time Observatories, Applications and Data management Network. roadnet.ucsd.edu.

[SDM] Scientific Data Management Center (SDM). http://sdm.lbl.gov/sdmcenter/, see also http://www.npaci.edu/online/v5.17/scidac.html.

[SEE] NSF/ITR: Enabling the Science Environment for Ecological Knowledge (SEEK). seek.ecoinformatics.org.

[She98] A. Sheth. Changing Focus on Interoperability in Information Systems: From System, Syntax, Structure to Semantics. In M. Goodchild, M. Egenhofer, R. Fegeas, and C. Kottman, editors, Interoperating Geographic Information Systems, pp. 5–30. Kluwer, 1998.

[SQDO02] L. Struik, M. Quat, P. Davenport, and A. Okulitch. A Preliminary Scheme for Multihierarchical Rock Classification for use with Thematic Computer-based Query Systems. Technical Report 2002-D10, Geological Survey of Canada, 2002. http://www.nrcan.gc.ca/gsc/bookstore/free/cr_2002/D10.pdf.

[Usc98] M. Uschold. Knowledge level modelling: concepts and terminology. The Knowledge Engineering Review, 13(1):5–29, 1998.

[VP00] V. Vassalos and Y. Papakonstantinou. Expressive Capabilities Description Languages and Query Rewriting Algorithms. Journal of Logic Programming, 43(1):75–122, 2000.

[VSSV02] U. Visser, H. Stuckenschmidt, G. Schuster, and T. Vögele. Ontologies for Geographic Information Processing. Computers and Geosciences, 28(1):103–117, 2002.

[Wie92] G. Wiederhold. Mediators in the Architecture of Future Information Systems. IEEE Computer, 25(3):38–49, 1992.

[WSD03] Web Services Description Language (WSDL) Version 1.2. http://www.w3.org/TR/wsdl12, June 2003.

[XML01] XML Schema, W3C Recommendation. www.w3.org/TR/xmlschema-1,2, May 2001.