Bertram Ludäscher ludaesch@sdsc Data and Knowledge Systems San Diego Supercomputer Center


Page 1: Bertram Ludäscher ludaesch@sdsc Data and Knowledge Systems San Diego Supercomputer Center

GEON AHM, April 16-18, SDSC

CYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Towards Semantic Mediation for GEON:

Facilitating Scientific Data Integration using Knowledge Representation

Bertram Ludäscher
ludaesch@sdsc.edu

Data and Knowledge Systems
San Diego Supercomputer Center

U.C. San Diego

Page 2

Acknowledgements

“Smart” Geologic Map Prototype:
Kai [email protected]
Data and Knowledge Systems, San Diego Supercomputer Center

Geo-Knowledge-Engineer:
Boyan Brodaric
[email protected] Resources Canada

... and many GEONites: Dogan, Krishna, ..., State Geologic Surveys, Chaitan, Ilya, Michalis, Ashraf, ... (upcoming demo)

GEON Metamorphism Equation:
Geoscientists + Computer Scientists +/- Energy → Igneous Geoinformaticists

Page 3

GEON and “Semantic” Data Integration

Rocky Mountains

Midatlantic Region

Page 4

What is Knowledge Representation? Relating Theory to the World via Formal Models

Source: John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations

“All models are wrong, but some are useful!”

Page 5

What is (an) “Ontology” ??? (... what CS graduate students need to know ...)

1. Ontology as a philosophical discipline
2. Ontology as an informal conceptual system
3. Ontology as a formal semantic account
4. Ontology as a specification of a “conceptualization”
5. Ontology as a representation of a conceptual system via a logical theory
   5.1 characterized by specific formal properties
   5.2 characterized only by its specific purposes
6. Ontology as the vocabulary used by a logical theory
7. Ontology as a (meta-level) specification of a logical theory

[Guarino’95] http://ontology.ip.rm.cnr.it/Papers/KBKS95.pdf

Page 6

What is an Ontology? (CSE-291 cont’d ;-)

• Given a logical language L ...
– ... a conceptualization is a set of models of L which describes the admittable (intended) interpretations of its non-logical symbols (the vocabulary)
– ... an ontology is a (possibly incomplete) axiomatization of a conceptualization.

[Figure labels: set of all models M(L); conceptualization C(L); ontology; logic theories]

[Guarino96] http://www-ksl.stanford.edu/KR96/Guarino-What/P003.html
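
The two definitions above can be made concrete with a small sketch. It assumes a toy language L with two predicate symbols, Igneous and Sedimentary, applied to a single rock sample; the symbols, the intended models, and the single axiom are hypothetical illustrations and do not come from the slides. The sketch enumerates M(L), picks out a conceptualization C(L), and shows that a one-axiom ontology covers every intended model yet still admits one unintended model, which is what "possibly incomplete" means here.

```python
from itertools import product

# Hypothetical vocabulary (non-logical symbols) of a toy language L.
VOCAB = ("Igneous", "Sedimentary")

# M(L): every interpretation assigns True/False to each symbol
# for one rock sample.
ALL_MODELS = [
    dict(zip(VOCAB, bits))
    for bits in product([False, True], repeat=len(VOCAB))
]


def is_intended(model):
    # Assumed conceptualization C(L): the sample is exactly one of the kinds.
    return model["Igneous"] != model["Sedimentary"]


CONCEPTUALIZATION = [m for m in ALL_MODELS if is_intended(m)]

# The ontology: a set of axioms. Here a single axiom states that the
# two kinds are disjoint (it does not say every sample has a kind).
AXIOMS = [lambda m: not (m["Igneous"] and m["Sedimentary"])]


def models_of(axioms):
    """All interpretations in M(L) that satisfy every axiom."""
    return [m for m in ALL_MODELS if all(axiom(m) for axiom in axioms)]


ONTOLOGY_MODELS = models_of(AXIOMS)

# Every intended model satisfies the axioms ...
covers_intended = all(m in ONTOLOGY_MODELS for m in CONCEPTUALIZATION)
# ... but the axioms also admit models outside the conceptualization.
unintended = [m for m in ONTOLOGY_MODELS if m not in CONCEPTUALIZATION]
```

Adding a second axiom, "Igneous or Sedimentary", would remove the unintended model and make this toy ontology characterize the conceptualization exactly.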

Page 7

Problem: Scientific Data Integration ... from Questions to Queries ...

What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?

? Information Integration

“Complex Multiple-Worlds” Mediation

• Geologic Map (Virginia)
• GeoChemical
• GeoPhysical (gravity contours)
• GeoChronologic (Concordia)
• Foliation Map (structure DB)

[Figure labels, from domain knowledge down to raw data: Knowledge Representation (ontologies, concept spaces); Database mediation; Data modeling]

Page 8

Got Glue? Which one? What for?

• XML (common syntax)
– flexible (semistructured) data model
– used at all levels: data / metadata exchange, message exchange (SOAP), schemas & data types (XML Schema), Semantic Web & web ontologies (RDF(S), OWL), ...
• Grid infrastructure (system interoperation)
– distributed computing and data management
– web services
• Controlled Vocabularies (“joins”)
– data level: joins across different data sets
– but meta-data and ontologies (concept names, relationship names, ...) are also data!
• Integrated View Definitions (mediated views/virtual databases)
– declarative specification of “integration logic”: XQuery, Datalog, ...
• Thesauri (translator for retrieving related information)
– synonyms, broader/narrower term, e.g., UMLS (meta-thesaurus, “ontology”)
• Taxonomies (classification)
– shared vocabulary, concept hierarchy (is-a)
• Ontologies (classification + additional semantics):
– formal specification of a conceptualization, shared meaning
– facilitates “smart querying”, semantic mediation

Page 9

Information Integration Challenges

• System aspects: “Grid” Middleware
– distributed data & computing
– Web Services, WSDL/SOAP, OGSA, …
– sources = functions, files, data sets, …
• Syntax & Structure: (XML-Based) Data Mediators
– wrapping, restructuring
– (XML) queries and views
– sources = (XML) databases
• Semantics: Model-Based/Semantic Mediators
– conceptual models and declarative views
– Knowledge Representation: ontologies, description logics (RDF(S), OWL ...)
– sources = knowledge bases (DB+CMs+ICs)

reconciling S4 heterogeneities (Syntax, Structure, Semantics, System aspects)

“gluing” together multiple data sources

bridging information and knowledge gaps computationally

Page 10

Standard (XML-Based) Mediator Architecture

[Figure: mediator architecture, top to bottom]

USER/Client: Query Q ( G (S1,..., Sk) )

MEDIATOR: Integrated Global (XML) View G
Integrated View Definition: G(..) S1(..)…Sk(..)

(XML) Queries & Results

Sources S1, S2, ..., Sk: each source sits behind a Wrapper that exports an (XML) View

wrappers implemented as web services
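
The architecture above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GEON mediator: plain Python dictionaries stand in for XML views, the two sources and their rows are hypothetical, and the helper names `wrap`, `global_view`, and `query` are invented for this sketch. The point it shows is that the client only ever queries the integrated view G, which is defined over the wrapped sources and evaluated on demand.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def wrap(rows: Iterable[Row], mapping: Dict[str, str]) -> List[Row]:
    """Wrapper: export a source under the global attribute names.

    `mapping` maps each global attribute to the source's local column.
    """
    return [{g: row[local] for g, local in mapping.items()} for row in rows]


# Two hypothetical sources with different local schemas.
S1 = [{"FMATN": "Formation A", "PERIOD": "Tertiary"}]
S2 = [{"NAME": "Formation B", "AGE": "Quaternary"}]


def global_view() -> List[Row]:
    """Integrated view definition G(S1, S2): union of the wrapped sources."""
    return (
        wrap(S1, {"formation": "FMATN", "age": "PERIOD"})
        + wrap(S2, {"formation": "NAME", "age": "AGE"})
    )


def query(predicate: Callable[[Row], bool]) -> List[Row]:
    """Client query Q(G(S1, S2)), evaluated over the virtual view."""
    return [row for row in global_view() if predicate(row)]


tertiary = query(lambda row: row["age"] == "Tertiary")
```

Because `global_view` is recomputed per query, the integrated database stays virtual: nothing is copied into a warehouse, and a change in a source is visible to the next query.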

Page 11

XML-Based vs. Semantic Mediation

[Figure: two mediation stacks built over the same raw data]

XML-based mediation:
• Raw Data → XML Elements → XML Models
• Structural Constraints (DTDs), Parent, Child, Sibling, ...
• A = (B*|C),D
  B = ...
• Integrated-DTD := XQuery(Src1-DTD,...)
• No Semantics / Domain Constraints

Semantic mediation:
• Raw Data → (XML) Objects → Conceptual Models
• Classes, Relations, is-a, has-a, ... (C1, C2, C3, R)
• Semantics, Constraints in Logic (IF ... THEN ...)
• Integrated-CM := CM-QL(Src1-CM,...)
• “Glue Maps”: ontologies, concept spaces

CM ~ {Descr.Logic, ER, UML, RDF(S), …}
CM-QL ~ {F-Logic, …}

Raw data example: 0.0155381,1.54906,2,140,29,Tertiary,Trc,CHINLE FORMATION,59,57

Page 12

GEON Framework for Interoperability in the Geosciences

• Systems level: GEON Grid ...
– enable sharing of data and tools via grid services
– based on Open Grid Services Architecture (OGSA)
– acquisition of cluster endpoints and initial deployment at some sites underway, including SDSC, UTEP, VT, ...
• Syntactic and schema level: Data integration via (meta)data standards (often XML-based)
– database mediators create integrated virtual databases
=> dynamic creation and automatic update of data-warehouses
• Semantic level: data integration via “semantic” mediation
– Situating 4-D data in context: spatio-temporal, thematic, process contexts can be represented as “concept spaces”
– specifically: use of ontologies, and logic-based knowledge representation
– development guided/driven by specific scientific data integration problems

Page 13

Towards Shared Conceptualizations: High-level Domain Ontology & Standard Data Model

Source: NADAM Team (Boyan Brodaric et al.)

Adoption of a standard (meta)data model => wrap data sets into unified virtual views

Page 14

Towards Shared Conceptualizations: Data Contextualization via Concept Spaces

Page 15

Towards Knowledge Sharing: Rock-type “Ontology”

Composition

Genesis

Fabric

Texture

Page 16

Biomedical Informatics Research Network http://nbirn.net

Getting Formal: Source Contextualization & Ontology Refinement in Logic

Page 17

Show formations where AGE = ‘Paleozoic’ (without age ontology)

Show formations where AGE = ‘Paleozoic’ (with age ontology)

domain knowledge

Knowledge representation

AGE ONTOLOGY

Nevada

Page 18

Querying with Multiple Classifications/Ontologies: Age, Composition, Texture, Fabric, Genesis

Page 19

What to do with the “KR Glue”?

• Conceptual-level information, concept spaces, ontologies, and other KR techniques for ...
– ... smart data discovery
– ... browsing and querying by themes, disciplines, ...
– ... defining virtual/mediated databases at conceptual level
– ... support “plugging together” of “data and information experiments” into Scientific Workflows (a.k.a. Analytical Pipelines in the SEEK ITR)
– ... smarter user interfaces: is “find felsic sedimentary rocks” a meaningful (satisfiable) query?
– ...
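
The satisfiability question in the list above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the is-a edges, the disjointness axiom, and in particular the modeling choice that Felsic is a kind of IgneousRock are hypothetical and are not taken from the GEON rock-type ontology. Under those assumptions, a user interface could reject a query before it is sent to any source.

```python
# Hypothetical ontology fragment: is-a edges (child -> parent).
IS_A = {
    "Felsic": "IgneousRock",  # assumption for illustration only
    "IgneousRock": "Rock",
    "SedimentaryRock": "Rock",
}

# Hypothetical disjointness axioms: no rock belongs to both classes.
DISJOINT = {frozenset({"IgneousRock", "SedimentaryRock"})}


def ancestors(concept):
    """Return the concept together with all of its is-a ancestors."""
    seen = {concept}
    while concept in IS_A:
        concept = IS_A[concept]
        seen.add(concept)
    return seen


def satisfiable(concepts):
    """A conjunction of concepts is unsatisfiable if it entails
    membership in two classes that are declared disjoint."""
    entailed = set()
    for concept in concepts:
        entailed |= ancestors(concept)
    return not any(pair <= entailed for pair in DISJOINT)


felsic_sedimentary = satisfiable(["Felsic", "SedimentaryRock"])
felsic_igneous = satisfiable(["Felsic", "IgneousRock"])
```

With a different ontology (for example, one that treats composition and genesis as independent classifications) the same query would be satisfiable, so the answer depends entirely on the axioms supplied.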

Page 20

Some enabling operations on “ontology data”

Composition

Concept expansion:
• what else to look for when asking for ‘Mafic’
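
Concept expansion can be sketched as collecting a concept and everything below it in a hierarchy. The hierarchy fragment here is hypothetical (the slide's actual composition ontology is a figure that did not survive extraction), and the function name `expand` is invented for this sketch.

```python
# Hypothetical fragment of a composition hierarchy (child -> parent).
PARENT = {
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Granite": "Felsic",
    "Mafic": "Composition",
    "Felsic": "Composition",
}


def expand(concept):
    """Return the concept plus all of its descendants, sorted by name."""
    result = {concept}
    changed = True
    while changed:  # repeat until no new descendant is found
        changed = False
        for child, parent in PARENT.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return sorted(result)


mafic_terms = expand("Mafic")
```

A query for ‘Mafic’ would then match any record whose composition is one of `mafic_terms`, not only records that literally say ‘Mafic’.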

Page 21

Some enabling operations on “ontology data”

Composition

Generalization:
• finding data that is “like” X and Y
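
One plausible reading of generalization is to find the most specific concept that covers both X and Y (their least common ancestor) and then search under it. The sketch below takes that reading as an assumption; the hierarchy is hypothetical and the names `path_to_root` and `generalize` are invented for illustration.

```python
# Hypothetical fragment of a composition hierarchy (child -> parent).
PARENT = {
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Granite": "Felsic",
    "Mafic": "Composition",
    "Felsic": "Composition",
}


def path_to_root(concept):
    """List the concept followed by its ancestors, nearest first."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def generalize(x, y):
    """Return the least common ancestor of x and y, or None if the
    two concepts share no ancestor in the hierarchy."""
    ancestors_of_y = set(path_to_root(y))
    for concept in path_to_root(x):
        if concept in ancestors_of_y:
            return concept
    return None


like_basalt_and_gabbro = generalize("Basalt", "Gabbro")
like_basalt_and_granite = generalize("Basalt", "Granite")
```

The closer the common ancestor is to X and Y, the more specific the notion of "like"; two rocks that only meet at the root of the hierarchy are related in name only.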

Page 22

Towards Knowledge Sharing: Rock-type Ontology

Composition

Genesis

Fabric

Texture

Page 23

DEMO

... do NOT click this ...

http://kbis.sdsc.edu/GEON/ahm03-demo.html

Page 24

Architecture of Integrated Geologic Map Prototype System

[Figure: prototype architecture; component labels]
• HTTP Server (Java Server Page): request / response
• MapServer (Minnesota)
• Mediator (Java application)
• Database (Arizona)
• Database (Montana)
• Map Definition: local layer, remote layer, local layer
• Global Ontology Definitions: Rock classification, Geologic age

Page 25

Data Source Wrapping and Integration

[Figure: state geologic map sources wrapped into an integrated view]
• Sources: Arizona, Colorado, Utah, Nevada, Wyoming, New Mexico, Montana East, Idaho, Montana West
• Wrapped source views: Formation, Age
• Integrated views: Formation, Age, Composition, Fabric, Texture
• Source column names shown: ABBREV, PERIOD, NAME, TYPE, TIME_UNIT, FMATN, FORMATION, LITHOLOGY, AGE
• Example values: andesitic sandstone; Livingston formation; Tertiary-Cretaceous

Page 26

Ontology-Enabled Query Processing

User: “Show formations from Cenozoic!”

Age Ontology: Cenozoic → Quaternary, Tertiary

Query Rewriting: select FORMATION where AGE=“Tertiary” or AGE=“Quaternary”

[Figure: the rewritten query is evaluated against the Arizona and Montana West sources]
• Source column names shown: ABBREV, PERIOD, FORMATION, LITHOLOGY
• Rows shown: (Tertiary, Tkgm), (Quaternary, Q), (Qg, Quaternary), (Twp, Tertiary), (Twl, Tertiary), …
• Matching abbreviations: Tkgm, Q, Qg, Twp, Twl, …
• Color Definition → Map Rendering
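
The rewriting step on this slide can be sketched as follows. The Cenozoic entry of the age ontology, the query text, and the abbreviations Tkgm and Q come from the slide; the Paleozoic row (Pz), the flat two-column row layout, and the function names `leaves`, `rewrite`, and `evaluate` are assumptions added for illustration.

```python
# Age ontology fragment: the Cenozoic entry is the one shown on the slide.
AGE_ONTOLOGY = {"Cenozoic": ["Tertiary", "Quaternary"]}


def leaves(age):
    """Expand an age term to the most specific terms the ontology knows."""
    children = AGE_ONTOLOGY.get(age)
    if not children:
        return [age]
    return [leaf for child in children for leaf in leaves(child)]


def rewrite(age):
    """Build the rewritten selection in the slide's query notation."""
    condition = " or ".join(f'AGE="{term}"' for term in leaves(age))
    return f"select FORMATION where {condition}"


# (abbreviation, age) rows; the "Pz" row is hypothetical.
ROWS = [("Tkgm", "Tertiary"), ("Q", "Quaternary"), ("Pz", "Paleozoic")]


def evaluate(age):
    """Return abbreviations of formations whose age falls under `age`."""
    wanted = set(leaves(age))
    return [abbrev for abbrev, row_age in ROWS if row_age in wanted]


rewritten = rewrite("Cenozoic")
cenozoic_formations = evaluate("Cenozoic")
```

Without the ontology, a literal match on AGE="Cenozoic" finds nothing in these rows, because the sources record ages at the period level.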

Page 27

Integration Challenges

• MANY!
• non-available or non-interoperable data
• “Dirty data”, no controlled vocabularies
• Many different controlled vocabularies! (“clean data”)
• What is entailed by a vocabulary?
– Formal Ontologies
– Extensible Ontologies

Page 28

What’s next?

• YOU!
• GEON-SCI:
– Science questions waiting to be turned into queries!
• GEON-KR Working Group activities
– guided (if not driven) by geoscientists
– marry KR technologies to standards (W3C, Semantic Web: RDF, OWL, ...)
– collect GEON-able KR resources (data models, controlled vocabularies, ontologies, ...)
• GEON-DEV:
– Generalize and merge current KR/semantic mediation architecture with standard Grid architecture
– building systems