Reducing the Cost of Validating Mapping Compositions by Exploiting Semantic Relationships Eduard C. Dragut Ramon Lawrence University of Illinois at Chicago University of British Columbia Okanagan ODBASE 2006, Montpellier, France


Reducing the Cost of Validating Mapping Compositions by Exploiting Semantic Relationships

Eduard C. Dragut, University of Illinois at Chicago
Ramon Lawrence, University of British Columbia Okanagan

ODBASE 2006, Montpellier, France


Talk Overview
  • Introduction
  • Background
      • Model and mapping representation systems
  • Proposed Mapping Representation System
      • Invert and Compose operator definitions and properties
  • Mappings Composition Experiment
      • Estimate the quality of the proposed system


Introduction - Terminology
  • Models
      • denote a representation of a domain in a formal language (e.g., EER, Relational, Description Logic)
      • have two components [Russell et al. 2003]
          • terminological (or metadata): this is the focus of this work and talk
          • extensional (i.e., facts or instances)
  • Mappings
      • describe how two models are related to each other


Introduction - Mappings
  • Ways to define mappings between models
      • binary relationships, called morphisms [Melnik et al. 2003] or inter-schema correspondences [Popa et al. 2002]
      • mapping using a helper model [Bernstein et al. 2003]
      • mapping as queries [Madhavan et al. 2003, Bernstein et al. 2006]
  • Our work falls in the class of the first two types of mappings. We call them metadata level mappings.
  • They are not concerned with the instances of a model.


Introduction - Examples
  • Examples of models
      • diagrams, interface definitions, database schemas, web site layouts, control flow, XML schemas
  • Applications of mappings
      • mapping between XML schemas to drive message translation
      • schema and database integration
      • mapping between ontologies to help in the process of merging and alignment


Background - Mapping creation
  • The creation of mappings will rarely be completely automated.
  • The general strategy is to build mappings semi-automatically:
      • use heuristics to generate matchings (e.g., name similarity) [Rahm and Bernstein 2001, Shvaiko and Euzenat 2005] (surveys)
      • translate matches into formulas, e.g., Clio project [Popa et al. 2002]
      • generate new mappings from existing mappings
          • Composition, e.g., [Madhavan et al. 2003, Bernstein et al. 2006]
          • Invert, e.g., [Fagin 2006]
  • Semi-automatic tools can significantly speed up the process.


Background - Morphisms
  • Mapping:
      • is just a set of binary relations between the elements of two models
      • is a set of pairs < l, r >
  • Advantages/Disadvantages
      • their expressiveness is enough for certain classes of problems and they exhibit certain mathematical properties [Melnik et al. 2003]
      • main drawback: assumes similarity to be transitive

CREATE TABLE Actor1(ActID int PRIMARY KEY, Bio varchar, ActorName varchar)

<schema xmlns="...">
  <complexType name="Actor2">
    <element name="ActorID" type="xs:int"/>
    <element name="Bio" type="xs:string"/>
    <element name="FirstName" type="xs:string"/>
    <element name="LastName" type="xs:string"/>
  </complexType>
</schema>


Background - Morphisms problems
[Figure: three models linked by morphisms. Actor1 (Bio, ID, MiddleName, FirstName, LastName, LastMovie, RecentMovie), Actor (Bio, ActID, ActorName), Actor2 (ActorID, Bio, FirstName, LastName).]
  • Composition
      • <ID, ActID> ○ <ActID, ActorID> = <ID, ActorID>
      • due to the transitivity assumption
  • Problems with this technique
      • Whenever an m:1 correspondence is composed with a 1:n correspondence, the composition result is a cross-product, many of whose pairs are false positives.
      • It may miss or suggest false relationships.
[Figure: composition result between Actor1 and Actor2. Legend: blue = correct; red = false positive or missed.]
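
The cross-product effect can be reproduced with a minimal Python sketch. The name correspondences below are illustrative assumptions loosely based on the Actor example (the exact arrows of the figure are not recoverable from this transcript), and `compose_morphisms` is a hypothetical helper name.

```python
def compose_morphisms(map1, map2):
    """Join two sets of pairs on the shared middle element.

    This treats similarity as transitive: <l, x> and <x, r> always yield <l, r>.
    """
    return {(l, r) for (l, mid1) in map1 for (mid2, r) in map2 if mid1 == mid2}


# Assumed correspondences Actor1 -> Actor (m:1 on ActorName).
map1 = {
    ("ID", "ActID"),
    ("FirstName", "ActorName"),
    ("MiddleName", "ActorName"),
    ("LastName", "ActorName"),
}
# Assumed correspondences Actor -> Actor2 (1:n on ActorName).
map2 = {
    ("ActID", "ActorID"),
    ("ActorName", "FirstName"),
    ("ActorName", "LastName"),
}

composed = compose_morphisms(map1, map2)
# Every Actor1 name part is paired with every Actor2 name part (3 x 2 pairs).
name_pairs = sorted(p for p in composed if p[0] != "ID")
for pair in name_pairs:
    print(pair)
```

Under these assumptions the result contains pairs such as <FirstName, LastName> and <MiddleName, FirstName>, and nothing in the output distinguishes them from the intended pairs.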


Background - Mapping with helper models
  • Algorithm (for right compose) [Bernstein et al. 2002]
      • copy the right-hand-side mapping
      • for each mapping element, m, on the right, i.e., in map2: compute its Input(m)
      • for each mapping element, m, on the right, i.e., in map2: set its domain to the union of the domains of Input(m)
  • Example:
[Figure: models Actor1 (Bio, ID, MiddleName, FirstName, LastName, LastMovie, RecentMovie), Actor (Bio, ActID, ActorName), and Actor2 (ActorID, Bio, FirstName, LastName), connected through the helper-model mappings map1 and map2, with mapping elements labeled m2 to m7.]


Background - Mapping with helper models
  • Composition result
[Figure: the mappings map1 and map2 among Actor1, Actor, and Actor2, shown next to the composed mapping between Actor1 and Actor2 (labeled map2, with mapping elements m2 and m3). Legend: red = missed relationships.]
  • Problems with this technique
      • It may miss or suggest false relationships.


The Objectives
  • The driving motivation
      • The need for a mapping definition subsuming the relationship kinds that state-of-the-art matching algorithms discover with high precision.
      • Investigate to what extent a set of operations over this mapping definition can be defined.
  • Provide a mapping representation at the metadata level combining the advantages of morphisms and mappings with helper models. The former has good mathematical properties. The latter is more expressive.
  • Provide a compose algorithm that exploits the semantic relationships within the mapping expression to produce correct semantic relationships whenever these can be determined automatically and to isolate those instances that require human intervention.


Proposed Mappings Representation
  • Model
      • A model has expressiveness similar to that of an EER model and is consistent with the definition of model used in previous work on model management [Bernstein et al. 2002, Pottinger and Bernstein 2003].
  • Mapping Representation
      • A mapping consists of a set of mapping elements; each mapping element is a directed, kinded binary relationship between a pair of elements not in the same model.
      • Triplets of the form < m1, type, m2 >, where type ∈ {IsA, AKindOf, HasA, PartOf, =, Contains, ContainedBy, Unknown, Complex}
  • Comments
      • Some of these types were introduced in other works, e.g., [Euzenat 2004, Giunchiglia et al. 2004, Pottinger and Bernstein 2003, Xu and Embley 2003, Wu et al. 2004].
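
One possible way to hold such triplets in code is sketched below in Python. The class names (`Kind`, `Element`, `MappingElement`) and the model labels "A" and "B" are hypothetical; only the nine relationship kinds and the rule that both ends belong to different models come from the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    """The nine relationship kinds listed on the slide."""
    IS_A = "IsA"
    A_KIND_OF = "AKindOf"
    HAS_A = "HasA"
    PART_OF = "PartOf"
    EQUAL = "="
    CONTAINS = "Contains"
    CONTAINED_BY = "ContainedBy"
    UNKNOWN = "Unknown"
    COMPLEX = "Complex"


@dataclass(frozen=True)
class Element:
    model: str  # name of the model the element belongs to (hypothetical field)
    name: str


@dataclass(frozen=True)
class MappingElement:
    """A directed, kinded binary relationship < source, kind, target >."""
    source: Element
    kind: Kind
    target: Element

    def __post_init__(self):
        # The two related elements must not be in the same model.
        if self.source.model == self.target.model:
            raise ValueError("mapping element must relate two different models")


# A mapping is a set of mapping elements.
m = MappingElement(Element("A", "HomePhone"), Kind.IS_A, Element("B", "Phone"))
mapping = frozenset({m})
```

Frozen dataclasses are hashable, so mapping elements can be stored in sets, which matches the "set of mapping elements" reading of the definition.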


Proposed Mappings Representation
  • An example
[Figure: example mapping between a PO model and a POrder model. Element labels shown: Product, ShipTo, FirstName, Street1, Article, ShipAddress, Recipient, LastName, Street2, Organization, Shipper, Phone, WorkPhone, HomePhone. Legend: Equality, Contains, IsA, HasA.]
  • Most of the relationship kinds in the mapping representation are well-known except for Unknown and Complex
      • < a, Unknown, b > means that the relationship between concepts a and b is not precisely known.
      • < a, Complex, b > means that the relationship between concepts a and b may require a functional specification: a = f(b)
          • e.g., Price = PriceVat(VAT + 1)


Operators - Invert
  • Each of the relationship types introduced has well-defined inversion properties:
      • IsA inverted is AKindOf, HasA inverted is PartOf, Contains inverted is ContainedBy
  • Definition [invert for mapping elements]: Consider m = < a, type, b >. Then its corresponding inverted mapping element, denoted $m^{-1}$, is given by the following expression:
      • Mathematical form: $\langle a, \mathit{type}, b \rangle^{-1} = \langle b, \mathit{type}^{-1}, a \rangle$
      • E.g., $\langle a, \mathit{HasA}, b \rangle^{-1} = \langle b, \mathit{HasA}^{-1}, a \rangle = \langle b, \mathit{PartOf}, a \rangle$
  • Definition [invert for mappings]: Given two models A and B and a mapping, map, between them, the invert of map, denoted by $\mathit{map}^{-1}$, is defined from B to A and its expression is given by:
      $\mathit{map}^{-1} = \{ \langle b, \mathit{type}^{-1}, a \rangle \mid \langle a, \mathit{type}, b \rangle \in \mathit{map} \}$
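
The two Invert definitions translate almost directly into code. The Python sketch below represents a mapping element as a plain `(a, kind, b)` tuple. The three inverse pairs are taken from the slide; treating `=`, Unknown, and Complex as their own inverses is an assumption, since the slide does not state it.

```python
INVERSE = {
    # Inverse pairs stated on the slide.
    "IsA": "AKindOf", "AKindOf": "IsA",
    "HasA": "PartOf", "PartOf": "HasA",
    "Contains": "ContainedBy", "ContainedBy": "Contains",
    # Assumption: the remaining kinds are their own inverse.
    "=": "=", "Unknown": "Unknown", "Complex": "Complex",
}


def invert_element(m):
    """< a, type, b >^-1 = < b, type^-1, a >"""
    a, kind, b = m
    return (b, INVERSE[kind], a)


def invert_mapping(mapping):
    """map^-1 = { < b, type^-1, a > | < a, type, b > in map }"""
    return {invert_element(m) for m in mapping}


mapping = {("a", "HasA", "b"), ("HomePhone", "IsA", "Phone")}
inverted = invert_mapping(mapping)
print(sorted(inverted))
```

With this table, inverting twice returns the original mapping, because every kind maps back to itself after two lookups.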


Operators - Compose
  • Composing two mappings involves defining a composition operation between the elements of the mappings (i.e., between triplets of the form < a, type, b >)
  • Example
      • <HomePhone, IsA, Phone> ○ <Phone, =, Telephone> = <HomePhone, IsA, Telephone>
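
The full composition table is not shown in this transcript, so the Python sketch below uses a deliberately small, assumed rule set: `=` acts as the identity (which reproduces the HomePhone example), a kind assumed to be transitive composes with itself, and every other combination falls back to Unknown. The function names and the `ASSUMED_TRANSITIVE` set are hypothetical, not the authors' table.

```python
# Assumption: these kinds compose with themselves; not confirmed by the slides.
ASSUMED_TRANSITIVE = {"IsA", "AKindOf", "Contains", "ContainedBy"}


def compose_kinds(k1, k2):
    """Return the kind of < a, k1, b > composed with < b, k2, c > under the assumed rules."""
    if k1 == "=":
        return k2
    if k2 == "=":
        return k1
    if k1 == k2 and k1 in ASSUMED_TRANSITIVE:
        return k1
    # When no rule applies, flag the pair for human validation.
    return "Unknown"


def compose_elements(m1, m2):
    """Compose two triplets; return None if they do not share the middle element."""
    a, k1, b1 = m1
    b2, k2, c = m2
    if b1 != b2:
        return None
    return (a, compose_kinds(k1, k2), c)


def compose_mappings(map1, map2):
    """Compose every pair of triplets that meet on a shared middle element."""
    result = set()
    for m1 in map1:
        for m2 in map2:
            composed = compose_elements(m1, m2)
            if composed is not None:
                result.add(composed)
    return result


print(compose_elements(("HomePhone", "IsA", "Phone"), ("Phone", "=", "Telephone")))
# -> ('HomePhone', 'IsA', 'Telephone')
```

The sketch does not reconcile several paths between the same pair of elements; a fuller implementation would need a policy for that case.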


Compose Properties
  • Remarks:
      • Composition is closed: composing two mappings whose elements are triplets < a, type, b > again yields a mapping of such triplets.
      • Mapping composition is symmetric in this framework:
          $(\langle a, \mathit{type}, b \rangle \circ \langle b, \mathit{type}, c \rangle)^{-1} = \langle c, \mathit{type}^{-1}, b \rangle \circ \langle b, \mathit{type}^{-1}, a \rangle$
      • The result of composing two mappings does not produce false correspondences between the elements of the two models, i.e., it does not suggest false directed, kinded relationships.
      • The Compose operator uses the Unknown relationship to indicate when it is not possible (in general) to suggest a relationship type given only the information expressed in the two mappings.


Experiment - Setup
  • Experiment goal:
      • Show that the composition framework is robust when applied to real-world applications and that we are able to correctly identify problematic cases.
      • We compare it against mappings as morphisms.
  • Five real-world XML schemas in the purchase order domain: CIDR, Excel, Noris, Paragon, and Apertum, from www.biztalk.org
      • They were used in other projects: [Dragut and Lawrence 2004, Madhavan et al. 2001]
  • A reference ontology, to which each XML schema is manually mapped both using morphisms [Dragut and Lawrence 2004] and using the new mapping definition.


Experiment - Setup
  • Example of XML schemas: the Excel and CIDR schemas
[Figure: tree diagram of the Excel schema. Labels shown: Purchase Order, Footer, Header, totalValue, orderNum, orderDate, ourAccountCode, yourAccountCode, Contact, contactName, companyName, e-mail, telephone, InvoiceTo, DeliverTo, Address, city, country, stateProvince, street1, street2, street3, street4, postalCode, Items, Item, itemCount, partNumber, yourPartNumber, itemNumber, partDescription, quantity, unitOfMeasure, unitPrice, salesValue.]
[Figure: tree diagram of the CIDR schema. Labels shown: Contact, POHeader, POShipTo, contactName, contactFunctionCode, contactEmail, contactPhone, poDate, poNumber, POLines, Item, count, startAt, qty, unitPrice, uom, partno, line, and two address blocks, each with attn, city, country, entityidentifier, stateProvince, street1, street2, street3, street4, postalCode.]


Experiment - Intermediary model
[Figure: diagram of the intermediary model. Labels shown: POrder, PurchaseOrder, PurchaseOrderDate, OrderDate, OrderNumber, ShipmentDate, Amount, Currency, Discount, Comments, Agent, Person, Personnel, contactPerson, FirstName, LastName, Title, Position, HomePage, Fax, Email, Phone, Organization, OrganizationName, Supplier, Shipper, Address, hasAddress, City, Country, State, Street, Zip, ItemsCollection, hasItems, hasItem, PurchasedItem, ItemDescription, ItemName, PartNumber, UPC, Quantity, Price, billTo, shipTo, suppliedBy, shippedBy.]
  • Comments:
      • The intermediary model does not have all concepts in the schemas (e.g., unitOfMeasure, count, and VAT).
      • The intermediary model is structurally different from the five schemas considered and it is defined using OWL.


Experiment - Methodology
  • Step 1: map the five schemas to the intermediary model:
      • First, using morphisms
      • Second, using the proposed mapping
  • Step 2: apply the compose operators to compute direct mappings between the schemas
      • First, employing composition over morphisms
      • Second, using the new compose operator
  • Step 3: measure the quality of the two compositions in terms of
      • Precision, Recall, and Overall
      • A new metric is introduced: User Effort.
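
The slides do not define Precision, Recall, or Overall. The Python sketch below assumes the definitions common in the schema matching literature, with Overall = Recall × (2 − 1/Precision), which is algebraically equal to (true positives − false positives) / |reference|. The function name `match_quality` and the sample pairs are hypothetical.

```python
def match_quality(produced, reference):
    """Return (precision, recall, overall) of produced pairs against a reference set."""
    produced, reference = set(produced), set(reference)
    true_pos = len(produced & reference)
    false_pos = len(produced - reference)
    precision = true_pos / len(produced) if produced else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    # Same value as recall * (2 - 1 / precision), but defined when precision is 0.
    overall = (true_pos - false_pos) / len(reference) if reference else 0.0
    return precision, recall, overall


# Hypothetical data: 5 reference pairs; 4 produced pairs, 3 of them correct.
reference = {("a", "1"), ("b", "2"), ("c", "3"), ("d", "4"), ("e", "5")}
produced = {("a", "1"), ("b", "2"), ("c", "3"), ("x", "9")}

precision, recall, overall = match_quality(produced, reference)
print(precision, recall, overall)
```

Overall turns negative once false positives outnumber true positives, which reflects the case where correcting the result costs more than building the mapping by hand.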


Experiment - Stats
  • Overall after composition was computed
[Figure: bar chart of Overall (y-axis, 0 to 1) for each schema pair (x-axis: 1<->2, 1<->3, 1<->4, 1<->5, 2<->3, 2<->4, 2<->5, 3<->4, 3<->5, 4<->5). Series: ours, morphisms.]
  • CIDR, Excel, Noris, Paragon, and Apertum are assigned numbers 1, 2, 3, 4, and 5, respectively.


Experiment - Stats
  • User effort is the percentage of mappings that must be validated by a user.
      • For morphisms, user effort is 100%, as there is no way to distinguish true from false relationships.
      • In our framework, it is the ratio of the number of Unknown relationships to the number of all produced relationships.
      • On average it is only 19%.
[Figure: bar chart (0% to 100%) for each schema pair (1<->2, 1<->3, 1<->4, 1<->5, 2<->3, 2<->4, 2<->5, 3<->4, 3<->5, 4<->5). Series: % correct (semantic), % unknown (semantic), % correct (morphism), % incorrect (morphism).]
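
The User Effort ratio for the proposed framework is simple enough to state as code. This Python sketch assumes that mapping elements are `(a, kind, b)` tuples and that the kind is the string "Unknown"; the function name and the sample triplets are hypothetical.

```python
def user_effort(composed):
    """Fraction of produced relationships whose kind is Unknown."""
    composed = list(composed)
    if not composed:
        return 0.0
    unknown = sum(1 for (_, kind, _) in composed if kind == "Unknown")
    return unknown / len(composed)


# Hypothetical composition result: 5 relationships, 1 of them Unknown.
composed = [
    ("orderNum", "=", "poNumber"),
    ("orderDate", "=", "poDate"),
    ("telephone", "IsA", "contactPhone"),
    ("quantity", "=", "qty"),
    ("itemCount", "Unknown", "count"),
]
print(f"{user_effort(composed):.0%}")  # -> 20%
```

For morphisms the same measure is fixed at 100%, since plain pairs carry no kind that could mark a result as needing validation.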


End

Thank you for your time and patience!