Data Exchange with Data-Metadata Translations

MAD Algorithm

Paolo Papotti

Mauricio A. Hernández

Wang-Chiew Tan

Data Exchange

“Scientia potentia est”

• What is Data Exchange?

• The process of taking data built under a source schema and transforming it into data built under a target schema

• Data Exchange is the restructuring of data

Data Exchange – why?

1. Today, when companies merge, they also merge their information sources.

2. When several institutions are working on a joint venture – a combined database is

3. Refreshing and updating a database schema

A few problems with data exchange

1. The labels and the values in the Source Schema and the Target Schema could be very different

2. Data could be kept in a plethora of ways. For instance, a car price could be stored in Shekels or in U.S. dollars

3. Data could be lost in the exchange process if the Source Schema and Target Schema don’t correspond well

Data Exchange

In the past, Data Exchange was done manually, consuming many resources such as time and money.

Many researchers struggle with ways of improving data exchange

Antique Car Dealership

Schema Clunkers-R-Us, Clunker table:

Location      List-price   Automobile       Seniority   Agent-name
Belfast, NR   650000       Morris 8         2           Gerry Adams
Newry, NR     500000       Bentley Mark V   1           Martin McGuiness

Schema Buy-A-Wreck, Car AGENTS table:

Id   Name          Car model     Commission
48   Nigel Dodds   Vauxhall 14   0.03
66   Ian Paisley   Ford T        0.04

Schema Buy-A-Wreck, cars table:

Car Model      price     Agent-id
Vauxhall 14    360,000   48
Ford Model T   430,000   66

Matching Examples

Schema Buy-A-Wreck   Schema Clunkers-R-Us
Name                 Agent-name
Nigel Dodds          Nigel Dodds
Ian Paisley          Ian Paisley

Schema Buy-A-Wreck   Schema Clunkers-R-Us
Car model            Automobile
Vauxhall 14          Vauxhall 14
Ford T               Ford T

Matching Examples

Schema Buy-A-Wreck, cars table:

Car type       price     Agent-id
Vauxhall 14    360,000   48
Ford Model T   430,000   66

Schema Buy-A-Wreck, Car AGENTS table:

Id   Commission
48   0.03
66   0.04

Schema Clunkers-R-Us:

List-price   Car model
370800       Vauxhall 14
447200       Ford Model T
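The list prices in this example are consistent with price × (1 + commission), with cars joined to agents on the agent id. The slides do not state that formula, so the following Python sketch treats it as an assumption; the function and field names are illustrative only.

```python
# Buy-A-Wreck source tables, copied from the example above.
cars = [
    {"car": "Vauxhall 14", "price": 360_000, "agent_id": 48},
    {"car": "Ford Model T", "price": 430_000, "agent_id": 66},
]
agents = [
    {"id": 48, "commission": 0.03},
    {"id": 66, "commission": 0.04},
]


def to_clunkers(cars, agents):
    """Join cars with agents and derive a list price (assumed formula)."""
    commission_by_id = {a["id"]: a["commission"] for a in agents}
    result = []
    for car in cars:
        rate = commission_by_id[car["agent_id"]]
        # Assumption: list price = price * (1 + commission), rounded to whole units.
        list_price = round(car["price"] * (1 + rate))
        result.append({"list_price": list_price, "car_model": car["car"]})
    return result


clunkers = to_clunkers(cars, agents)
for row in clunkers:
    print(row["list_price"], row["car_model"])
```

A match like this needs more than a column-to-column correspondence: it combines a join across two source tables with an arithmetic expression.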

How do we match?

• Creating mappings:

1. Schema matching: find matches

2. Create query expressions: for automated data translation or exchange

Data Exchange

1. There may be no way to transform an instance given all of our constraints.

2. There may be numerous ways to transform the instance (possibly infinitely many).

3. We must identify and justify the solution best suited to our needs.

 

[Figure: Source schema S and Target schema T]

Data Exchange - Summary

To conclude:

1. Data exchange is the transfer of data from a Source Schema to a Target Schema

2. It is a widely studied problem in the computing world

3. Some Data exchange scenarios deal with Metadata

What is Metadata?

• Metadata: data about data.

Metadata can come as: Video

Audio

Image

Text

Why do we need Meta-Data?

Meta-Data helps us to understand data

Can anyone tell what these numbers mean?

Jan   120   223   89
Feb   83    168   56

After adding Meta-Data…

Umbrella Sales

Month   USA   UK    Italy
Jan     120   223   89
Feb     83    168   56

We all know this picture…

What is this picture all about?

Sir Edward Carson signing the Ulster Covenant

Wall Street, New York City, New York.

Data-Metadata Translations

• Data exchange scenarios may involve metadata transformations.

• Transforming the data in the Stock Ticker table to metadata in the Stock Quotes table is vital in the stock exchange world.

• Mapping systems support Data-to-Data transformations with fixed schemas (Clio).

• Goal: Extend mapping systems to support Data-Metadata Translations.

Data Exchange Clio

• One tool developed for simple, graphical data exchange is “Clio”

• Clio establishes value correspondences between the source schema and the target schema

• However, Clio does not handle data exchange scenarios that involve Metadata

• The solution involving Metadata is based on Clio

Clio interface

Metadata-to-Data

How can we transform the following source data into the corresponding target?

Source: Rcd
  Sales: SetOf Rcd
    month
    USA
    UK
    Italy

Target: Rcd
  Sales: SetOf Rcd
    month
    country
    units

Source.Sales
month   USA   UK    Italy
Jan     120   223   89
Feb     83    168   56

Target.Sales
month   country   units
Jan     USA       120
Jan     UK        223
Jan     Italy     89
Feb     USA       83
Feb     UK        168
Feb     Italy     56

Schema mapping m1 (“USA”):

m1: for $s in Source.Sales
    exists $t in Target.Sales
    where $t.month = $s.month and $t.country = “USA” and $t.units = $s.USA

Schema mapping m2 (“UK”):

m2: for $s in Source.Sales
    exists $t in Target.Sales
    where $t.month = $s.month and $t.country = “UK” and $t.units = $s.UK

Schema mapping m3 (“Italy”):

m3: for $s in Source.Sales
    exists $t in Target.Sales
    where $t.month = $s.month and $t.country = “Italy” and $t.units = $s.Italy

Metadata-to-Data: Our solution

Source: Rcd
  Sales: SetOf Rcd
    month
    USA
    UK
    Italy

Target: Rcd
  Sales: SetOf Rcd
    month
    country
    units

Placeholder: countries (label, value)

• Select the elements to group

• Copy elements’ labels

• Copy elements’ values

Source.Sales
Jan   120   223   89
Feb   83    168   56

Target.Sales
Jan   USA     120
Jan   UK      223
Jan   Italy   89
Feb   USA     83
Feb   UK      168
Feb   Italy   56

MetadatA-Data (MAD) mapping:

for $s in Source.Sales, $c in {“USA”, “UK”, “Italy”}
exists $t in Target.Sales
where $t.month = $s.month and $t.country = $c and $t.units = $s.($c)

• {“USA”, “UK”, “Italy”}: set of labels (strings)

• $t.country = $c: $c is a label value

• $s.($c): dynamic selection of the source element
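To make the MAD mapping concrete, here is a minimal Python sketch of the same idea: one loop over the source records and one over the set of labels, with the label used both as a data value and as a dynamic field selector. The function name `mad_mapping` and the dictionary representation are assumptions for illustration, not part of the mapping language.

```python
# Source.Sales from the example above: one column per country.
source_sales = [
    {"month": "Jan", "USA": 120, "UK": 223, "Italy": 89},
    {"month": "Feb", "USA": 83, "UK": 168, "Italy": 56},
]


def mad_mapping(rows, labels):
    """for $s in rows, $c in labels: emit (month, $c, $s.($c))."""
    target = []
    for s in rows:          # $s in Source.Sales
        for c in labels:    # $c ranges over the set of labels
            target.append({
                "month": s["month"],
                "country": c,   # the label is copied as a data value
                "units": s[c],  # $s.($c): dynamic selection of the source element
            })
    return target


target_sales = mad_mapping(source_sales, ["USA", "UK", "Italy"])
for t in target_sales:
    print(t["month"], t["country"], t["units"])
```

Adding a country to the source only changes the list of labels passed in; with fixed mappings, a new mapping would have to be written for each new column.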

Data-to-Metadata

Now we want to support the opposite operation

Source: Rcd
  StockTicker: SetOf Rcd
    time
    symbol
    price

Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols (dynamic element)
      label
      value

• The target schema depends on the source data

• We define a target template: Nested Dynamic Output Schemas (ndos)

• Run-time: The dynamic element defines the target instance and the target schema.

Data-to-Metadata: Heterogeneous records

There are two possible interpretations for the target ndos. Consider this mapping and this source instance:

Source Instance:
StockTicker (time: 0900, Symbol: MSFT, Price: 27.20)
StockTicker (time: 0900, Symbol: IBM, Price: 120.00)
StockTicker (time: 0905, Symbol: MSFT, Price: 27.30)

First alternative: Heterogeneous target records

Computed Target Instance:
Stockquotes (time: 0900, MSFT: 27.20)
Stockquotes (time: 0900, IBM: 120.00)
Stockquotes (time: 0905, MSFT: 27.30)

Computed Target Schema:
Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols: Choice
      MSFT
      IBM
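The heterogeneous interpretation can be sketched in Python as follows. This is only an illustration under the assumption that records are plain dictionaries; `to_heterogeneous` is a hypothetical helper name. Each source record becomes one target record whose extra field is named after the symbol value.

```python
source_ticker = [
    {"time": "0900", "symbol": "MSFT", "price": 27.20},
    {"time": "0900", "symbol": "IBM", "price": 120.00},
    {"time": "0905", "symbol": "MSFT", "price": 27.30},
]


def to_heterogeneous(rows):
    """Turn each symbol value into the field label of its own target record."""
    records = [{"time": r["time"], r["symbol"]: r["price"]} for r in rows]
    # The Choice in the computed schema lists the labels seen in the data,
    # kept in first-seen order.
    choice = list(dict.fromkeys(r["symbol"] for r in rows))
    return records, choice


het_records, het_choice = to_heterogeneous(source_ticker)
print(het_choice)
for rec in het_records:
    print(rec)
```

Both the target records and the Choice in the target schema are computed from the source data at run time.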

Data-to-Metadata: Homogeneous records

There are two possible interpretations for the target ndos. Consider this mapping and this source instance:

Source Instance:
StockTicker (time: 0900, Symbol: MSFT, Price: 27.20)
StockTicker (time: 0900, Symbol: IBM, Price: 120.00)
StockTicker (time: 0905, Symbol: MSFT, Price: 27.30)

Second alternative: Homogeneous target records

Computed Target Instance:
Stockquotes (time: 0900, MSFT: 27.20, IBM: null)
Stockquotes (time: 0900, MSFT: null, IBM: 120.00)
Stockquotes (time: 0905, MSFT: 27.30, IBM: null)

Computed Target Schema:
Target: Rcd
  Stockquotes: SetOf Rcd
    time
    MSFT
    IBM
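A minimal Python sketch of the homogeneous interpretation, again assuming dictionary records and a hypothetical helper name (`to_homogeneous`): the set of labels is collected first, and every target record then carries all of them, with `None` standing in for null.

```python
source_ticker = [
    {"time": "0900", "symbol": "MSFT", "price": 27.20},
    {"time": "0900", "symbol": "IBM", "price": 120.00},
    {"time": "0905", "symbol": "MSFT", "price": 27.30},
]


def to_homogeneous(rows):
    """Give every target record the same labels, padding with None (null)."""
    # First pass: the labels that define the computed target schema.
    labels = list(dict.fromkeys(r["symbol"] for r in rows))
    records = []
    for r in rows:
        rec = {"time": r["time"]}
        for label in labels:
            rec[label] = r["price"] if r["symbol"] == label else None
        records.append(rec)
    return records, ["time"] + labels


hom_records, hom_schema = to_homogeneous(source_ticker)
print(hom_schema)
for rec in hom_records:
    print(rec)
```

Unlike the heterogeneous case, this needs two passes over the source: the full set of labels must be known before the first target record can be emitted.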

Data-to-Metadata: Homogeneous records

The homogeneous approach is a MAD improvement

Homogeneous target records:
Stockquotes (time: 0900, MSFT: 27.20, IBM: null)
Stockquotes (time: 0900, MSFT: null, IBM: 120.00)
Stockquotes (time: 0905, MSFT: 27.30, IBM: null)

Homogeneity Constraint: “For every pair of tuples t1 and t2, if a is a label in t1, then a is a label in t2”

Heterogeneous target records:
Stockquotes (time: 0900, MSFT: 27.20)
Stockquotes (time: 0900, IBM: 120.00)
Stockquotes (time: 0905, MSFT: 27.30)

Natural solution for semi-structured data models (XSD, DTD, JSON)
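The homogeneity constraint is easy to check mechanically. Because it must hold for every ordered pair of tuples, it amounts to all tuples having exactly the same set of labels. The sketch below assumes dictionary records; `is_homogeneous` is an illustrative name.

```python
def is_homogeneous(records):
    """True if every record carries exactly the same set of labels."""
    label_sets = {frozenset(rec) for rec in records}
    return len(label_sets) <= 1


homogeneous = [
    {"time": "0900", "MSFT": 27.20, "IBM": None},
    {"time": "0900", "MSFT": None, "IBM": 120.00},
    {"time": "0905", "MSFT": 27.30, "IBM": None},
]
heterogeneous = [
    {"time": "0900", "MSFT": 27.20},
    {"time": "0900", "IBM": 120.00},
    {"time": "0905", "MSFT": 27.30},
]

print(is_homogeneous(homogeneous))
print(is_homogeneous(heterogeneous))
```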

MAD Mapping

MetadatA-Data (MAD) mapping in three steps:

1. Preliminary mapping

How do we map the Source schema to the Target schema?

The preliminary mapping for <<D>> includes the metadata label and the value label of <<D>>.

Preliminary Mapping

Source: Rcd
  SalesByCountries: SetOf Rcd
    month
    USA
    UK
    Italy

Target: Rcd
  Sales: SetOf Rcd
    month
    country
    units

Placeholder: countries (label, value)

{ $x1 in Source.SalesByCountries, $x2 in <<countries>>; $x3 := $x1.($x2) }

Source.SalesByCountries
month   USA   UK    Italy
Jan     120   223   89
Feb     83    168   56

Target.Sales
month   country   units
Jan     USA       120
Jan     UK        223
Jan     Italy     89
Feb     USA       83
Feb     UK        168
Feb     Italy     56

Label Value Transfer
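Read as a set of variable bindings, the preliminary mapping above can be sketched as a Python generator. The representation is an assumption made for illustration: `$x1` is a source record, `$x2` a label from the countries placeholder, and `$x3` the value selected by that label.

```python
sales_by_countries = [
    {"month": "Jan", "USA": 120, "UK": 223, "Italy": 89},
    {"month": "Feb", "USA": 83, "UK": 168, "Italy": 56},
]
countries = ["USA", "UK", "Italy"]  # labels grouped by the placeholder


def source_preliminary_mapping(rows, labels):
    """Yield bindings ($x1, $x2, $x3) with $x3 := $x1.($x2)."""
    for x1 in rows:        # $x1 in Source.SalesByCountries
        for x2 in labels:  # $x2 in <<countries>>
            x3 = x1[x2]    # $x3 := $x1.($x2)
            yield x1, x2, x3


bindings = list(source_preliminary_mapping(sales_by_countries, countries))
for x1, x2, x3 in bindings:
    print(x1["month"], x2, x3)
```

Each binding exposes both the metadata label (`$x2`) and its value (`$x3`), which is what later lets a label be transferred into a target value.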

MAD Mapping

2. Skeletons:

An n × m matrix of skeletons is constructed from the n source preliminary mappings and the m target preliminary mappings; each entry (i, j) is a potential mapping.

3. Creating MAD Mapping:

At this stage, the value correspondences need to be matched against the preliminary mappings in order to factor them into the appropriate skeletons.

Source.Sales.country → Target.CountrySales.country

• The source side is matched against one or more source mappings

• The target side is matched against one or more target mappings

Source.SalesByCountries.<<countries>> → Target.Sales.country

Source.SalesByCountries.&<<countries>> → Target.Sales.units

MAD Mapping Generation Example

Source: Rcd
  SalesByCountries: SetOf Rcd
    month
    USA
    UK
    Italy

Target: Rcd
  Sales: SetOf Rcd
    month
    country
    units

Placeholder: countries (label, value)

Source: { $x1 in Source.SalesByCountries, $x2 in <<countries>>; $x3 := $x1.($x2) }

Target: { $y1 in Target.Sales }
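The skeleton and factoring steps can be illustrated with a deliberately simplified Python sketch. It assumes that each preliminary mapping is summarised by the set of schema paths it can reach and that a value correspondence belongs to every skeleton covering both of its ends; the classes `Preliminary` and `Skeleton` and the functions below are hypothetical and do not reproduce the actual generation algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Preliminary:
    name: str
    paths: frozenset  # schema paths this preliminary mapping reaches (assumed)


@dataclass
class Skeleton:
    source: Preliminary
    target: Preliminary
    correspondences: list = field(default_factory=list)


def build_skeletons(sources, targets):
    """n x m matrix: entry (i, j) pairs source i with target j."""
    return [[Skeleton(s, t) for t in targets] for s in sources]


def factor(matrix, correspondences):
    """Attach each value correspondence to every skeleton covering both ends."""
    for src_path, tgt_path in correspondences:
        for row in matrix:
            for sk in row:
                if src_path in sk.source.paths and tgt_path in sk.target.paths:
                    sk.correspondences.append((src_path, tgt_path))
    return matrix


sources = [Preliminary("S1", frozenset({
    "Source.SalesByCountries.<<countries>>",
    "Source.SalesByCountries.&<<countries>>",
}))]
targets = [Preliminary("T1", frozenset({
    "Target.Sales.country",
    "Target.Sales.units",
}))]
value_correspondences = [
    ("Source.SalesByCountries.<<countries>>", "Target.Sales.country"),
    ("Source.SalesByCountries.&<<countries>>", "Target.Sales.units"),
]

matrix = factor(build_skeletons(sources, targets), value_correspondences)
print(len(matrix), len(matrix[0]), matrix[0][0].correspondences)
```

With one source and one target preliminary mapping, the matrix has a single entry, and both value correspondences are factored into it.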

MAD vs Clio

[Figure: Source schema S and Target schema T; GUI; Declarative (internal) representation; Executable code (XSLT, XQuery, Java)]

• New construct to iterate over elements’ labels: placeholder

• Target schema can be incomplete: nested dynamic output schema (ndos)

• New mapping & query generation algorithms

• Data exchange with data-metadata support: Data to Data is a special case

Fin.