CSE 636 Data Integration



CSE 636 Data Integration

Overview

Fall 2006

2

What is Data Integration?

The problem of providing
• uniform (sources transparent to user)
• access to (query, and eventually updates too)
• multiple (even 2 is a problem!)
• autonomous (must not affect the behavior of the sources)
• heterogeneous (different data models, schemas)
• structured (at least semistructured)
• data sources (not only databases)

3

Motivation

• Enterprise data integration; web-site construction
• World-wide web:
  – comparison shopping (Netbot, Junglee)
  – portals integrating data from multiple sources
  – XML integration
• Science & culture:
  – Medical genetics: integrating genomic data
  – Astrophysics: monitoring events in the sky
  – Environment: Puget Sound Regional Synthesis Model
  – Culture: uniform access to all the cultural databases produced by different countries

4

Principal Dimensions of Data Integration

• Virtual vs. materialized architecture
• Access: query only or query & update?
  – problem similar to updating through views
  – need distributed transactional services
• Mediated schema: yes or no?
  – Mediated schema requires schema integration and then query reformulation
  – Without mediated schema, we lose some of the advantages of data integration

5

Data Warehouse Architecture

[Diagram: Data Sources → ETL Tools (Extract-Transform-Load) and Data Cleaning → Relational Database (Warehouse) → OLAP / Decision Support, Data Cubes / Data Mining → Users and Applications]

6

7

Virtual Integration Architecture

• Leave the data in the sources
• When a query comes in:
  – Determine the sources relevant to the query
  – Break down the query into sub-queries for the sources
  – Get the answers from the sources, filter them if needed, and combine them appropriately
• Data is fresh
• Otherwise known as On Demand Integration
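The query-time steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sources are hypothetical in-memory lists, and the names SOURCES, relevant_sources, sub_query and answer are invented for the example, not taken from the course material.

```python
# Hypothetical in-memory "sources"; the data stays where it is.
SOURCES = {
    "store_a": [
        {"title": "On the Road", "price": 10.86},
        {"title": "Big Sur", "price": 9.50},
    ],
    "store_b": [
        {"title": "On the Road", "price": 8.86},
    ],
    "weather": [
        {"city": "Buffalo", "temp_f": 31},
    ],
}


def relevant_sources(attributes):
    """Step 1: keep only sources whose rows carry every requested attribute."""
    return [
        name for name, rows in SOURCES.items()
        if rows and all(attr in rows[0] for attr in attributes)
    ]


def sub_query(source, title):
    """Steps 2-3: ask one source for its rows and filter them by title."""
    return [
        row for row in SOURCES[source]
        if row["title"].lower() == title.lower()
    ]


def answer(title):
    """Step 4: combine the per-source answers, tagging each with its origin."""
    combined = []
    for source in relevant_sources(["title", "price"]):
        for row in sub_query(source, title):
            combined.append({"source": source, **row})
    return combined


result = answer("on the road")
```

Because nothing is copied into a warehouse, every call to answer reads the current contents of the sources, which is the "data is fresh" property.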

8

Virtual Integration Architecture

Design-Time

[Diagram: End Users and Applications query a Global Schema; Schema Mappings relate the Global Schema to the Local Schema of each Data Source]

Sources can be:
• Relational DBs
• Excel Files
• Web Sites
• Web Services

9

Schema Mappings

Inventory Database A
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Inventory Database B
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  Authors(ISBN, FirstName, LastName)
  BookCategories(ISBN, Category)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  Artists(ASIN, ArtistName, GroupName)
  CDCategories(ASIN, Category)

• Differences in:
  – Names in schema
  – Attribute grouping
  – Coverage of databases
  – Granularity and format of attributes

10

Issues for Schema Mappings

Design-Time

• What formalisms to express them?
• How to create them?
• Can we discover them somehow?
• How do we use them?


11

Virtual Integration Architecture

Run-Time

[Diagram: a Query against the Global Schema enters the Mediator, which performs Reformulation, Optimization, and Execution; Wrappers connect the Mediator to the Data Sources and their Local Schemas; the Result is returned]

12

Issues for Query Processing

Reformulation

• User queries refer to the global schema
• Data is stored in the sources in a local schema
• Rewriting algorithms

[Diagram: the Query enters the Mediator at Reformulation, between the Global Schema and the Local Schemas of the Data Sources]

13

Issues for Query Processing

Reformulation

Global Schema
  Books(Title, ISBN, Price, DiscountPrice, Edition)

    SELECT ISBN, Price
    FROM Books
    WHERE Title = 'on the road'

Local Schema A
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

    SELECT ItemID, SuggestedPrice
    FROM BooksAndMusic
    WHERE Title = 'on the road'
      AND ItemType = 'Books'
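A minimal sketch of how such a rewriting could be driven by a column mapping. It assumes a hypothetical mapping structure (MAPPING_A) and a hypothetical helper (reformulate) that handles only single-table selection queries; real rewriting algorithms are far more general. Values are pasted into the SQL text purely for readability, and a real system would use parameterized queries.

```python
# Hypothetical mapping from the global relation Books to Local Schema A.
MAPPING_A = {
    "table": "BooksAndMusic",
    "columns": {"ISBN": "ItemID", "Price": "SuggestedPrice", "Title": "Title"},
    # Extra condition restricting the local table to rows that are books.
    "restriction": ("ItemType", "Books"),
}


def reformulate(select, where, mapping):
    """Rewrite a selection query over Books into a query over one source.

    `select` lists global column names; `where` maps global column names
    to the string value each must equal.
    """
    cols = mapping["columns"]
    select_list = ", ".join(cols[c] for c in select)
    conditions = [f"{cols[c]} = '{v}'" for c, v in where.items()]
    r_col, r_val = mapping["restriction"]
    conditions.append(f"{r_col} = '{r_val}'")
    return (
        f"SELECT {select_list} FROM {mapping['table']} "
        f"WHERE {' AND '.join(conditions)}"
    )


local_sql = reformulate(["ISBN", "Price"], {"Title": "on the road"}, MAPPING_A)
print(local_sql)
```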

14

Issues for Query Processing

Query Translation

• Different query languages

[Diagram: Mediator pipeline (Reformulation, Optimization, Execution); the Wrapper translates the Query for a Data Source]

15

Issues for Query Processing

Query Translation

Global Schema
  Books(Title, ISBN, Price, DiscountPrice, Edition)

    SELECT ISBN, Price
    FROM Books
    WHERE Title = 'on the road'

Local Source A

    http://www.amazon.com/homepage.html?ItemType=Books&Title=on+the+road
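A wrapper for a web source might perform this translation by encoding the query's conditions as URL parameters. The sketch below uses Python's standard urllib.parse.urlencode; the base URL and parameter names are copied from the slide, while the function name translate_to_url is an assumption made for the example.

```python
from urllib.parse import urlencode

# Base URL and parameter names as they appear on the slide.
BASE_URL = "http://www.amazon.com/homepage.html"


def translate_to_url(title, item_type="Books"):
    """Turn the condition Title = <title> into a request URL for the source."""
    params = {"ItemType": item_type, "Title": title}
    # urlencode escapes each value and encodes spaces as '+'.
    return f"{BASE_URL}?{urlencode(params)}"


url = translate_to_url("on the road")
print(url)
```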

16

Issues for Query Processing

Data Translation

• Different data models

[Diagram: Mediator pipeline (Reformulation, Optimization, Execution); the Wrapper translates the data returned by a Data Source]

17

Issues for Query Processing

Data Translation

Local Result A

<table> <tr> <td> <a href=/details?isbn=123> <b>On the Road</b> </a> -- by Jack Kerouac; Paperback <br> <a href=/details?isbn=123> Buy new </a> :<b class=price>$10.86</b> </td> </tr></table>

Global Schema
  Books(Title, ISBN, Price, DiscountPrice, Edition)

Title        ISBN  Price  …  …
On the Road  123   10.86  …  …
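The extraction step could look roughly like the following, which pulls the three fields out of the HTML fragment from the slide with regular expressions. This is a deliberately fragile sketch: the patterns are assumptions fitted to this one fragment, the name extract_book is invented, and a production wrapper would use a proper HTML parser.

```python
import re

# The HTML fragment shown on the slide, joined into one string.
LOCAL_RESULT_A = (
    "<table> <tr> <td> <a href=/details?isbn=123> <b>On the Road</b> </a> "
    "-- by Jack Kerouac; Paperback <br> <a href=/details?isbn=123> Buy new </a> "
    ":<b class=price>$10.86</b> </td> </tr></table>"
)


def extract_book(html):
    """Map one HTML result cell to a tuple of the global Books relation."""
    title = re.search(r"<a href=[^>]*>\s*<b>(.*?)</b>", html)
    isbn = re.search(r"isbn=(\d+)", html)
    price = re.search(r"<b class=price>\$([\d.]+)</b>", html)
    if not (title and isbn and price):
        return None  # the page layout did not match our assumptions
    return {
        "Title": title.group(1),
        "ISBN": isbn.group(1),
        "Price": float(price.group(1)),
    }


book = extract_book(LOCAL_RESULT_A)
```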

18

Issues for Query Processing

Query Execution

• Access as many data sources as needed
• Duplicate/redundant and irrelevant data
• Limited query capabilities

[Diagram: Mediator pipeline (Reformulation, Optimization, Execution); the Query is executed against the Data Sources through their Wrappers]

19

Issues for Query Processing

Limited Query Capabilities

Global Schema
  Books(Title, ISBN, Price, DiscountPrice, Edition)

    SELECT ISBN, Price, DiscountPrice
    FROM Books
    WHERE Title = 'on the road'

Local Schema A
  BooksAndMusic(Title, Author, ItemID, ItemType, SuggestedPrice)
  Supported query:
    SELECT ItemID, SuggestedPrice
    FROM BooksAndMusic
    WHERE Title = ?

Local Schema B
  DiscountBooks(Title, Edition, ISBN, GreatPrice)
  Supported query:
    SELECT GreatPrice
    FROM DiscountBooks
    WHERE ISBN = ?

Execution (the steps are labeled A to E on the slide):

Query sent to source A:
    SELECT ItemID, SuggestedPrice
    FROM BooksAndMusic
    WHERE Title = 'on the road'

Result from source A:
    ItemID  SuggestedPrice
    123     10.86

Query sent to source B, using the ItemID returned by source A:
    SELECT GreatPrice
    FROM DiscountBooks
    WHERE ISBN = 123

Result from source B:
    GreatPrice
    8.86

Final result:
    ISBN  Price  DiscountPrice
    123   10.86  8.86
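The plan on this slide feeds values returned by one source into the parameterized query of the other. A sketch of that idea, assuming two hypothetical in-memory tables and invented function names (source_a, source_b, books_by_title), each function exposing only the single query form its source supports:

```python
# Hypothetical in-memory stand-ins for the two sources on the slide.
BOOKS_AND_MUSIC = [
    {"Title": "On the Road", "ItemID": 123, "ItemType": "Books",
     "SuggestedPrice": 10.86},
]
DISCOUNT_BOOKS = [
    {"Title": "On the Road", "Edition": "2nd", "ISBN": 123, "GreatPrice": 8.86},
]


def source_a(title):
    """Source A only answers: ItemID, SuggestedPrice WHERE Title = ?"""
    return [
        (row["ItemID"], row["SuggestedPrice"])
        for row in BOOKS_AND_MUSIC
        if row["Title"].lower() == title.lower()
    ]


def source_b(isbn):
    """Source B only answers: GreatPrice WHERE ISBN = ?"""
    return [row["GreatPrice"] for row in DISCOUNT_BOOKS if row["ISBN"] == isbn]


def books_by_title(title):
    """Answer SELECT ISBN, Price, DiscountPrice FROM Books WHERE Title = ?"""
    results = []
    # Source B cannot be queried by title, so source A must go first;
    # each ItemID it returns is then bound to B's ISBN parameter.
    for isbn, price in source_a(title):
        for discount in source_b(isbn):
            results.append((isbn, price, discount))
    return results


rows = books_by_title("on the road")
```

The order of the two calls is forced by the sources' capabilities, not chosen for efficiency: the title is the only binding available at the start, and only source A accepts it.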

20

Issues for Query Processing

Query Answering

• Combine the results and further process them if needed
• Mainly union and merge
• Inconsistencies

[Diagram: Mediator pipeline (Reformulation, Optimization, Execution); answers from the Data Sources return through the Wrappers and are combined into the Result]

21

Issues for Query Processing

Query Answering (Union)

Result from one source:
    ItemID  SuggestedPrice
    123     10.86

Result from another source:
    ISBN  GreatPrice
    456   8.86

Union, in the global schema:
    ISBN  Price
    123   10.86
    456   8.86
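A small sketch of the union step: each source's rows are first renamed into the global (ISBN, Price) form and then combined as sets, which also removes exact duplicates. The helper name to_global is an assumption made for the example.

```python
def to_global(rows, key_column, price_column):
    """Rename one source's columns into global (ISBN, Price) tuples."""
    return {(row[key_column], row[price_column]) for row in rows}


result_a = [{"ItemID": 123, "SuggestedPrice": 10.86}]
result_b = [{"ISBN": 456, "GreatPrice": 8.86}]

# Set union drops exact duplicates; sorting only makes the output stable.
union = sorted(
    to_global(result_a, "ItemID", "SuggestedPrice")
    | to_global(result_b, "ISBN", "GreatPrice")
)
```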

22

Issues for Query Processing

Query Answering (Merge)

Result from one source (primary key: ItemID):
    ItemID  Title
    123     On the Road

Result from another source (primary key: ISBN):
    ISBN  Edition  Price
    123   2nd      8.86

Merged on the primary key (ISBN):
    ISBN  Title        Edition  Price
    123   On the Road  2nd      8.86
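Merging amounts to joining the partial records on their keys. The sketch below assumes the two key columns (ItemID and ISBN) identify the same books; merge_on_key is a hypothetical helper. It behaves like a full outer join, so a key present in only one source yields a partial record.

```python
def merge_on_key(left, left_key, right, right_key, out_key="ISBN"):
    """Combine rows that share a key value into one record per key."""
    merged = {}
    for row in left:
        record = {k: v for k, v in row.items() if k != left_key}
        merged.setdefault(row[left_key], {}).update(record)
    for row in right:
        record = {k: v for k, v in row.items() if k != right_key}
        merged.setdefault(row[right_key], {}).update(record)
    return [{out_key: key, **attrs} for key, attrs in merged.items()]


titles = [{"ItemID": 123, "Title": "On the Road"}]
prices = [{"ISBN": 123, "Edition": "2nd", "Price": 8.86}]

merged_rows = merge_on_key(titles, "ItemID", prices, "ISBN")
```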

23

Issues for Query Processing

Query Answering (Inconsistencies)

Result from one source (primary key: ItemID):
    ItemID  Title        Edition
    123     On the Road  1st

Result from another source (primary key: ISBN):
    ISBN  Edition  Price
    123   2nd      8.86

Merged on the primary key (ISBN):
    ISBN  Title        Edition  Price
    123   On the Road  ???      8.86
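One way to surface such a conflict rather than silently picking a value is to record it during the merge. In this sketch, which assumes the same key correspondence as the slide, merge_with_conflicts is a hypothetical helper that leaves a conflicting attribute as None (the "???" above) and reports the disagreeing values; how to resolve them is left open.

```python
def merge_with_conflicts(left, left_key, right, right_key, out_key="ISBN"):
    """Merge rows by key and report attributes on which sources disagree."""
    merged, conflicts = {}, []
    for rows, key_col in ((left, left_key), (right, right_key)):
        for row in rows:
            record = merged.setdefault(row[key_col], {})
            for column, value in row.items():
                if column == key_col:
                    continue
                if column in record and record[column] != value:
                    # Keep both candidate values for later resolution.
                    conflicts.append((row[key_col], column, record[column], value))
                    record[column] = None  # unresolved
                else:
                    record[column] = value
    rows_out = [{out_key: key, **attrs} for key, attrs in merged.items()]
    return rows_out, conflicts


source_1 = [{"ItemID": 123, "Title": "On the Road", "Edition": "1st"}]
source_2 = [{"ISBN": 123, "Edition": "2nd", "Price": 8.86}]

merged_rows, conflicts = merge_with_conflicts(source_1, "ItemID", source_2, "ISBN")
```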

24

Peer-Based Integration

[Diagram: five interconnected peers (Peer 1 to Peer 5); queries can be posed at any peer]

25

Peer-Based Integration

• No need for a central mediated schema
• Peers serve as mediators for other peers
• A peer can be both a server and a client
• Semantic relationships are specified locally (between small sets of peers)
• Queries are posed using the peer’s schema
• Answers come from anywhere in the system
• This is not P2P file sharing
  – Data has rich semantics
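A toy sketch of these ideas, under strong simplifying assumptions: mappings are plain attribute renamings between neighboring peers, queries only project attributes, and the Peer class, its methods, and all data values are hypothetical. A query posed in Peer 1's schema is forwarded along local mappings, and answers are translated back at each hop.

```python
class Peer:
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows      # local data, in this peer's own schema
        self.mappings = {}    # neighbor -> {my attribute: neighbor's attribute}

    def map_to(self, neighbor, attribute_map):
        """Declare a local mapping from this peer's schema to a neighbor's."""
        self.mappings[neighbor] = attribute_map

    def query(self, attributes, visited=None):
        """Return rows for `attributes`, phrased in this peer's schema."""
        visited = visited if visited is not None else set()
        visited.add(self.name)
        answers = [
            {a: row[a] for a in attributes}
            for row in self.rows
            if all(a in row for a in attributes)
        ]
        for neighbor, amap in self.mappings.items():
            if neighbor.name in visited or not all(a in amap for a in attributes):
                continue
            # Translate the request into the neighbor's schema, then
            # translate the neighbor's answers back into ours.
            remote = neighbor.query([amap[a] for a in attributes], visited)
            answers += [{a: r[amap[a]] for a in attributes} for r in remote]
        return answers


p1 = Peer("Peer 1", [{"ISBN": 123, "Price": 10.86}])
p2 = Peer("Peer 2", [{"ItemID": 456, "SuggestedPrice": 8.86}])
p3 = Peer("Peer 3", [{"Code": 789, "Cost": 5.00}])
p1.map_to(p2, {"ISBN": "ItemID", "Price": "SuggestedPrice"})
p2.map_to(p3, {"ItemID": "Code", "SuggestedPrice": "Cost"})

answers = p1.query(["ISBN", "Price"])
```

Peer 1 has no mapping to Peer 3, yet Peer 3's data reaches it through Peer 2, which acts as a mediator for that hop.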
