CSE 636 Data Integration

Overview

Fall 2006

What is Data Integration?

The problem of providing
• uniform (sources transparent to user)
• access to (query, and eventually updates too)
• multiple (even 2 is a problem!)
• autonomous (integration must not affect the behavior of sources)
• heterogeneous (different data models, schemas)
• structured (at least semistructured)
• data sources (not only databases)

Motivation

• Enterprise data integration; web-site construction.
• World-wide web:
  – comparison shopping (Netbot, Junglee)
  – portals integrating data from multiple sources
  – XML integration
• Science & culture:
  – Medical genetics: integrating genomic data
  – Astrophysics: monitoring events in the sky
  – Environment: Puget Sound Regional Synthesis Model
  – Culture: uniform access to all the cultural databases produced by different countries.

Principal Dimensions of Data Integration

• Virtual vs. materialized architecture
• Access: query only or query & update?
  – problem similar to updating through views
  – need distributed transactional services.
• Mediated schema: yes or no?
  – Mediated schema requires schema integration and then query reformulation.
  – Without mediated schema, we lose some of the advantages of data integration.

Data Warehouse Architecture

[Figure: several data sources feed ETL tools (Extract-Transform-Load) and data cleaning, which load a relational database (the warehouse). Users and applications work against the warehouse: OLAP / decision support, data cubes / data mining.]

Virtual Integration Architecture

• Leave the data in the sources
• When a query comes in:
  – Determine the sources relevant to the query
  – Break down the query into sub-queries for the sources
  – Get the answers from the sources, filter them if needed, and combine them appropriately
• Data is fresh
• Otherwise known as On Demand Integration
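The three query-time steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source names (`store_a`, `store_b`, `music_c`), their contents, and the helper functions are hypothetical and do not come from the slides.

```python
# Hypothetical sources: the attributes each one covers, plus a callable
# standing in for "send a sub-query to the source". All data is made up.
SOURCES = {
    "store_a": {
        "attrs": {"Title", "ISBN", "Price"},
        "fetch": lambda: [
            {"Title": "On the Road", "ISBN": 123, "Price": 10.86},
            {"Title": "Big Sur", "ISBN": 456, "Price": 9.50},
        ],
    },
    "store_b": {
        "attrs": {"Title", "ISBN", "Price"},
        "fetch": lambda: [{"Title": "On the Road", "ISBN": 123, "Price": 10.86}],
    },
    "music_c": {
        "attrs": {"Album", "ASIN"},
        "fetch": lambda: [{"Album": "Kind of Blue", "ASIN": "B1"}],
    },
}


def relevant_sources(needed):
    """Step 1: keep only sources that cover every attribute the query needs."""
    return [name for name, src in SOURCES.items() if needed <= src["attrs"]]


def answer(select, where):
    """Steps 2 and 3: query each relevant source, filter, project, combine."""
    needed = set(select) | set(where)
    combined = []
    for name in relevant_sources(needed):
        for row in SOURCES[name]["fetch"]():  # sub-query to one source
            if all(row[k] == v for k, v in where.items()):  # filter
                projected = {k: row[k] for k in select}
                if projected not in combined:  # drop duplicate answers
                    combined.append(projected)
    return combined


result = answer(select=["ISBN", "Price"], where={"Title": "On the Road"})
print(result)
```

Because nothing is copied out of the sources ahead of time, every call to `answer` sees whatever the sources hold at that moment, which is the "data is fresh" property.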

Virtual Integration Architecture (Design-Time)

[Figure: end users and applications pose queries against a global schema. Schema mappings relate the global schema to the local schema of each data source.]

Sources can be:
• Relational DBs
• Excel Files
• Web Sites
• Web Services

Schema Mappings

• Differences in:
  – Names in schema
  – Attribute grouping
  – Coverage of databases
  – Granularity and format of attributes

Inventory Database A

BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Inventory Database B

Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
BookCategories(ISBN, Category)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
Artists(ASIN, ArtistName, GroupName)
CDCategories(ASIN, Category)

Issues for Schema Mappings (Design-Time)

• What formalisms to express them?
• How to create them?
• Can we discover them somehow?
• How do we use them?

[Figure: the design-time architecture: end users and applications, global schema, schema mappings, local schemas, data sources.]

Virtual Integration Architecture (Run-Time)

[Figure: a query enters the mediator, which works over the global schema in three stages: reformulation, optimization, execution. Wrappers sit between the mediator and the local schemas of the data sources. The mediator returns the result.]

Issues for Query Processing: Reformulation

• User queries refer to the global schema
• Data is stored in the sources in a local schema
• Rewriting algorithms

[Figure: mediator architecture showing the query, the reformulation stage, the global schema, the local schemas, and the data sources.]

Issues for Query Processing: Reformulation (Example)

Global Schema

Books(Title, ISBN, Price, DiscountPrice, Edition)

Local Schema A

BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Query over the global schema:

SELECT ISBN, Price
FROM Books
WHERE Title = 'on the road'

Reformulated query over Local Schema A:

SELECT ItemID, SuggestedPrice
FROM BooksAndMusic
WHERE Title = 'on the road'
AND ItemType = 'Books'
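One way to check that the two queries are equivalent is to define the global relation as a view over the local one and run both. The sketch below uses Python's built-in `sqlite3` module. The view definition is an assumed mapping (it covers only three global attributes), and the sample rows, including the "Music" row, are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BooksAndMusic (Title TEXT, Author TEXT, Publisher TEXT, "
    "ItemID INTEGER, ItemType TEXT, SuggestedPrice REAL, "
    "Categories TEXT, Keywords TEXT)"
)
# Made-up sample rows; the second one must be excluded by ItemType.
conn.executemany(
    "INSERT INTO BooksAndMusic VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("on the road", "Jack Kerouac", None, 123, "Books", 10.86, None, None),
        ("on the road", None, None, 900, "Music", 14.99, None, None),
    ],
)

# Assumed mapping: the global relation Books expressed as a view over
# Local Schema A.
conn.execute(
    "CREATE VIEW Books AS "
    "SELECT Title, ItemID AS ISBN, SuggestedPrice AS Price "
    "FROM BooksAndMusic WHERE ItemType = 'Books'"
)

global_query = "SELECT ISBN, Price FROM Books WHERE Title = 'on the road'"
local_query = (
    "SELECT ItemID, SuggestedPrice FROM BooksAndMusic "
    "WHERE Title = 'on the road' AND ItemType = 'Books'"
)

global_rows = conn.execute(global_query).fetchall()
local_rows = conn.execute(local_query).fetchall()
print(global_rows, local_rows)
```

Here SQLite unfolds the view itself. In a mediator the mapping lives outside the sources, so the unfolding has to be done by the reformulation step before anything is sent to a source.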

Issues for Query Processing: Query Translation

• Different query languages

[Figure: mediator architecture (reformulation, optimization, execution) with the query and a wrapper in front of a data source.]

Issues for Query Processing: Query Translation (Example)

Global Schema

Books(Title, ISBN, Price, DiscountPrice, Edition)

Query over the global schema:

SELECT ISBN, Price
FROM Books
WHERE Title = 'on the road'

Translated request for Local Source A:

http://www.amazon.com/homepage.html?ItemType=Books&Title=on+the+road
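A wrapper for a web source has to turn the relation name and the selection condition into URL parameters. A minimal sketch with the standard library follows; the function name `translate` and the choice of which query parts become which parameters are assumptions, while the base URL and parameter names are the ones shown above.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.amazon.com/homepage.html"  # taken from the slide


def translate(relation, title):
    """Translate SELECT ... FROM <relation> WHERE Title = <title> into a URL."""
    params = {"ItemType": relation, "Title": title}
    # urlencode escapes the values for use in a query string (spaces become '+').
    return f"{BASE_URL}?{urlencode(params)}"


url = translate("Books", "on the road")
print(url)
```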

Issues for Query Processing: Data Translation

• Different data models

[Figure: mediator architecture (reformulation, optimization, execution) with the query and a wrapper in front of a data source.]

Issues for Query Processing: Data Translation (Example)

Local Result A

<table>
  <tr>
    <td>
      <a href=/details?isbn=123> <b>On the Road</b> </a> -- by Jack Kerouac; Paperback <br>
      <a href=/details?isbn=123> Buy new </a> :<b class=price>$10.86</b>
    </td>
  </tr>
</table>

Global Schema

Books(Title, ISBN, Price, DiscountPrice, Edition)

Title        ISBN  Price  …  …
On the Road  123   10.86  …  …
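The wrapper's job here is to pull a tuple out of the HTML. The sketch below does it with a regular expression tailored to this one fragment; the pattern and the function name `extract_books` are illustrative assumptions, and a real wrapper would need something sturdier than a regex.

```python
import re

# The HTML fragment from the slide, joined into one string.
html = (
    "<table> <tr> <td> <a href=/details?isbn=123> <b>On the Road</b> </a> "
    "-- by Jack Kerouac; Paperback <br> <a href=/details?isbn=123> Buy new </a> "
    ":<b class=price>$10.86</b> </td> </tr></table>"
)

PATTERN = re.compile(
    r"isbn=(?P<isbn>\d+)>\s*<b>(?P<title>[^<]+)</b>"  # link that wraps the title
    r".*?<b class=price>\$(?P<price>[\d.]+)</b>",  # price element that follows
    re.DOTALL,
)


def extract_books(page):
    """Return one row per matched listing, using global-schema attribute names."""
    return [
        {
            "Title": m["title"].strip(),
            "ISBN": int(m["isbn"]),
            "Price": float(m["price"]),
        }
        for m in PATTERN.finditer(page)
    ]


rows = extract_books(html)
print(rows)
```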

Issues for Query Processing: Query Execution

• Access as many data sources as needed
• Duplicate/redundant and irrelevant data
• Limited query capabilities

[Figure: mediator architecture (reformulation, optimization, execution) with the query and wrappers in front of the data sources.]

Issues for Query Processing: Limited Query Capabilities

Global Schema

Books(Title, ISBN, Price, DiscountPrice, Edition)

Local Schema A

BooksAndMusic(Title, Author, ItemID, ItemType, SuggestedPrice)

Query form accepted by source A:

SELECT ItemID, SuggestedPrice
FROM BooksAndMusic
WHERE Title = ?

Local Schema B

DiscountBooks(Title, Edition, ISBN, GreatPrice)

Query form accepted by source B:

SELECT GreatPrice
FROM DiscountBooks
WHERE ISBN = ?

Query over the global schema:

SELECT ISBN, Price, DiscountPrice
FROM Books
WHERE Title = 'on the road'

Execution (the figure labels the steps A to E):

Query sent to source A:

SELECT ItemID, SuggestedPrice
FROM BooksAndMusic
WHERE Title = 'on the road'

Result from source A:

ItemID  SuggestedPrice
123     10.86

Query sent to source B, with the returned ItemID bound as ISBN:

SELECT GreatPrice
FROM DiscountBooks
WHERE ISBN = 123

Result from source B:

GreatPrice
8.86

Combined result:

ISBN  Price  DiscountPrice
123   10.86  8.86
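The execution order is forced by what each source accepts: source B can only be asked by ISBN, so source A must be queried first and its answer fed into B. A small sketch of that dependent lookup follows. The functions `source_a`, `source_b`, and `books_by_title` are hypothetical stand-ins for the two restricted interfaces and the mediator's plan; the values are the ones from the example.

```python
# Source contents, limited to the attributes and values used in the example.
BOOKS_AND_MUSIC = [{"Title": "on the road", "ItemID": 123, "SuggestedPrice": 10.86}]
DISCOUNT_BOOKS = [{"ISBN": 123, "GreatPrice": 8.86}]


def source_a(title):
    """Only form A accepts: SELECT ItemID, SuggestedPrice ... WHERE Title = ?"""
    return [
        (r["ItemID"], r["SuggestedPrice"])
        for r in BOOKS_AND_MUSIC
        if r["Title"] == title
    ]


def source_b(isbn):
    """Only form B accepts: SELECT GreatPrice ... WHERE ISBN = ?"""
    return [r["GreatPrice"] for r in DISCOUNT_BOOKS if r["ISBN"] == isbn]


def books_by_title(title):
    """Ask A by title, then bind each returned ItemID as the ISBN for B."""
    results = []
    for item_id, price in source_a(title):
        for great_price in source_b(item_id):
            results.append(
                {"ISBN": item_id, "Price": price, "DiscountPrice": great_price}
            )
    return results


combined = books_by_title("on the road")
print(combined)
```

Note that a book missing from source B drops out of the result in this sketch; whether to keep it with an empty DiscountPrice is a design decision the slides do not settle.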

Issues for Query Processing: Query Answering

• Combine the results and further process them if needed
• Mainly union and merge
• Inconsistencies

[Figure: mediator architecture (reformulation, optimization, execution) with wrappers in front of the data sources; the mediator returns the query result.]

Issues for Query Processing: Query Answering (Union)

Result from the first source:

ItemID  SuggestedPrice
123     10.86

Result from the second source:

ISBN  GreatPrice
456   8.86

Union:

ISBN  Price
123   10.86
456   8.86
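A union only makes sense once both results use the same attribute names. The sketch below renames each source's columns to the global ones and then concatenates without duplicates. The two rename tables are assumed correspondences read off the example, and the helper names are hypothetical.

```python
source_a_rows = [{"ItemID": 123, "SuggestedPrice": 10.86}]
source_b_rows = [{"ISBN": 456, "GreatPrice": 8.86}]

# Assumed attribute correspondences from each local schema to the global one.
A_TO_GLOBAL = {"ItemID": "ISBN", "SuggestedPrice": "Price"}
B_TO_GLOBAL = {"ISBN": "ISBN", "GreatPrice": "Price"}


def to_global(rows, mapping):
    """Rename every attribute of every row to its global-schema name."""
    return [{mapping[k]: v for k, v in row.items()} for row in rows]


def union(*row_sets):
    """Concatenate the row sets, keeping the first copy of each distinct row."""
    seen, out = set(), []
    for rows in row_sets:
        for row in rows:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                out.append(row)
    return out


unioned = union(
    to_global(source_a_rows, A_TO_GLOBAL),
    to_global(source_b_rows, B_TO_GLOBAL),
)
print(unioned)
```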

Issues for Query Processing: Query Answering (Merge)

Result from the first source (primary key: ItemID):

ItemID  Title
123     On the Road

Result from the second source (primary key: ISBN):

ISBN  Edition  Price
123   2nd      8.86

Merged on the primary key (ISBN):

ISBN  Title        Edition  Price
123   On the Road  2nd      8.86
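Merging differs from union in that rows with the same key are combined into one wider row. A minimal sketch, assuming that ItemID in the first source and ISBN in the second identify the same book (the function name `merge` is hypothetical):

```python
source_a_rows = [{"ItemID": 123, "Title": "On the Road"}]
source_b_rows = [{"ISBN": 123, "Edition": "2nd", "Price": 8.86}]


def merge(a_rows, b_rows):
    """Combine rows that share a key; rows found in only one source are kept."""
    merged = {}
    for row in a_rows:
        # ItemID is assumed to play the role of ISBN in the global schema.
        merged[row["ItemID"]] = {"ISBN": row["ItemID"], "Title": row["Title"]}
    for row in b_rows:
        target = merged.setdefault(row["ISBN"], {"ISBN": row["ISBN"]})
        target.update(row)
    return list(merged.values())


merged_rows = merge(source_a_rows, source_b_rows)
print(merged_rows)
```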

Issues for Query Processing: Query Answering (Inconsistencies)

Result from the first source (primary key: ItemID):

ItemID  Title        Edition
123     On the Road  1st

Result from the second source (primary key: ISBN):

ISBN  Edition  Price
123   2nd      8.86

Merged on the primary key (ISBN); the sources disagree on Edition:

ISBN  Title        Edition  Price
123   On the Road  ???      8.86
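A merge step can at least detect such disagreements, even if it cannot resolve them. The sketch below marks a conflicting attribute with `???`, as in the example, and reports which attributes clashed. The function name and the choice of marker are assumptions; the first row is shown with ItemID already renamed to ISBN.

```python
CONFLICT = "???"


def merge_with_conflicts(a_row, b_row):
    """Merge two rows for the same key, flagging attributes that disagree."""
    merged = dict(a_row)
    conflicts = []
    for attr, value in b_row.items():
        if attr in merged and merged[attr] != value:
            merged[attr] = CONFLICT  # the sources disagree on this attribute
            conflicts.append(attr)
        else:
            merged[attr] = value
    return merged, conflicts


a = {"ISBN": 123, "Title": "On the Road", "Edition": "1st"}
b = {"ISBN": 123, "Edition": "2nd", "Price": 8.86}

merged, conflicts = merge_with_conflicts(a, b)
print(merged, conflicts)
```

What to do next (prefer one source, keep both values, ask the user) is a policy choice that the slides leave open.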

Peer-Based Integration

[Figure: a network of five peers (Peer 1 to Peer 5); queries are posed at individual peers.]

Peer-Based Integration

• No need for a central mediated schema
• Peers serve as mediators for other peers
• A peer can be both a server and a client
• Semantic relationships are specified locally (between small sets of peers)
• Queries are posed using the peer’s schema
• Answers come from anywhere in the system
• This is not P2P file sharing.
  – Data has rich semantics
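To make the idea concrete, here is a toy sketch of a query travelling through peers along local mappings. Everything in it is hypothetical: the three peers, their one-attribute schemas, the directional mappings, and the breadth-first forwarding strategy are illustrative assumptions rather than a description of any particular peer data management system.

```python
from collections import deque

# Each peer stores rows under its own attribute name (made-up data).
PEER_ROWS = {
    "Peer 1": [{"Title": "On the Road"}],
    "Peer 2": [{"Name": "On the Road"}, {"Name": "Big Sur"}],
    "Peer 3": [{"BookTitle": "On the Road"}],
}

# Mappings are specified locally, between pairs of neighbouring peers.
MAPPINGS = {
    ("Peer 1", "Peer 2"): {"Title": "Name"},
    ("Peer 2", "Peer 3"): {"Name": "BookTitle"},
}


def ask(start, attr, value):
    """Pose attr = value at `start` and return the peers that hold a match."""
    answers = []
    visited = {start}
    queue = deque([(start, attr)])
    while queue:
        peer, local_attr = queue.popleft()
        answers += [peer for row in PEER_ROWS[peer] if row.get(local_attr) == value]
        for (src, dst), rename in MAPPINGS.items():
            if src == peer and dst not in visited and local_attr in rename:
                visited.add(dst)
                # Translate the attribute into the neighbour's schema.
                queue.append((dst, rename[local_attr]))
    return answers


hits = ask("Peer 1", "Title", "On the Road")
print(hits)
```

Peer 1 has no direct mapping to Peer 3, yet the query still reaches it by composing the two local mappings, which is how answers can come from anywhere in the system.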
