CSE 636 Data Integration
Overview
Fall 2006

What is Data Integration?
The problem of providing
• uniform (sources transparent to the user)
• access to (queries, and eventually updates too)
• multiple (even 2 is a problem!)
• autonomous (without affecting the behavior of the sources)
• heterogeneous (different data models, schemas)
• structured (at least semistructured)
• data sources (not only databases)

Motivation
• Enterprise data integration; web-site construction.
• World-wide web:
  – comparison shopping (Netbot, Junglee)
  – portals integrating data from multiple sources
  – XML integration
• Science & culture:
  – Medical genetics: integrating genomic data
  – Astrophysics: monitoring events in the sky
  – Environment: Puget Sound Regional Synthesis Model
  – Culture: uniform access to all the cultural databases produced by different countries.

Principal Dimensions of Data Integration
• Virtual vs. materialized architecture
• Access: query only or query & update?
  – problem similar to updating through views
  – need distributed transactional services.
• Mediated schema: yes or no?
  – Mediated schema requires schema integration and then query reformulation.
  – Without mediated schema, we lose some of the advantages of data integration.

Data Warehouse Architecture
[Figure: Data Sources feed ETL Tools (Extract-Transform-Load) and Data Cleaning, which load a Relational Database (Warehouse); Users and Applications use it for OLAP / Decision Support and Data Cubes / Data Mining]

Virtual Integration Architecture
• Leave the data in the sources
• When a query comes in:
  – Determine the sources relevant to the query
  – Break down the query into sub-queries for the sources
  – Get the answers from the sources, filter them if needed, and combine them appropriately
• Data is fresh
• Otherwise known as On Demand Integration
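
The query-time steps above can be sketched in a few lines of Python. This is a minimal illustration, not the architecture's actual implementation: the `Source` class, the `answer` function, and the in-memory rows are all hypothetical, and a query is reduced to a single `attribute = value` selection.

```python
from typing import Dict, List

Row = Dict[str, object]


class Source:
    """Hypothetical data source holding rows in memory (an assumption for this sketch)."""

    def __init__(self, name: str, rows: List[Row]):
        self.name = name
        self.rows = rows

    def attributes(self) -> set:
        # All attribute names this source can return.
        return set().union(*(r.keys() for r in self.rows))

    def select(self, attr: str, value: object) -> List[Row]:
        # The sub-query a source can answer: a simple selection.
        return [r for r in self.rows if r.get(attr) == value]


def answer(sources: List[Source], attr: str, value: object, wanted: List[str]) -> List[Row]:
    # Step 1: determine the sources relevant to the query.
    relevant = [s for s in sources if {attr, *wanted} <= s.attributes()]
    results: List[Row] = []
    for source in relevant:
        # Step 2: send each relevant source its sub-query.
        for row in source.select(attr, value):
            # Step 3: filter the answer down to the requested attributes.
            projected = {k: row[k] for k in wanted}
            # Step 4: combine, dropping duplicates across sources.
            if projected not in results:
                results.append(projected)
    return results
```

The data stays in the sources; nothing is copied ahead of time, which is why the answers are always fresh.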

Virtual Integration Architecture
Design-Time
[Figure: End Users and Applications work against a Global Schema; Schema Mappings relate the Global Schema to the Local Schema of each Data Source]
Sources can be:
• Relational DBs
• Excel Files
• Web Sites
• Web Services

Schema Mappings
• Differences in:
  – Names in schema
  – Attribute grouping
  – Coverage of databases
  – Granularity and format of attributes
Inventory Database A
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  Authors(ISBN, FirstName, LastName)
  BookCategories(ISBN, Category)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  Artists(ASIN, ArtistName, GroupName)
  CDCategories(ASIN, Category)

Issues for Schema Mappings
Design-Time
• What formalisms to express them?
• How to create them?
• Can we discover them somehow?
• How do we use them?
[Figure: design-time architecture with End Users, Applications, Global Schema, Schema Mappings, Local Schemas and Data Sources]

Virtual Integration Architecture
Run-Time
[Figure: a Query enters the Mediator, which holds the Global Schema and performs Reformulation, Optimization and Execution; Wrappers connect the Mediator to the Local Schema of each Data Source; the Result is returned]

Issues for Query Processing
Reformulation
• User queries refer to the global schema
• Data is stored in the sources in a local schema
• Rewriting algorithms
[Figure: mediator architecture (Query, Global Schema, Local Schemas, Data Sources), showing the Reformulation step]

Issues for Query Processing
Reformulation
Global Schema
  Books(Title, ISBN, Price, DiscountPrice, Edition)
Local Schema A
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Query on the global schema:
  SELECT ISBN, Price
  FROM Books
  WHERE Title = 'on the road'
Reformulated query on Local Schema A:
  SELECT ItemID, SuggestedPrice
  FROM BooksAndMusic
  WHERE Title = 'on the road'
  AND ItemType = 'Books'
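
The rewriting in this example can be driven by a small mapping table. The sketch below is one possible way to do it, not the rewriting algorithm the slides refer to: the `Mapping` class, `BOOKS_TO_A` and `reformulate` are hypothetical names, and the attribute correspondences (ISBN to ItemID, Price to SuggestedPrice) are read off the two queries above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Mapping:
    """Hypothetical mapping from one global relation to one local table."""
    local_table: str
    columns: Dict[str, str]                 # global attribute -> local attribute
    extra_filters: List[Tuple[str, str]]    # conditions that carve the global relation out of the local table


# Assumed correspondences between global Books and local BooksAndMusic.
BOOKS_TO_A = Mapping(
    local_table="BooksAndMusic",
    columns={"Title": "Title", "ISBN": "ItemID", "Price": "SuggestedPrice"},
    extra_filters=[("ItemType", "Books")],
)


def reformulate(select: List[str], where: List[Tuple[str, str]], m: Mapping) -> str:
    # Rename the projected attributes into the local schema.
    cols = ", ".join(m.columns[c] for c in select)
    # Rename the selection attributes, then add the mapping's own filters.
    conds = [(m.columns[a], v) for a, v in where] + m.extra_filters
    # String interpolation keeps the sketch short; real code should use
    # parameterized queries instead of building SQL text.
    where_sql = " AND ".join(f"{a} = '{v}'" for a, v in conds)
    return f"SELECT {cols} FROM {m.local_table} WHERE {where_sql}"


print(reformulate(["ISBN", "Price"], [("Title", "on the road")], BOOKS_TO_A))
```

The `ItemType = 'Books'` condition does not come from the user query; it is part of the mapping, because the local table also stores music.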

Issues for Query Processing
Query Translation
• Different query languages
[Figure: mediator architecture (Query, Reformulation, Optimization, Execution, Wrapper, Global Schema, Local Schemas, Data Sources), showing the Query Translation step]

Issues for Query Processing
Query Translation
Global Schema
  Books(Title, ISBN, Price, DiscountPrice, Edition)
Query on the global schema:
  SELECT ISBN, Price
  FROM Books
  WHERE Title = 'on the road'
Translated request for Local Source A:
  http://www.amazon.com/homepage.html?ItemType=Books&Title=on+the+road
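
A wrapper for a web source has to turn the selection conditions into the source's own request format. As a minimal sketch, the standard-library `urlencode` function can build the query string shown above. The base URL and parameter names are taken from the slide; the `to_url` helper is a hypothetical name.

```python
from urllib.parse import urlencode

# URL of the source's search page, as given in the example above.
BASE_URL = "http://www.amazon.com/homepage.html"


def to_url(item_type: str, conditions: dict) -> str:
    # The item type plays the role of the table restriction;
    # the remaining conditions become query-string parameters.
    params = {"ItemType": item_type, **conditions}
    # urlencode escapes the values and encodes spaces as '+'.
    return f"{BASE_URL}?{urlencode(params)}"


print(to_url("Books", {"Title": "on the road"}))
```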

Issues for Query Processing
Data Translation
• Different data models
[Figure: mediator architecture (Query, Reformulation, Optimization, Execution, Wrapper, Global Schema, Local Schemas, Data Sources), showing the Data Translation step]

Issues for Query Processing
Data Translation
Local Result A
  <table> <tr> <td>
    <a href=/details?isbn=123> <b>On the Road</b> </a> -- by Jack Kerouac; Paperback <br>
    <a href=/details?isbn=123> Buy new </a> :<b class=price>$10.86</b>
  </td> </tr></table>
Global Schema
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  Title        ISBN  Price  …  …
  On the Road  123   10.86  …  …
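
Going from the HTML result to a tuple of the global schema is the wrapper's job. The sketch below extracts Title, ISBN and Price with a regular expression written for exactly the markup in this example; the pattern and the `extract_books` name are assumptions, and a real wrapper would need to be far more tolerant of layout changes.

```python
import re

# The local result from the example above, as a single string.
HTML = (
    "<table> <tr> <td> <a href=/details?isbn=123> <b>On the Road</b> </a> "
    "-- by Jack Kerouac; Paperback <br> <a href=/details?isbn=123> Buy new </a> "
    ":<b class=price>$10.86</b> </td> </tr></table>"
)

# One match per book: the ISBN from the link, the bold title, the price cell.
ROW_PATTERN = re.compile(
    r"<a href=/details\?isbn=(?P<isbn>\d+)>\s*<b>(?P<title>[^<]+)</b>"
    r".*?<b class=price>\$(?P<price>[\d.]+)</b>",
    re.DOTALL,
)


def extract_books(html: str) -> list:
    # Convert each match into a row over the global attributes.
    return [
        {"Title": m["title"].strip(), "ISBN": int(m["isbn"]), "Price": float(m["price"])}
        for m in ROW_PATTERN.finditer(html)
    ]


print(extract_books(HTML))
```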

Issues for Query Processing
Query Execution
• Access as many data sources as needed
• Duplicate/redundant and irrelevant data
• Limited query capabilities
[Figure: mediator architecture (Query, Reformulation, Optimization, Execution, Wrappers, Global Schema, Local Schemas, Data Sources), showing the Query Execution step]

Issues for Query Processing
Limited Query Capabilities
Global Schema
  Books(Title, ISBN, Price, DiscountPrice, Edition)
Local Schema A
  BooksAndMusic(Title, Author, ItemID, ItemType, SuggestedPrice)
  Supported query:
    SELECT ItemID, SuggestedPrice
    FROM BooksAndMusic
    WHERE Title = ?
Local Schema B
  DiscountBooks(Title, Edition, ISBN, GreatPrice)
  Supported query:
    SELECT GreatPrice
    FROM DiscountBooks
    WHERE ISBN = ?
Query on the global schema:
  SELECT ISBN, Price, DiscountPrice
  FROM Books
  WHERE Title = 'on the road'
Execution (steps labeled A to E in the figure):
  Query sent to source A:
    SELECT ItemID, SuggestedPrice
    FROM BooksAndMusic
    WHERE Title = 'on the road'
  Result from source A:
    ItemID  SuggestedPrice
    123     10.86
  Query sent to source B, with the ISBN taken from the result of A:
    SELECT GreatPrice
    FROM DiscountBooks
    WHERE ISBN = 123
  Result from source B:
    GreatPrice
    8.86
  Combined result:
    ISBN  Price  DiscountPrice
    123   10.86  8.86
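
The execution order in this example is forced by the sources: B can only be queried once an ISBN is known, so the values returned by A are fed into B one at a time. This pattern is often called a dependent join or bind join (a name not used in the slides). The sketch below simulates both sources with in-memory lists; the function names are hypothetical, and each function exposes only the one query form its source supports.

```python
# In-memory stand-ins for the two sources, using the rows from the example.
BOOKS_AND_MUSIC = [{"Title": "on the road", "ItemID": 123, "SuggestedPrice": 10.86}]
DISCOUNT_BOOKS = [{"ISBN": 123, "GreatPrice": 8.86}]


def source_a_by_title(title):
    # Only supported form: SELECT ItemID, SuggestedPrice ... WHERE Title = ?
    return [
        {"ItemID": r["ItemID"], "SuggestedPrice": r["SuggestedPrice"]}
        for r in BOOKS_AND_MUSIC
        if r["Title"] == title
    ]


def source_b_by_isbn(isbn):
    # Only supported form: SELECT GreatPrice ... WHERE ISBN = ?
    return [{"GreatPrice": r["GreatPrice"]} for r in DISCOUNT_BOOKS if r["ISBN"] == isbn]


def books_by_title(title):
    answers = []
    # Query A first, because the title is the only binding we have.
    for a in source_a_by_title(title):
        # Each ItemID from A becomes the ISBN parameter of a query to B.
        for b in source_b_by_isbn(a["ItemID"]):
            answers.append(
                {"ISBN": a["ItemID"], "Price": a["SuggestedPrice"], "DiscountPrice": b["GreatPrice"]}
            )
    return answers


print(books_by_title("on the road"))
```

Querying B first is not an option here, since the global query supplies no ISBN to bind its parameter.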

Issues for Query Processing
Query Answering
• Combine the results and further process them if needed
• Mainly union and merge
• Inconsistencies
[Figure: mediator architecture (Query, Result, Reformulation, Optimization, Execution, Wrappers, Global Schema, Local Schemas, Data Sources), showing the Query Answering step]

Issues for Query Processing
Query Answering (Union)
Local results:
  ItemID  SuggestedPrice
  123     10.86

  ISBN  GreatPrice
  456   8.86
Combined result:
  ISBN  Price
  123   10.86
  456   8.86
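
A union of local results only makes sense after each result has been renamed into the global attributes. The sketch below does both steps on the rows from this example; `rename` and `union` are hypothetical helper names, and the attribute correspondences are read off the tables above.

```python
def rename(rows, mapping):
    # Translate local attribute names into global ones.
    return [{mapping[k]: v for k, v in row.items()} for row in rows]


def union(*results):
    # Concatenate the results, keeping the first copy of each distinct row.
    seen, combined = set(), []
    for rows in results:
        for row in rows:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                combined.append(row)
    return combined


result_a = [{"ItemID": 123, "SuggestedPrice": 10.86}]
result_b = [{"ISBN": 456, "GreatPrice": 8.86}]

combined = union(
    rename(result_a, {"ItemID": "ISBN", "SuggestedPrice": "Price"}),
    rename(result_b, {"ISBN": "ISBN", "GreatPrice": "Price"}),
)
print(combined)
```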

Issues for Query Processing
Query Answering (Merge)
Local results:
  ItemID (primary key)  Title
  123                   On the Road

  ISBN (primary key)  Edition  Price
  123                 2nd      8.86
Combined result:
  ISBN (primary key)  Title        Edition  Price
  123                 On the Road  2nd      8.86
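
Merging combines rows from different sources that describe the same object, identified by the primary key. A minimal sketch, assuming ItemID and ISBN denote the same key (as the tables above suggest) and using a hypothetical `merge_on_key` helper:

```python
def merge_on_key(left, right, left_key, right_key, out_key):
    # Index the right-hand rows by their primary key for direct lookup.
    right_by_key = {row[right_key]: row for row in right}
    merged = []
    for l_row in left:
        r_row = right_by_key.get(l_row[left_key])
        if r_row is None:
            continue  # no matching object in the other source
        row = {out_key: l_row[left_key]}
        # Copy the non-key attributes from both sides.
        row.update({k: v for k, v in l_row.items() if k != left_key})
        row.update({k: v for k, v in r_row.items() if k != right_key})
        merged.append(row)
    return merged


titles = [{"ItemID": 123, "Title": "On the Road"}]
prices = [{"ISBN": 123, "Edition": "2nd", "Price": 8.86}]
print(merge_on_key(titles, prices, "ItemID", "ISBN", "ISBN"))
```

This version keeps only objects present in both sources; keeping unmatched rows with missing values would be an equally reasonable choice.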

Issues for Query Processing
Query Answering (Inconsistencies)
Local results:
  ItemID (primary key)  Title        Edition
  123                   On the Road  1st

  ISBN (primary key)  Edition  Price
  123                 2nd      8.86
Combined result:
  ISBN (primary key)  Title        Edition  Price
  123                 On the Road  ???      8.86
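
When two sources disagree on an attribute of the same object, the mediator cannot pick a value without a resolution policy. The sketch below only detects the conflict: it leaves the disputed attribute empty and reports both candidate values. The `merge_with_conflicts` name is hypothetical, and both rows are assumed to be already renamed into the global attributes.

```python
def merge_with_conflicts(a, b):
    merged, conflicts = {}, {}
    # Visit a's attributes first, then the ones only b has.
    for attr in list(a) + [k for k in b if k not in a]:
        if attr in a and attr in b and a[attr] != b[attr]:
            merged[attr] = None                   # the "???" cell in the table above
            conflicts[attr] = (a[attr], b[attr])  # keep both values for later resolution
        else:
            merged[attr] = a[attr] if attr in a else b[attr]
    return merged, conflicts


row_a = {"ISBN": 123, "Title": "On the Road", "Edition": "1st"}
row_b = {"ISBN": 123, "Edition": "2nd", "Price": 8.86}
merged, conflicts = merge_with_conflicts(row_a, row_b)
print(merged)
print(conflicts)
```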

Peer-Based Integration
[Figure: a network of five peers (Peer 1 through Peer 5), with queries posed at individual peers]

Peer-Based Integration
• No need for a central mediated schema
• Peers serve as mediators for other peers
• A peer can be both a server and a client
• Semantic relationships are specified locally (between small sets of peers)
• Queries are posed using the peer’s schema
• Answers come from anywhere in the system
• This is not P2P file sharing.
  – Data has rich semantics
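
The points above can be illustrated with a toy network in which each peer knows only the attribute mappings to its direct neighbors. Everything in this sketch is hypothetical: the `Peer` class, the attribute names, and the made-up rows. A query is a single `attribute = value` selection posed in the schema of the peer that receives it.

```python
class Peer:
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows        # data in this peer's own schema
        self.neighbors = []     # (peer, {my attribute: neighbor's attribute})

    def connect(self, other, mapping):
        # Semantic relationships are specified locally, pairwise.
        self.neighbors.append((other, mapping))

    def query(self, attr, value, visited=None):
        visited = set() if visited is None else visited
        if self.name in visited:
            return []           # avoid cycles in the peer network
        visited.add(self.name)
        # Act as a server: answer from local data.
        answers = [r for r in self.rows if r.get(attr) == value]
        # Act as a client and mediator: forward the translated query.
        for peer, mapping in self.neighbors:
            if attr not in mapping:
                continue
            back = {theirs: mine for mine, theirs in mapping.items()}
            for row in peer.query(mapping[attr], value, visited):
                # Translate the neighbor's answer back into this peer's schema.
                answers.append({back[k]: v for k, v in row.items() if k in back})
        return answers


# Made-up example data and mappings.
p1 = Peer("Peer 1", [{"ISBN": 123, "Title": "On the Road"}])
p2 = Peer("Peer 2", [{"ItemID": 456, "Name": "Big Sur"}])
p3 = Peer("Peer 3", [{"Code": 789, "Label": "Big Sur"}])
p1.connect(p2, {"ISBN": "ItemID", "Title": "Name"})
p2.connect(p3, {"ItemID": "Code", "Name": "Label"})
print(p1.query("Title", "Big Sur"))
```

Peer 1 has no mapping to Peer 3, yet it still receives Peer 3's answer, because Peer 2 mediates between them.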