CSE 636 Data Integration Data Integration Approaches.
-
Upload
marianna-patchin -
Category
Documents
-
view
235 -
download
1
Transcript of CSE 636 Data Integration Data Integration Approaches.
CSE 636Data Integration
Data Integration Approaches
2
Virtual Integration Architecture
• Leave the data in the sources• When a query comes in:
– Determine the relevant sources to the query– Break down the query into sub-queries for the sources– Get the answers from the sources, filter them if needed
and combine them appropriately
• Data is fresh• Otherwise known as
On Demand Integration
3
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query Result
Wrapper Wrapper
End User
Design-Time
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
4
Design-Time
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query Result
Wrapper Wrapper
End User
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
2
5
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query Result
Wrapper Wrapper
End User
Design-Time
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
2
3
6
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query Result
Wrapper Wrapper
End User
Design-Time
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
2
3
4
7
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query Result
Wrapper Wrapper
End User
Design-Time
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
2
5
3
4
8
Mediator
Virtual Integration Architecture
DataSource
DataSource
GlobalSchema
LocalSchema
LocalSchema
Query ResultEnd User
Wrapper Wrapper
Design-Time
MediationLanguage
Mapping Tool
Run-Time
QueryReformulation
Optimization& Execution
XML
Web Services
1
2
5
63
4
9
Dimensions to Consider:• How many sources are we accessing?• How autonomous are they?• Meta-data about sources?• Is the data structured?• Queries or also updates?• Requirements: accuracy, completeness,
performance, handling inconsistencies.• Closed world assumption vs. open world?
Virtual Integration Approaches
10
Logic
Mediation Languages
AuthorsISBNFirstNameLastName
BooksTitleISBNPriceDiscountPriceEdition
BookCategoriesISBNCategory
CDCategoriesASINCategory
ArtistsASINArtistNameGroupName
CDsAlbumASINPriceDiscountPriceStudio
Global Schema
CDASINTitleGenre…
ArtistASINName…
11
• Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources.
• Easy addition: make it easy to add new data sources.
• Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.
Desiderata from Source Descriptions
12
Given:• A query Q posed over the global schema• Descriptions of the data sourcesFind:• A query Q’ over the data source relations, such
that:– Q’ provides only correct answers to Q, and– Q’ provides all possible answers from to Q given the
sources.
Reformulation Problem
13
Languages for Schema Mapping
Mediated Schema
Q
Q’ Q’ Q’ Q’ Q’
GAVLAV GLAV
Source Source Source Source Source
LocalSchema
LocalSchema
LocalSchema
LocalSchema
LocalSchema
MediatorGlobal
Schema
14
Global-as-View (GAV)
Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time)
Integrating View:Create View Movie AS
SELECT * FROM S1 [S1(title,dir,year,genre)]
union
SELECT * FROM S2 [S2(title,dir,year,genre)]
union
SELECT S3.title, S3.dir, S4.year, S4.genre
FROM S3, S4 [S3(title,dir),
WHERE S3.title = S4.title S4(title,year,genre)]
15
Global-as-View: Example 2
Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time)
Integrating View:Create View Movie AS
SELECT title, dir, year, NULL
FROM S1 [S1(title,dir,year)]
union
SELECT title, dir, NULL, genre
FROM S2 [S2(title,dir,genre)]
16
Global-as-View: Example 3
Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time)
Integrating Views:Create View Movie AS
SELECT NULL, NULL, NULL, genre
FROM S4 [S4(cinema, genre)]
Create View Schedule AS
SELECT cinema, NULL, NULL
FROM S4 [S4(cinema, genre)]
But what if we want to find which cinemas are playing comedies?
17
Global-as-View Summary
• Query reformulation boils down to view unfolding.
• Very easy conceptually.• Can build hierarchies of global schemas.• You sometimes loose information. Not always
natural.• Adding sources is hard. Need to consider all
other sources that are available.
18
Local-as-View (LAV)
Mediated Schema
Source1
Source2
Source3
Source4
Source5
Local Schema LocalSchema
LocalSchema
LocalSchema
MediatorGlobal Schema
BookISBNTitleGenreYear
AuthorISBNName
R1ISBNTitleName
Local Schema
R5ISBNTitle
Books before 1970 Humor Books
Create View R1 ASSELECT B.ISBN, B.Title, A.NameFROM Book B, Author AWHERE A.ISBN = B.ISBN AND B.Year < 1970
Create View R5 ASSELECT B.ISBN, B.TitleFROM Book BWHERE B.Genre =
‘Humor’
19
Query Reformulation
Mediated Schema
Source1
Source2
Source3
Source4
Source5
Local Schema LocalSchema
LocalSchema
LocalSchema
MediatorGlobal Schema
BookISBNTitleGenreYear
AuthorISBNName
R1ISBNTitleName
Local Schema
R5ISBNTitle
Books before 1970 Humor Books
Query: Find authors of humor books
Plan: R1 Join R5
20
Query Reformulation
Mediated Schema
Source1
Source2
Source3
Source4
Source5
Local Schema LocalSchema
LocalSchema
LocalSchema
MediatorGlobal Schema
BookISBNTitleGenreYear
AuthorISBNName
R1ISBNTitleName
Local Schema
R5ISBNTitle
Books before 1970 Humor Books
Query: Find authors of humor books before 1960
Plan: Can’t do it!
21
Local-as-View: Example 1
Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time)
Source Views:Create Source S1 AS [S1(title, dir, year, genre)]
SELECT * FROM Movie
Create Source S3 AS [S3(title, dir)]
SELECT title, dir FROM Movie
Create Source S5 AS [S5(title, dir, year)]
SELECT title, dir, year
FROM Movie
WHERE year > 1960 AND genre=‘Comedy’
22
Local-as-View: Example 2
Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time)
Source Views:Create Source S4 [S4(cinema, genre)]
SELECT cinema, genre
FROM Movie M, Schedule S
WHERE M.title=S.title
Now if we want to find which cinemas are playing comedies, there is hope!
23
• Very flexible. You have the power of the entire query language to define the contents of the source.
• Hence, can easily distinguish between contents of closely related sources.
• Adding sources is easy: they’re independent of each other.
• Query reformulation: answering queries using views!
Local-as-View Summary
24
The General Problem
• Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn?
• Many, many papers on this problem• The best performing algorithm:
The MiniCon Algorithm (Pottinger & Halevy, VLDB 2000)
25
Local Completeness Information
• If sources are incomplete, we need to look at each one of them.
• Often, sources are locally complete.• Movie(title, director, year) complete for years
after 1960, or for American directors.• Question: given a set of local completeness
statements, is a query Q’ a complete answer to Q?
26
• Movie(title, director, year)– complete after 1960
• Show(title, theater, city, hour)• Query: find movies (and directors) playing in
Seattle: SELECT M.title, M.director
FROM Movie M, Show S
WHERE M.title=S.title
AND city=‘Seattle’• Complete or not?
Example
27
• Movie(title, director, year), Oscar(title, year)• Query: find directors whose movies won Oscars
after 1965: SELECT M.director
FROM Movie M, Oscar O
WHERE M.title=O.title
AND M.year=O.year
AND O.year > 1965• Complete or not?
Example #2
28
References
• Information integration– Maurizio Lenzerini
– Eighteenth International Joint Conference on Artificial Intelligence, IJCAI 2003
– Invited Tutorial
• Data Integration: a Status Report– Alon Halevy
– German Database Conference (BTW), 2003– Invited Talk