1
Corso di Rappresentazione della Informazione e della Conoscenza
Anno Accademico 2007-2008Matteo Palmonari
Query Processing in Data Integration
Materiale organizzato da presentazioni di: L. Tanca, S. Costantini, J. Sun & L. Zhao, M. Van
Der Wielen, Ir. R. Vdovjak, Kambhampati & Knoblock, S. Bergamaschi
2
An introduction to data integration
Prof. Letizia TancaPolitecnico di Milano
Technologies for Information Systems
3
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logicus-Philosophicus within a week?”
?Information Integration
addall.com
““One-World”One-World”MediationMediation
amazon.comamazon.com A1books.comA1books.comhalf.comhalf.combarnes&noble.combarnes&noble.com
4
What is a Data Integration System?
– Uniform (same query interface to all sources)– Access to (queries; eventually updates too)– Multiple (we want many, but 2 is hard too)– Autonomous (DBA doesn’t report to you)– Heterogeneous (data models are different)– Structured (or at least semi-structured)– Data Sources (not only databases).
A system providing:
5
Virtual Integration Architecture
Datasource
wrapper
Datasource
wrapper
Datasource
wrapper
Sources can be: relational, hierarchical (IMS), structured files, web sites.
Mediator:
User queriesMediated schema
Data sourcecatalog
Reformulator
Optimizer
Execution engine
6
Query Model in Virtual Integration
• User formulates query in terms of his/her ontology on the mediated (or “global”) schema
• System reformulates queries in terms of sub-queries for each source (“local” schema)
• Structure of the query model should be more intuitive for the user
7
Reformulation Problem
• Given:– A query Q posed over the mediated schema– Descriptions of the data sources
• Find:– A query Q’ over the data source relations,
such that:• Q’ provides only correct answers to Q, and• Q’ provides all possible answers from to Q
given the sources.
8
Mapping between the global logical schema and the single source schemata
(logical view definition)
• Two basic approaches– GAV (Global As View)– LAV (Local As View)
• Can be used also in case of different data models
• In that case a model transformation is required
9
GAV (Global As View)GAV (Global As View)
• Up to now we supposed that the global schema be derived from the integration process of the data source schemata
• Thus the global schema is expressed in terms of the data source schemata
• Such approach is called the Global As View approach
10
The other possible ways…The other possible ways…
LAV (Local As View)LAV (Local As View)• The global schema has been designed
independently of the data source schemata• The relationship (mapping) between sources
and global schema is obtained by defining each data source as a view over the global schema
GLAV (Global and Local As View)GLAV (Global and Local As View)• The relationship (mapping) between sources
and global schema is obtained by defining a set of views, some over the global schema and some over the data sources
11
• Global schema GG• Source schemata schemata SS • Mapping Mapping MM between sources and global between sources and global
schema: a set of assertions schema: a set of assertions
qqS S q qGG
qqG G q qSS
Intuitively, the first assertion specifies that the Intuitively, the first assertion specifies that the concept represented by a view (query) qconcept represented by a view (query) qSS over a over a
source schema S corresponds to the concept source schema S corresponds to the concept specified by qspecified by qG G over the global schema. over the global schema.
Viceversa for the second assertion.Viceversa for the second assertion.
Mapping between data sources and global schema
12
• A data integration system is a triple A data integration system is a triple (G, S, M)(G, S, M)
• The query to the integrated system are posed The query to the integrated system are posed in terms of G and specify which data of the in terms of G and specify which data of the virtual database we are interested invirtual database we are interested in
• The problem is understanding which real data The problem is understanding which real data (in the data sources) correspond to those (in the data sources) correspond to those virtual datavirtual data
Mapping between data sources and mediated schema
13
GAVGAV• A GAV mapping is a set of assertions, one
for each element g of G
g g q qSS
That is, the mapping specifies g as a query qS over the data sources. This means that the mapping tells us exactly how the element g is computed.
OK for stable data sourcesOK for stable data sources Difficult to extend with a new data sourceDifficult to extend with a new data source
14
GAV example
SOURCE 1
Product(Code, Name, Description, Warnings, Notes, CatID)Category(ID, Name, Description)Version(ProductCode, VersionCode, Size, Color, Name,
Description, Stock, Price)
SOURCE 2
Product(Code, Name, Size, Color, Description, Type, Price, Q.ty)Tipe(TypeCode, Name, Description)
n.b.: WE DO NOT CARE ABOUT DATA TYPES…
15
SOURCE 1
Product(Code, Name, Description, Warnings, Notes, CatID)Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock,
Price)
SOURCE 2
Product(Code, Name, Size, Color, Description, Type, Price, Q.ty)
GLOBAL SCHEMA
CREATE VIEW GLOB-PROD ASSELECT Code AS PCode, VersionCode as VCode, Version.NAme AS Name,
Size, Color, Version.Description as Description, CatID, Version.Price, Stock
FROM SOURCE1.Product, SOURCE1.VersionWHERE Code = ProductCodeUNIONSELECT Code AS PCode, null as VCode, Name, Size, Color, Description,Type
as CatID, Price, Q.ty AS StockFROM SOURCE2.Product
16
GAVGAV
• Suppose now we introduce a new sourceSuppose now we introduce a new source• The simple view we have just created is The simple view we have just created is
to be modified to be modified • In the simplest case we only need to add In the simplest case we only need to add
a union with a new SELECT-FROM-a union with a new SELECT-FROM-WHERE clauseWHERE clause
• This is not true in general, view This is not true in general, view definitions may be much more complexdefinitions may be much more complex
17
GAV
• Quality depends on how well we have compiled the sources into the global schema through the mapping
• Whenever a source changes or a new one is added, the global schema needs to be reconsidered
• Query processing can be based on some sort of unfolding
• Example: one already seen
18
LAV
The Local-as-View approach, see Figure, describes local sources as a view defined in terms of the global schema. The global schema is predefined and for each local source is described how it delivers information to the global schema.Each mapping associates the entities in the source schemas by way of a query over the global schema. Thereby, each source schema is defined as a view over the global schema, hence the name “Local-as-View”.
19
LAVLAVA mapping LAV is a set of assertions, one for
each element s of each source S
s s q qGG
Thus the content of each source is characterized in terms of a view qG over the global schema
• OK if the global schema is stable, e.g. based on a OK if the global schema is stable, e.g. based on a domain ontology or an enterprise model domain ontology or an enterprise model
• It favours extensibility It favours extensibility • Query processing much more complexQuery processing much more complex
20
LAV
• Quality depends on how well we have characterized the sources
• High modularity and extensibility (if the global schema is well designed, when a source changes or is added, only its definition is to be updated)
• Query processing needs reasoning
21
SOURCE 1
Product(Code, Name, Description, Warnings, Notes, CatID)Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock,
Price)
SOURCE 2
Product(Code, Name, Size, Color, Description, Type, Price, Q.ty)
GLOBAL SCHEMA
GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock)
In this case we have to express the sources as views over the global schema
22
GLOBAL SCHEMA
GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock)
SOURCE 2
Product (Code, Name, Size, Color, Description, Type, Price, Q.ty)
CREATE VIEW SOURCE2.Product ASSELECT PCode AS Code, Name, Size, Color,
Description, CatID as Type, Price, Stock AS Q.ty
FROM GLOB-PROD
23
GLOBAL SCHEMA
GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock)
SOURCE 1
Product(Code, Name, Description, Warnings, Notes, CatID)Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock,
Price)
CREATE VIEW SOURCE1.Product ASSELECT Pcode AS Code, ?Name?, ?Description?, ? Warnings ?, ?Notes?,
CatIDFROM GLOB-PROD
CREATE VIEW SOURCE1. Version ASSELECT Pcode AS ProductCode, VCode as VersionCode, Size, Color, Name, ?
Description ? , Stock, Price)FROM GLOB-PROD
N.B.: Some information is lacking: either we don’t know where description has to be taken from, or there is no correspondent value (e.g.Warnings, Notes, etc..). A difficult job
24
The GLAV approach
• Is a combination of LAV and GAV. Part of the mappings are of LAV type, part of GAV type.
• It combines the advantages of the two approaches.
Kambhampati & Knoblock Information Integration on the Web (MA-1) 25
Approaches for relating source & Mediator Schemas
• Global-as-view (GAV): express the mediated schema relations as a set of views over the data source relations
• Local-as-view (LAV): express the source relations as views over the mediated schema.
• Can be combined…?
CREATE VIEW Seattle-view AS
SELECT buyer, seller, product, store FROM Person, Purchase WHERE Person.city = “Seattle” AND Person.name = Purchase.buyer
We can later use the views: SELECT name, store FROM Seattle-view, Product WHERE Seattle-view.product = Product.name AND Product.category = “shoes”
Virtual vsMaterialized
“View” Refresher
Let’s compare them in a movie
Database integration scenario..
Differences minor for data aggregation…
Kambhampati & Knoblock Information Integration on the Web (MA-1) 26
Global-as-ViewMediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title
Express mediator schemarelations as views oversource relations
Kambhampati & Knoblock Information Integration on the Web (MA-1) 27
Global-as-ViewMediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title
Express mediator schemarelations as views oversource relations
Mediator schema relations are Virtual views on source relations
Kambhampati & Knoblock Information Integration on the Web (MA-1) 28
Global-as-View: Example 2Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Create View Movie AS [S1(title,dir,year)] select title, dir, year, NULL from S1 union [S2(title, dir,genre)] select title, dir, NULL, genre from S2
Null values
Express mediator schemarelations as views oversource relations
Kambhampati & Knoblock Information Integration on the Web (MA-1) 29
Global-as-View: Example 2Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Source S4: S4(cinema, genre)Create View Movie AS
select NULL, NULL, NULL, genre
from S4
Create View Schedule AS
select cinema, NULL, NULL
from S4.
But what if we want to find which cinemas are playing comedies?
“Lossy Mediation”
Express mediator schemarelations as views oversource relations
Kambhampati & Knoblock Information Integration on the Web (MA-1) 30
Local-as-View: example 1Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
S1(title,dir,year,genre)
S3(title,dir)
S5(title,dir,year), year >1960
Create Source S1 AS
select * from Movie
Create Source S3 AS
select title, dir from Movie
Create Source S5 AS
select title, dir, year
from Movie
where year > 1960 AND genre=“Comedy”Sources are “materialized views” ofmediator schema
Express source schemarelations as views overmediator relations
Kambhampati & Knoblock Information Integration on the Web (MA-1) 31
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
S1(title,dir,year,genre)
S3(title,dir)
S5(title,dir,year), year >1960
Create Source S1 AS
select * from Movie
Create Source S3 AS
select title, dir from Movie
Create Source S5 AS
select title, dir, year
from Movie
where year > 1960 AND genre=“Comedy”
Creat Source S4 AS
select cinema, genre
from Movie m, Schedule s
where m.title=s.title
S4(Cinema,Genre)
Express source schemarelations as views overmediator relations
Kambhampati & Knoblock Information Integration on the Web (MA-1) 32
Local-as-View: Example 2Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Source S4: S4(cinema, genre)Create Source S4
select cinema, genre
from Movie m, Schedule s
where m.title=s.title
Now if we want to find which cinemas are playing comedies, there is hope!
Express source schemarelations as views overmediator relations
Kambhampati & Knoblock Information Integration on the Web (MA-1) 33
GAV vs. LAVMediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create View Movie AS
select NULL, NULL, NULL, genre
from S4
Create View Schedule AS
select cinema, NULL, NULL
from S4.
But what if we want to find which cinemas are playing comedies?
Create Source S4
select cinema, genre
from Movie m, Schedule s
where m.title=s.title
Now if we want to find which cinemas are playing comedies, there is hope!
Source S4: S4(cinema, genre)
Lossy mediation
Kambhampati & Knoblock Information Integration on the Web (MA-1) 34
GAV vs. LAV• Not modular
– Addition of new sources changes the mediated schema
• Can be awkward to write mediated schema without loss of information
• Query reformulation easy– reduces to view unfolding
(polynomial)– Can build hierarchies of
mediated schemas
• Best when– Few, stable, data sources– well-known to the mediator (e.g.
corporate integration)• Garlic, TSIMMIS,
HERMES, MOMIS
• Modular--adding new sources is easy
• Very flexible--power of the entire query language available to describe sources
• Reformulation is hard– Involves answering queries only
using views (can be intractable—see below)
• Best when– Many, relatively unknown data
sources– possibility of addition/deletion of
sources• Information Manifold,
InfoMaster, Emerac, Havasu
35
Views: Sound & Completeness• The terms sound, exact and complete are used to express to what degree the
extent of a view corresponds to its definition. • A view defined over some data source is sound if it provides a subset of the
available data in the data source that corresponds to the definition. – It delivers only, but not necessarily all answers to its definition. The answers it does
deliver might be incomplete, but they are correct (sound).
• If a view is complete, it provides a superset of the available data in the data source that corresponds to the definition. It delivers all answers to its definition, and maybe more.
– Since the set of answers might contain more than the answers corresponding to the views definition, but does contain all answers corresponding to the views definition, the term complete is being used.
• A view is exact if it provides all and only data corresponding to the definition.
36
Query answering in GAV
Query answering in the Global-as-View happens through query unfolding.
Unfolding queries to the local sources is relatively easy compared to the Local as-View approach in which queries posted to the global schema have to be rewritten before being able to answer the query.
37
Query answering in GAV
• Because each global entity is defined as a view over the local schemas, the total number of GAV-rules necessary to define the mappings in a data-integration system corresponds to the number of entities in the global schema.
• the extent of a global schema defined by GAV-rules is assumed to be exact.
38
Query answering in LAV
The Local-as-View approach, see Figure, describes local sources as a view defined in terms of the global schema. The global schema is predefined and for each local source is described how it delivers information to the global schema.Each mapping associates the entities in the source schemas by way of a query over the global schema. Thereby, each source schema is defined as a view over the global schema, hence the name “Local-as-View”.
39
Query answering in LAV
• Each entity in the local schemas is defined as a view over the global schema. Therefore, the total number of LAV-rules necessary to define the mappings in a data-integration system corresponds to the number of entities in the local schemas.
• Processing queries in the Local-as-View approach is difficult because the only knowledge of the global-schema available is through the views representing the sources. Such view only provides partial information about the data. Since the mapping associated to each source as a view over the global schema it is not clear how to use the sources in order to answer queries expressed over the global schema. Therefore extracting information from an integration system using the Local-as-View approach is a complex task because one has to answer queries with incomplete information.
• Views defined by LAV-rules might be sound or complete, in terms of 3.1.
40
IntroductionIntroduction
Traditional Integration Formalisms
Global as View (GAV)
Mediated schema as views over data sources
Local as View (LAV)
Data sources as views of mediated schema
Med. Schema TGAV:
T :- S1, S2, S3
LAV:
S1
S1 S2 S3
41
Some DB Theory Schema Conjunctive queries
For example:Give me students together
with their advisors who took courses given by their advisors after the 1st trimester 2003.
namearea
name
Faculty StudentAdvises
* *
CourseRegisteredTeaches
* *
trimester trimestertitlec-number
* *
42
DB Theory SQLSELECT Advises.prof, Advises.studentFROM Registered, Teaches, AdvisesWHERE Registered.c-number = Teaches.c-number and
Registered.trimester=Teaches.trimester and Advises.prof=Teaches.prof and Advises.student=Registered.student and Registered.trimester > “2003\1”
DATALOGQ(prof, student) :-
Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\1”
head
body
43
Problem DefinitionProblem Definition
GAV-like Definition – Definition in Datalog
9DC : SkilledPerson(PID, “Doctor”) : -
H : Doctor(SID, h, l, s, e)
9DC : SkilledPerson(PID, “EMT”) : -
H : EMT(SID, h, vid, s, e)
9DC : SkilledPerson(PID, “EMT”) : -
FS : Schedule(PID, vid),
FS : FirstResponse(vid, s, l, d),
FS : Skills(PID, “medical”)
44
Problem DefinitionProblem Definition LAV-like Definition – Inclusion Definition
LH : CritBed(bed, hosp, room, PID, status)
H : CritBed(bed, hosp, room),
H : Patient(PID, bed, status)
LH : EmergBed(bed, hosp, room, PID, status)
H : EmergBed(bed, hosp, room),
H : Patient(PID, bed, status)
45
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
Deductive Databases» Tables viewed as predicates. » Ops on tables expressed as “datalog” rules
› (Horn clauses, without function symbols)
Enames(Name) :- Employe(Name, SSN) [Projection]
Wealthy-Employee(Name):- Employee(Name,SSN), Salary(SSN,Money),Money> 10 [Selection]Ed(Name, Dname):- Employee(Name, SSN), E_Dependents(SSN, Dname) [Join]
ERelated(Name,Dname) :- Ed(Name,Dname)ERelated(Name,Dname) :- Ed(Name,D1), ERelated(D1,D2) [Recursion]
46
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
More datalog terminology
A datalog program is a set of datalog rules.
A program with one rule is a conjunctive query.
We distinguish EDB predicates and IDB predicates
» EDB’s are stored in the database, › appear only in rule bodies
» IDB’s are intensionally defined, › appear in both bodies and heads.
47
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
Approaches to Specifying Schema Descriptions
» Global-as-View: express the mediated schema relations as a set of views over the data source relations
» Local-as-View: express the source relations as views over the mediated schema.
48
DB Theory, query containment A query Q’ is contained in Q if for all databases D, the
set of tuples computed for Q’ is a subset of those computed for Q, i.e., Q’ Q iff D Q’(D) Q(D)
A query Q’ is equivalent to Q iff
Q’ Q and Q Q’
Example:Q’(prof, student) :-
Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\2”
is contained in Q(prof, student) :-
Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\1”
49
Query containment check Create a canonical database D that is the “frozen” body of Q’ Compute Q(D) If Q(D) contains the frozen head of Q’ then Q’ Q otherwise not
Example:Q’: p3(x,y) :- arc(x,z),arc(z,w),arc(w,y) Q: path(x,y) :- arc(x,y) path(x,y) :- path(x,z), path(z,y)
1. Freeze Q’ with say 0,1,2,3 for x,z,w,y respectively:Q’(0,3) :- arc(0,1), arc(1,2), arc(2,3)D={arc(0,1), arc(1,2), arc(2,3)}
2. Compute Q(D): Q(D)={(0,1), (1,2), (2,3), (0,2), (1,3), (0,3)}
3. The frozen head of Q’ = (0,3) is in the Q(D) Q’Q
p3x
zw
y
arc
arcarc
50
Query reformulation using views
Problem: Given a user query Q and view definitions V={V1..Vn} Q’ is an Equivalent Rewriting of Q using V iff
Q’ refers only to views in V, Q’ is equivalent to Q
Q’ is a Maximally-contained Rewriting of Q using V iff
Q’ refers only to views in V, Q’ Q, there is no such rewriting Q1 that Q’ Q1 Q, Q1
Q
Kambhampati & Knoblock Information Integration on the Web (MA-1) 51
Reformulation in LAV
• Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn?– Notice the use of materialized (pre-computed)
views• ONLY materialized views should be used…
• Approaches– Bucket algorithm [Levy; 96]– Inverse rules algorithm [Duschka, 99]– Hybrid versions
• SV-Bucket [2001], MiniCon [2001]
Kambhampati & Knoblock Information Integration on the Web (MA-1) 52
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
S1(title,dir,year,genre)
S3(title,dir)
S5(title,dir,year), year >1960
S1(title,dir,year,genre)
S3(title,dir)
S5(title,dir,year), year >1960
Create Source S1 AS
select * from Movie
Create Source S3 AS
select title, dir from Movie
Create Source S5 AS
select title, dir, year
from Movie
where year > 1960 AND genre=“Comedy”
Sources are “materialized views” ofVirtual schema
Create Source S1 AS
select * from Movie
Create Source S3 AS
select title, dir from Movie
Create Source S5 AS
select title, dir, year
from Movie
where year > 1960 AND genre=“Comedy”
Sources are “materialized views” ofVirtual schema
Reformulation in LAV: The issues
Query: Find all the years in which Zhang Yimou released movies.
Select year from movie M where M.dir=yimou
Q(y) :- movie(T,D,Y,G),D=yimou
Not executable
Q(y) :- S1(T,D,Y,G) , D=yimou (1)
Q(y) :- S1(T,D,Y,G) , D=yimou Q(y) :- S5(T,D,Y) , D=yimou (2)
Which is the better plan? What are we looking for? --equivalence? --containment? --Maximal Containment --Smallest plan?
Kambhampati & Knoblock Information Integration on the Web (MA-1) 53
Maximal Containment
• Query plan should be sound and complete– Sound implies that Query should be contained in the
Plan (I.e., tuples satisfying the query are subset of the tuples satisfying the plan
– Completeness?– Traditional DBs aim for equivalence
• Query contains the plan; Plan contains the query– Impossible
• We want all query tuples that can be extracted given the sources we have
– Maximal containment (no other query plan, which “contains” this plan is available)
P contains Q if P |= Q
(exponential even for
“conjunctive”
(SPJ) query plans)
Kambhampati & Knoblock Information Integration on the Web (MA-1) 54
The world of Containment
• Consider Q1(.) :- B1(.) Q2(.) :- B2(.)
• Q1 Q2 (“contained in”) if the answer to Q1 is a subset of Q2
– Basically holds if B1(x) |= B2(x)
• Given a query Q, and a query plan Q1,
– Q1 is a sound query plan if Q1 is contained in Q
– Q1 is a complete query plan if Q is contained in Q1
– Q1 is a maximally contained query plan if there is no Q2 which is a sound query plan for Q1, such that Q1 is contained in Q2
Kambhampati & Knoblock Information Integration on the Web (MA-1) 55
Computing Containment Checks
• Consider Q1(.) :- B1(.) Q2(.) :- B2(.)
• Q1 Q2 (“contained in”) if the answer to Q1 is a subset of Q2
– Basically holds if B1(x) |= B2(x)
– (but entailment is undecidable in general… boo hoo)• Containment can be checked through containment mappings, if the queries are
“conjunctive” (select/project/join queries, without constraints) [ONLY EXPONENTIAL!!!--aren’t you relieved?]
• m is a containment mapping from Vars(Q2) to Vars(Q1) if
m maps every subgoal in the body of Q2 to a subgoal in the body of Q1
m maps the head of Q2 to the head of Q1
Eg: Q1(x,y) :- R(x), S(y), T(x,y) Q2(u,v) :- R(u), S(v)
Containment mapping: [u/x ; v/y]
Kambhampati & Knoblock Information Integration on the Web (MA-1) 56
Reformulation Algorithms
• Bucket algorithm– Cartesian product of buckets – Followed by “containment” check
• Inverse Rules– plan fragments for mediator relations
V1 V2
Q(.) :- V1() & V2()S11() :- V1() S12 :- V1()S21() :- V2() S22 :- V2()S00() :- V1(), V2()
S11S12S00
S21S22S00
V1() :- S11()V1() :- S12()V1() :- S00()V2() :- S21()V2() :- S22()V2() :- S00()
Q(.) :- V1() & V2()
Bucket Algorithm Inverse Rules
[Levy] [Duschka]P1 contains P2 ifP2 |= P1
Kambhampati & Knoblock Information Integration on the Web (MA-1) 57
Movie(T,D,Y,G) :- S1(T,D,Y,G)Movie(T,D,?y,?g ) :- S3(T,D) Movie(T,D,Y,?g) :- S5(T,D,Y), y>1960
Movie(?t,?d,?y,G) :- S4(C,G)Schedule(C,?t,?ti) :- S4(C,G)
Movie(T,D,Y,G)
S1(T,D,Y,G)S3(T,D)S5(T,D,Y)S4(C,G)
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
S1(title,dir,year,genre)
S3(title,dir)
S5(title,dir,year), year >1960
Create Source S1 AS
select * from Movie
Create Source S3 AS
select title, dir from Movie
Create Source S5 AS
select title, dir, year
from Movie
where year > 1960 AND genre=“Comedy”
Creat Source S4 AS
select cinema, genre
from Movie m, Schedule s
where m.title=s.title
S4(Cinema,Genre)
S4(C,G)
SkolemizeSfs5-1(T,D,Y)
Sched(C,T,Ti)
Q(C,G) :- Movie(T,D,Y,G) , Schedule(C,T,Ti)
SkolemizeSfs4-1(G,C)
Kambhampati & Knoblock Information Integration on the Web (MA-1) 62
Complexity of finding maximallycontained plans in LAV
• Complexity does change if the sources are not “conjunctive queries”– Sources as unions of conjunctive queries (NP-hard)
• Disjunctive descriptions
– Sources as recursive queries (Undecidable)• Comparison predicates
• Complexity is less dependent on the query – Recursion okay; but inequality constraints lead to NP-hardness
• Complexity also changes based on Open vs. Closed world assumption
True source contents (of Big Two)
Advertised description “All cars”
[Abiteboul & Duschka, 98]
You can “reduce” the complexity by taking a conjunctive query that is an upperbound. This just pushes the complexity to minimization phase
Advertised description “Toyota” U “Olds”
Big Two
Kambhampati & Knoblock Information Integration on the Web (MA-1) 64
Local Completeness Information• If sources are incomplete, we need to look at each one of them.
• Often, sources are locally complete.
• Movie(title, director, year) complete for years after 1960, or for American directors.
• Question: given a set of local completeness statements, is a query Q’ a complete answer to Q?
True source contentsAdvertised description
Guarantees (LCW; Inter-source comparisons)
Problems: 1. Sources may not be interested in giving these! Need to learn hard to learn!
2. Even if sources are willing to give, there may not be any “big enough” LCWs Saying “I definitely have the car with vehicle ID XXX is useless
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
Data Integration: Global-as-View in Answer Set ProgrammingS. CostantiniUniversità degli Studi di L’[email protected]
66
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
Computing Answers in Data Integration Systems
» Answer set programming (ASP) gives a semantics to datalog (logic) programs with negation (and possibly disjunction in the heads)
» ASP-based specification of a data integration system
» ASP-based specification of extensions (plausible answers, repairs)
67
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
Computing Answers in Data Integration Systems
Methodology works for first-order queries (and Datalog extensions), and many ICs
› LAV: Bravo and Bertossi, IJCAI’03› GAV: Costantini and Formisano,
ASP’03 (experiments and a working
program)
68
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
GaV in ASP
» Instead of unfolding w.r.t. each query, helper model so as to get a general inference engine
» Easily express integrity constraints» Cope with incomplete information (also
resulting form adding new sources)
69
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
GaV: an example
70
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
Front-end: Global Schema through a Helper Modelperson(X):- entity(1,X). % meta-concepts +
namingorganization(X):- entity(2,X).member(X,Y):- relat(1,X,Y).student(X):- entity(3,X).university(X):- entity(4,X).enrolled(X,Y):- relat(2,X,Y).age(X,Y):- attr(1,X,Y).
% Constraints on student and enrolledenrolledE(X):- student(X),university(Y),enrolled(X,Y).:- not student(X),enrolled(X,Y).:- student(X),not enrolledE(X).
71
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
Back-end: Mapping to the Source Schemas
entity(1,X):- s(1,X).entity(2,X):- s(2,X).entity(4,X):- s(5,X).
% student is described implicitly as the domain of an unknown relationship
entity(3,X):- s(3,X,Y). % student is described implicitly as the
domain of one of the sources of enrolledentity(3,X):- s(4,X,Z).
72
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
Back-end: Mapping to the Source Schemas
% Relationships
relat(1,X,Y):- s(7,X,Z),s(8,Z,Y).relat(2,X,Y):- s(4,X,Y).relat(2,X,Y):-s(9,X,Y).attr(1,X,Y):- s(3,X,Y).attr(1,X,Y):- s(6,X,Y,Z).
73
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
Helper Model: Relational Inference Engine
» Meta-rules for is_a chaining:
relat(M,X,Y):- schema(N,E1,E2), entity(E1,X),
entity(E2,Y), is_a_r(N,M), relat(N,X,Y).
entity(M,X):- is_a_e(N,M), entity(N,X).
74
Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/
Dealing with Incomplete Information
» Assume to add a new source for Enrolled, that in the back-end becomes, e.g.,:
relat(2,X,Y):- s(4,X,Y).
relat(2,X,Y):- s(9,X,Y).
» No connection to student: but, one who is enrolled to a University should be a student
MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 7575
DBGroupDATABASE GROUP
MOMIS MOMIS (Mediator envirOnment for Multiple Information (Mediator envirOnment for Multiple Information
Sources)Sources)
The MOMIS Query ManagerThe MOMIS Query ManagerA query execution exampleA query execution example
http://www.dbgroup.unimo.it/Momishttp://www.dbgroup.unimo.it/Momis
MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 7676
DBGroupDATABASE GROUP
The MOMIS Query Manager reformulates the query taking into account the mappings defined among the local classes and the global classes of the GVV (Global Virtual View).The mappings are defined by using a GAV (Global as View) approach: each global class of the GVV is expressed by means of the full-disjunction operator over the local classes.
Query ExecutionQuery Execution
Query rewritingQuery rewriting GAV approachGAV approach: : the query is processed by means of
unfolding
Fusion and ReconciliationFusion and Reconciliation of the local answers into the global answer Object IdentificationObject Identification : Join conditions among local
classes Inconsistencies:Inconsistencies: Resolution functions to deal with
conflits
MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 7777
DBGroupDATABASE GROUP Query rewriting and execution Query rewriting and execution
Global Virtual View
(GVV)
query q0 = scqG1 scqG2
Hotels
Services
Hotels
Services
single class query
scqG1
single class queryscqG2
L3scqG1
L1scqG2
L2scqG1
L1scqG1
L2scqG2
VENERE
hotels
facilities
map_hotelsLocal
Schema
SAPERVIAGGIARE
hotelsLocal
Schema
GUIDACAMPEGGI
facilities
map_hotelshotels
facilities
hotels
facilities
Local Sche
ma
Query execution on the local sourcesQuery execution on the local sources
MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 7878
DBGroupDATABASE GROUP Query rewriting Query rewriting
SELECT H.name, H.address, H.city, H.price, S.facility, S.structure_name, S.structure_cityFROM hotels as H, services as SWHERE H.city = S.structure_city and H.name = S.structure_nameand H.city = 'rimini‘ and H.price > 50 and H.price < 80 and S.facility = 'air conditioning'order by H.price
q0 =
SELECT H.name , H.address , H.city , H.price FROM hotels as H WHERE (H.city = 'rimini' ) and (H.price > 50) and (H.price < 80)
scqG1 =
SELECT S.facility , S.structure_name , S.structure_city FROM services as S WHERE (S.facility = 'air conditioning')
scqG2 =
SELECT hotels.name2, hotels.address, hotels.price, hotels.city FROM hotels WHERE ((city) = ('rimini') and ((price) > (50) and (price) < (80)))
L3scqG1 =
SELECT maps_hotels.hotels_name2, maps_hotels.hotels_city FROM maps_hotels WHERE (hotels_city) = ('rimini')
L2scqG1 =
SELECT hotels.name, hotels.address, hotels.city FROM hotels WHERE (city) = ('rimini')
L1scqG1 =
SELECT facilities_hotels.hotel_name2, facilities_hotels.hotels_city, facilities_hotels.facility FROM facilities_hotels WHERE (facility) = ('air conditioning')
L1scqG2 =
SELECT facilities_campings.campings_name, facilities_campings.campings_city, facilities_campings.name FROM facilities_campings WHERE (name) = ('air conditioning')
L2scqG2 =
SINGLE CLASS QUERIES
UNFOLDING
UNFOLDING
MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 7979
DBGroupDATABASE GROUP Fusion and ReconciliationFusion and Reconciliation
L3scqG1
result set
L1scqG2
result set
L2scqG1
result set
L1scqG1result
set
L2scqG2
result set
VENERESAPERVIAGGIARE GUIDACAMPEGGI
partial
results
partial
results
partial
results
L1scqG1 full join full join L2scqG1 full joinfull join
L3scqG1
L1scqG2 full joinfull join L2scqG2
scqG1result
set
scqG2result
set
q0 result set
scqG1 joinjoin scqG2
MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 8080
DBGroupDATABASE GROUP Fusion and ReconciliationFusion and Reconciliation
scqG2 result set= L1scqG2 full join L2scqG2
guidacampeggi.facilities full outer join venere.facilities on ((venere.facilities.facility) = (guidacampeggi.facilities.name) AND (venere.facilities.hotels_city) = (guidacampeggi.facilities.campings_city) AND (venere.facilities.hotel_name2) = (guidacampeggi.facilities.campings_name))
scqG1 result set = L1scqG1 full join L2scqG1 full join L3scqG1saperviaggiare.hotels full outer join venereEn.hotels on (
((venereEn.hotels.name2) = (saperviaggiare.hotels.name) AND (venereEn.hotels.city) = (saperviaggiare.hotels.city))) full outer join venereEn.maps_hotels on (((venereEn.maps_hotels.hotels_name2) = (saperviaggiare.hotels.name) AND (venereEn.maps_hotels.hotels_city) = (saperviaggiare.hotels.city)) OR ((venereEn.maps_hotels.hotels_name2) = (venereEn.hotels.name2) AND (venereEn.maps_hotels.hotels_city) = (venereEn.hotels.city)))
SELECT H.name , H.address , H.city , H.price , S.facility , S.structure_name , S.structure_city FROM hotels as H , facilities as S WHERE (H.city = S.structure_city ) AND (H.name = S.structure_name )ORDER BY H.price ASC
q0 = scqG1 result set join scqG2 result set
MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 8181
DBGroupDATABASE GROUP Query ExecutionQuery Execution
The MOMIS Query Manager at work
Top Related