The MOMIS approach for Information Integration
description
Transcript of The MOMIS approach for Information Integration
Domenico Beneventano1
DB
Gro
up @
uni
mo
The MOMIS approach for The MOMIS approach for Information IntegrationInformation Integration
Domenico BeneventanoDomenico Beneventano
Domenico Beneventano2
DB
Gro
up @
uni
mo
Integration : motivation and problemsIntegration : motivation and problems
• Internet has as side effect: the “fragmentation” of information in data sources of small size
• Companies ask for an increased capability of analyzing data from heterogeneous data sources coming from legacy systems
• Efficient Network technologies make it convenient to develop distributed information systems
• An integration process has to take into account the problems deriving from the heterogeneity of data sources :
– Platforms Heterogeneity, it involves hardware/software/DBMSs Solution: different standards, as SQL, ODBC, XML, …
– Structural Heterogeneity, it involves the different conceptual, logic data models and the different representations of the same concept
– Semantic Heterogeneity, it involves the different meaning of terms in data sources (homonymy and synomymy)
Domenico Beneventano3
DB
Gro
up @
uni
mo
Approaches for interoperability and Approaches for interoperability and integrationintegration
• Different approaches have been proposed to guarantee interoperability and integration of data sources:
– Read-only integrated Views : Mediation. The goal of the Mediator based architectures is to obtain an integrated view, read only, of the data stored in different sources.
Essentially, it is necessary to build a component, the Mediator, able to build an integrated schema and to allow users to pose queries with respect to this schema, receive a unified answer, thus disregarding the heterogeneity of the different sources.
– Databases which share information: Federation. This architecture implement a framework that allows database to behave as in a federation. Each source may extend its schema by including subschemas (and data) of other federated databases (Information Integrator IBM).
Domenico Beneventano4
DB
Gro
up @
uni
mo
DB1
User MEDIATOR
Wrapper Wrapper Wrapper
...FileSystem
Wrapper/Mediator Architecture Wrapper/Mediator Architecture
Wrapper extracting the schema of the data sources : local schema
(there is one wrapper for each different kind of data source).
converting the local schema into the common format of the Mediator
sending a query to the data sources exporting the local queries results to the Mediator.
Domenico Beneventano5
DB
Gro
up @
uni
mo
Semantic Integration of Heterogeneous Semantic Integration of Heterogeneous DataData
Global Virtual View(GVV)
The Local Schema of each source is available (source wrapping)
Data integration provides a Global Virtual View (GVV) that is a conceptualization describing the involved sources. allows a user to raise a query and to receive a single
unified answer
Source Source Source
LocalSchema
LocalSchema
LocalSchema
Mapping
Query
Domenico Beneventano6
DB
Gro
up @
uni
mo
Main problems in data integration Main problems in data integration [Lenzerini 2003][Lenzerini 2003]
(Automatic) source wrapping
How to construct the Global Virtual View
How to discover interschema properties among the sources
and mappings between the sources and the Global Virtual
View
How to model the mappings between the sources and the
Global Virtual View
How to process updates expressed on the Global Virtual
View, and updates expressed on the sources (Schema
Evolution)
How to answer queries expressed on the Global Virtual
View (Global Query Management)
Query optimization
Data extraction, cleaning and reconciliation (Extensional
Integration)
Domenico Beneventano7
DB
Gro
up @
uni
mo
MOMIS (Mediator envirOnment for Multiple Information Sources) is a framework to perform information extraction and integration of heterogeneous data sources
Semantic Integration of Information A common data model ODLI3 derived from
ODL-ODMG : standard languages for Object DataBasesI3 (Intelligent Information Integration) is a project of the ARPA (Advanced Research Project Agency) with the objective to define the reference architecture for automatic integration of heterogeneous sources
Tool-supported techniques to construct the Global Virtual View Local Schema Annotation w.r.t. a common lexical ontology
(WordNet) Semi-automatic generation of relationships between local
schemata Clustering techniques Semi-automatic generation of mappings between the GVV
and sources
MOMISMOMIS
Domenico Beneventano8
DB
Gro
up @
uni
mo
MOMIS: Overview of the GVV-generation MOMIS: Overview of the GVV-generation process process
SYNSET2
SYNSET#
SYNSET4
SYNSET1
AUTOMATIC/MANUALANNOTATION
SEMI-AUTOMATICANNOTATION
INFERRED RELATIONSHIPS
LEXICON DERIVEDRELATIONSHIPS
SCHEMA DERIVEDRELATIONSHIPS
CommonThesaurus
COMMON THESAURUSGENERATION
USER SUPPLIEDRELATIONSHIPS
ODLI3LOCAL SCHEMA N
WRAPPING
ODLI3LOCAL SCHEMA 1
…
GVV GENERATION
MAPPING TABLES
GLOBAL CLASSES
clustersgeneration
Structuredsource
RDB
<XML>
<DATA>
Semi-StructuredSource
Domenico Beneventano9
DB
Gro
up @
uni
mo
Person(Name, address)
Professor(Name, belongs_to,rank) FK: Name REF. Person FK: belongs_to REF. Division
Student (Name, year,e-mail) FK: Name REF. Person
Takes(Name, CourseName) FK: Name REF. Student FK: CourseName REF. Course
class Person(key Name){ attribute string Name; attribute string Name; }
class Professor : Person{ attribute Division belongs_to; attribute string rank; }
class Student: Person{ attribute set <Course> takes; attribute string rank; }
class Course{ … }
• Professor IS-A Person • Properties inheritance
Wrapping: from relational to objectWrapping: from relational to objectUNI (University Source) : relational source (Relational DB) ODLI3 schema
Domenico Beneventano10
DB
Gro
up @
uni
mo
Interface Researcher{ attribute string name; attribute string e-mail; attribute set <Article> articles;}
Interface Teaching{ attribute string denomination; attribute string description;}
Interface Class : Teaching{ attribute string name; attribute integer year; attribute string specification; attribute Researcher professor;}…
Example of object sourceExample of object sourceCS (Computer Science Source) : object source (Object DB)
• complex object attributes: articles is a set of Article
• object attributes: professor is a researcher
Domenico Beneventano11
DB
Gro
up @
uni
mo
Wrapping of Web: from pages to dataWrapping of Web: from pages to data
• A large number of Web sites contain now highly structured regions– Large web sites are generated automatically (statically or
dynamically) – Regularities occur at the page level, and at the site level as
well
• We aim at leveraging these regularities in order to face the issues of extracting fine grain data at a large scale
WrapperWrapper
xml documents or relational DB
Writing and maintaining wrappers is a labor intensive (and expensive) task
Road Runner Project (Università di Roma Tre & della Basilicata)
http://www.dia.uniroma3.it/db/roadRunner/
Domenico Beneventano12
DB
Gro
up @
uni
mo
Wrapping of Web pagesWrapping of Web pages• Roadrunner: a tool to automatically generate wrappers for
web pages– Target: Data-intensive Web sites
• HTML pages generated by scripts that extract data from a large database
• Page class: collection of pages generated by the same script => similar structure
• Main Idea: exploit similarities and differences to infer a data extraction program (wrapper)
RoadrunnerWrapper generator
RoadrunnerWrapper generator
Input sample pages
Output wrapperOutput wrapperOutputdata
Pages similar in structure to those of the sample set
Domenico Beneventano13
DB
Gro
up @
uni
mo
Set of terminological (intensional) relationships between terms (classes and attributes names) to express both intra-schema and inter-schema knowledge:
< Ti SYN Tj > Synonymy < Ti BT Tj > Broader Terms (BT) : hypernymy
< Ti NT Tj > (Narrower Terms - NT) hyponymy
< Ti RT Tj > (Related Terms - RT) positive association or holonomy
Common Thesaurus Generation:q lexicon derived relationships q schema derived relationshipsq designer supplied relationshipsq inferred relationships
Common ThesaurusCommon Thesaurus
Domenico Beneventano14
DB
Gro
up @
uni
mo
Schema-derived RelationshipsSchema-derived Relationships
automatic extraction of intra schema relationships from each schema separatelyo BT from IS-A Relationships
< CS.TEACHING BT CS.CLASS>
o BT from relational foreign keys among primary keys
< UNI.PERSON BT UNI.PROFESSOR>
o RT from object attributes
< CS.CLASS RT CS.RESEARCH>
o RT from relational foreign keys
< UNI.PROFESSOR RT UNI.DIVISION>
Domenico Beneventano15
DB
Gro
up @
uni
mo
NT
Hyponymy
Word Form Meaning (synset) teaching course … class education imparted in a series of lessons or class meetings
√ √
activities that impart knowledge
√
the profession of a teacher
Annotation and Lexicon-derived Annotation and Lexicon-derived RelationshipsRelationships
<UNI.COURSESYN CS.CLASS><CS.COURSE NT UNI.TEACHING>
Lexicon-derived relationships:Lexicon-derived relationships:
Lexicon relationships among terms of the sources can be derived on the basis of the semantic relationships between synset in WordNet (Hyponymy / Hypernymy, Meronymy, Correlation (between synsets having the same Hypernym), …) :
LOCAL SOURCE ANNOTATION : To assign meanings of class and attribute names w.r.t. a common lexical ontology (WordNet)
Domenico Beneventano16
DB
Gro
up @
uni
mo
Extending WordNet: WordNet EditorExtending WordNet: WordNet Editor
If a class or attribute name has no correspondent in WordNet, the designer may add a new meaning and proper relationships to the existing meanings.
The designer may add a new meaning (for an existing word-form or for a new one) by: writing the gloss explicitly, or using an existing synset chosen among a list of
candidates obtained by an explicit search (using one or more keywords) or by exploiting similarity search techniques.
The designer may add relationships for the new synset Related synsets are obtained by an explicit
search (using one or more keywords) or by exploiting similarity search techniques.
Domenico Beneventano17
DB
Gro
up @
uni
mo
Relationships derived as a logical consequence from a set of other Relationships. As a simple example, from
<Researcher RT Department> and < Researcher BT Professor>
we can derive that : <Professor RT Department>
Inferred RelationshipsInferred Relationships
The inferences are based on Description Logics techniques
Description Logics (DLs), describe knowledge in terms of concepts and their roles.
DLs provide a means of describing structured knowledge in a way that one can reason about it efficiently (i.e., infer new knowledge).
DLs are subsets of the First-Order Logic, providing decidability and computationally feasible reasoning at the cost of expressiveness.
ODB-Tools : DLs techniques for Object DataBase (www.dbgroup.unimo.it/ODB-Tools.html):
Domenico Beneventano18
DB
Gro
up @
uni
mo
Clustersgeneration
Affinityevaluation
Common Thesaurus
Clusters of Classes
Affinity evaluation among classes: Name Affinity : computed by exploiting
relationships in the Common Thesaurus; a weight (W) is assigned to each kind of relationships, with WSYN >= WBT >= WRT
Structural Affinity: measures the affinity of class attributes
Global Virtual View GenerationGlobal Virtual View GenerationAffinity evaluation and Clusters Affinity evaluation and Clusters
GenerationGeneration
Clusters generation : Classes with affinity are grouped together in clusters by employing a
hierarchical clustering technique.
ARTEMIS tools (Università di Milano) https://islab.dico.unimi.it/artemis/.
Domenico Beneventano19
DB
Gro
up @
uni
mo
Research_S
taff
Research_Staff
CS_Person
School_Member
Professor
Student
Univ_Student
...
CS
_Person
School_M
ember
Professor
Student
Univ_S
tudent
Section
Course ...
- 0.54 0.45 0.42 0.43
- - 0.6
0.68 0.68 0.56
- - - 0.45 0.65 0.6
- - - - 0.62 0.43
- - - - - 0.51
- - - - - -
0.36 0.39
0.2 0.2
0.2 0.2
0.25 0.25
0.25 0.25
0.2 0.2
0.5
Affinity MatrixAffinity Matrix
Domenico Beneventano20
DB
Gro
up @
uni
mo
Affinity Matrix
Clustering Algorithm
Clustering Algorithm: (Hierarchical clustering techniques - Heveritt
74)
Cluster GenerationCluster Generation
Section Course
Room
CS_Person ProfessorStudent
School_Member
Univ_Student
Research_Staff
Division Department
Location
CL2
CL1
CL3
CL5
CL40.66
0.39
0.39
0.375
0.25
0.6
0.540.6
0.65
0.68
0.68
Domenico Beneventano21
DB
Gro
up @
uni
mo
Common Thesaurus& Clusters
Global Virtual View
Mapping Table
generation
Global Classes generation :A global class C=(L,GA) is generated for each cluster : L are the local classes of the cluster GA are the global attributes of C
o Union of the local attributeso Fusion of “similar attributes” (by using the
Common Thesaurus)
Global Virtual View GenerationGlobal Virtual View GenerationGlobal Classes and Mapping Table Global Classes and Mapping Table
GenerationGeneration
Mapping generation : For each global class C=(L,GA), a Mapping Table (MT) is generated, to represent the mappings between global and local attributes MT is a table GAXL : An element MT[GA][L]
represents the attributes of the local class L mapped into the global attribute GA.
Global Classes and Attributes
generation
Domenico Beneventano22
DB
Gro
up @
uni
mo
GVV and MT generation : exampleGVV and MT generation : example
CS.Teaching UNI. Course CS.Class Name denomination name name Description description specification Year year year Period period Professor professor
C= {UNI.Course, CS.Teaching, CS.Class}C= {UNI.Course, CS.Teaching, CS.Class} Cluster
Mapping Table of C
MT generation : Since CS.denomination NT UNI.Name, these local attributes correspond to the same global attribute GA
GVV annotation : the name and the meaning of the class C corresponds to the
name and the meaning of CS.Teaching (the broadest class) the name and the meaning of GA corresponds to the name and
the meaning of CS.description
Domenico Beneventano23
DB
Gro
up @
uni
mo How to model the mappings between How to model the mappings between
the sources and the GVV?the sources and the GVV?
Is the GVV defined in terms of the sources? Approach global-schema-centric, or global-as-view (GAV)
Are the sources defined in terms of the GVV? Approach source-centric, or local-as-view (LAV)
A mixed approach? Approach called GLAV
GVV :
Teaching(Name,Description,Year,Period,Professor)
Sources :CS.Teaching(Denomination,Description)UNI.Course(Name,Year,Period)
CS.Class(Name,Specification,Year,Professor)
Domenico Beneventano24
DB
Gro
up @
uni
mo
Local-as-ViewLocal-as-View
Local-as-View (LAV) mappings: associated to source classes we have views over
the GVVGVV :
Teaching(Name,Description,Year,Period,Professor)
Sources :CS.Teaching(Denomination,Description)UNI.Course(Name,Year,Period)CS.Class(Name,Specification,Year,Professor)
LAV mappings :CS.Teaching --> SELECT Name,Description FROM
TeachingUNI.Course --> SELECT Name,Year,Period FROM
TeachingCS.Class --> SELECT Name,Description
Year,ProfessorFROM Teaching
Domenico Beneventano25
DB
Gro
up @
uni
mo
Global-as-ViewGlobal-as-View Global-as-View (LAV) mappings:
associated to classes in the GVV we have views over the sourcesGVV :
Teaching(Name,Description,Year,Period,Professor)
Sources :CS.Teaching(Denomination,Description)UNI.Course(Name,Year,Period)CS.Class(Name,Specification,Year,Professor)
GAV mappings :Teaching --> SELECT
f1(CS.Teaching.Denomination,…) AS Name, f2(…) As Description, …
FROM CS. Teaching FULL JOIN UNI.Course ON (…)
FULL JOIN CS.Class ON (…)
Domenico Beneventano26
DB
Gro
up @
uni
mo GAV and LAV comparison (from Lenzerini GAV and LAV comparison (from Lenzerini IJCAI 2003)IJCAI 2003)
Domenico Beneventano27
DB
Gro
up @
uni
mo
Global-as-View: the MOMIS approachGlobal-as-View: the MOMIS approach
• Global-as-View (GAV) mappings: for each global class C of the GVV we must define a conjunctive query QC over the local classes of C.
• GAV mappings: the MOMIS approachStarting from the Mapping Table of C, the integration designer, supported by the Ontology Builder graphical interface, can implicitly define QC by:
• using and extending the Mapping Table with• Data Conversion Functions from local to global
attributes• Join Conditions among pairs of local classes belonging
to C• Resolution Functions for global attributes to solve
data conflicts of local attribute values.
– using and extending the Full Disjunction operator
Domenico Beneventano28
DB
Gro
up @
uni
mo
Building the Mappings: an exampleBuilding the Mappings: an example
StringConc (REGION,STATE)
from T_L1 full join T_L2
Data Conversion Functions
Join Attribute
on (T_L1.COMPANY_ID = T_L2.COMPANY_ID)
Join Conditions
FullDisjunction
L1.company L2.company
COMPANY_ID COMPANY_ID COMPANY_ID
SUBCONTR SUBCONTR
CAPITAL_STOCK CAPITAL_STOCK
REGION REGION REGION, STATE
ADDRESS ADDRESS ADDRESS
...Select COMPANY_ID,
precedence(T_L1.ADDRESS, T_L2.ADRESS) as Address, T_L2.SUBCONTRACTOR, …
Resolution Functions
Precedence(L1,L2)
Domenico Beneventano29
DB
Gro
up @
uni
mo
Join Conditions and Full DisjunctionJoin Conditions and Full Disjunction• Object Identification: Merging data from different sources requires
different representations of the same real world object to be identified.
• Join Conditions: To identify instances of the same object and fuse them among pairs of local classes belonging to the same global class. As an example, for the global class Company the designer can define JC(L1.Company,L2.Company) :
L1.COMPANY_ID = L2.COMPANY_ID
• Full Disjunction (FD) “computing the natural outer-join of many relations preserving all possible connections among facts”
• Given a global class C composed of L1,L2, ..., Ln we consider
FD(T(L1), T(L2), . . . , T(Ln)) computed on the basis of the Join Conditions
where T(L) denotes L transformed by the Data Conversion Function, i.e., the full disjunction operator is applied after data conversion.
Domenico Beneventano30
DB
Gro
up @
uni
mo
FD(L1,L2) : select S(L1) union S(L2) from L1 outer join L2 on JC(L1,L2)
Full Disjunction: an exampleFull Disjunction: an example
COMPANY_ID ADDRESS CAP_STOCK
123 Via Ugo 55
4 Via Po, 2 65
COMPANY_ID ADDRESS SUBCONTR
4 Via Larga 3 Yes
234 Via Verdi 9 No
L1.COMPANY_ID L1.ADDRESS L1.CAP_STOCK L2.COMPANY_ID L2.ADDRESS L2.SUBCONTR
123 Via Ugo 55
4 Via Po, 2 65 4 Via Larga 3 Yes
234 Via Verdi 9 No
L1=L1.Company L2=L2.Company
where JC(L1,L2) = L1.COMPANY_ID = L2.COMPANY_ID
Domenico Beneventano31
DB
Gro
up @
uni
mo
Resolution FunctionsResolution Functions
• The fusion of data coming from different sources has to take into account the problem of inconsistent information among sources.
• In the context of MOMIS we adopt the Resolution Function. – A Resolution Function may be defined for each global
attribute mapping onto local attributes coming from several local sources, in order to solve data conflicts due to different local attribute values.
– Homogeneous Attributes : If there are no data conflicts for a global attribute mapped onto more than one source
– As an example, in GVV.Company, we define all the global attributes as Homogeneous Attributes except for Address where we used a precedence function:
L1.ADDRESS has a higher precedence than L2.ADDRESS
Domenico Beneventano32
DB
Gro
up @
uni
mo
Resolution FunctionsResolution Functions
COMPANY_ID ADDRESS CAP_STOCK SUBCONTR
123 Via Ugo 55
4 Via Po 2 65 Yes
234 Via Verdi No
L1.COMPANY_ID L1.ADDRESS L1.CAP_STOCK L2.COMPANY_ID L2.ADDRESS L2.SUBCONTR
123 Via Ugo 55
4 Via Po 2 65 4 Via Larga 3 Yes
234 Via Verdi No
Application of the resolution functions
Domenico Beneventano33
DB
Gro
up @
uni
mo
FD Computation with more that 2 classesFD Computation with more that 2 classes• [Rajarama, Ullman - PODS 1996] : There is a outerjoin sequence
producing FD if and only if the set of relation forms a connected, acyclic hypergraph.
A Global Class C with more than 2 local classes is a cyclic hypergraph we need a new method
Example with n = 3 :
L1
L2 L3
JC(L1,L3)JC(L1,L2)
JC(L2,L3)
• The computation of FD is performed assuming: 1. each L contains a key, 2. all the join conditions are on key attributes, 3. all the join attributes are mapped into the same set of global attributes (K).
Then, FD can be computed as: L1 full join L2 on JC(L1,L2))
full join L3 on (JC(L1,L3) OR JC(L2,L3)
Domenico Beneventano34
DB
Gro
up @
uni
mo
Global Query ManagementGlobal Query Management The querying problem:
How to answer queries expressed on the GVV (global queries)?
Query rewriting : to rewrite a global query as an equivalent set of queries expressed on the local schemata (local queries).
GAV approach: the query is processed by means of unfolding :
by expanding a query on a global class C of the GVV according to the definition of QC
Fusion and Reconcilation of the local answers into the global answer
Object Fusion & Resolution Functions
Inconsistencies between sources
Query Optimization
Domenico Beneventano35
DB
Gro
up @
uni
mo
Q: SELECT COMPANY_ID, REGION, ADDRESS FROM company WHERE CAPITAL_STOCK > 50 AND REGION LIKE '*VENETO*' AND SUBCONTR LIKE ’yes’
LQ1SELECT COMPANY_ID, REGION,ADDRESS, FROM L1.company WHERE (REGION like '*VENETO*' and CAPITAL_STOCK > 50)Local
queries
Query unfolding exampleQuery unfolding example
Global Query
Global Class: Company = { L1.Company, L2.Company}
LQ2SELECT COMPANY_ID, REGION,STATE,ADDRESS, FROM L2.company WHERE ( strconc(REGION,STATE) like '*VENETO*' and
SUBCONTR like 'yes')
Domenico Beneventano36
DB
Gro
up @
uni
mo
Query unfolding exampleQuery unfolding example
Full disjunction computation: FD : Select *
from T_LQ1 full outer join T_LQ2on (T_LQ1.COMPANY_ID = T_LQ2.COMPANY_ID)
Query Answer Select COMPANY_ID, REGION,
precedence(T_LQ1.ADDRESS, T_LQ2.ADDRESS) AS ADDRESSfrom RP
Residual predicate: RP: SELECT *
FROM FD WHERE CAPITAL_STOCK > 50 AND SUBCONTR LIKE ’yes’
Q: SELECT REGION, ADDRESS FROM company WHERE CAPITAL_STOCK > 50 AND REGION LIKE '*VENETO*' AND SUBCONTR LIKE ’yes’