SeMap: A Generic Schema MatchingSystem
by
Ting Wang
B.Sc., Zhejiang University, 2004
A THESIS SUBMITTED IN PARTIAL FULFILMENT OFTHE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
The Faculty of Graduate Studies
(Computer Science)
The University Of British Columbia
August, 2006
c© Ting Wang 2006
Abstract
The rapidly growing number of autonomous data sources on the web makesthe need of effective tools of creating semantic mappings increasingly crucial.Moreover, the goal of allowing applications to have more expressive seman-tics requires a change in focus. While most previous work focus on creatingmappings in specific data models for data transformation, they fail to cap-ture a richer set of possible relationships between schema elements. Forexample, current schema matching approaches might discover that ‘TA’ inone schema equals to ‘grad TA’ in another one, even though the relationshipcan be modeled more accurately by saying that ‘grad TA’ is a specializa-tion of ‘TA’. This increased semantics of the mapping in turn allows forapplications involving richer semantics.
In this thesis we concentrate on the following problem: given initialmatch (correspondence) information produced by current schema match-ing techniques, how to construct a complex, semantically richer mappingthat can be used across data models? Specifically, we aim at detecting therelationship types of ‘Has-a’, ‘Is-a’, ‘Associates’ and ‘Equivalent ’. Tech-nically, we achieve this goal in mainly three steps: (1) exploiting varioustypes of semantic evidence for possible matches; (2) finding a globally op-timal match assignment; (3) identifying the relationship embedded in theselected matches. We implemented our semantic matching approach withina prototype system SeMap, and tested its accuracy and effectiveness.
ii
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.1 Relationship Classification . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Equivalence Relationships . . . . . . . . . . . . . . . . 102.1.2 Set-Theoretic Relationships . . . . . . . . . . . . . . . 112.1.3 Generic Relationships . . . . . . . . . . . . . . . . . . 11
2.2 Schema Matching Techniques . . . . . . . . . . . . . . . . . . 122.2.1 Rule-Based Solutions . . . . . . . . . . . . . . . . . . . 132.2.2 Learning-Based Solutions . . . . . . . . . . . . . . . . 15
2.3 Ontology Alignment Techniques . . . . . . . . . . . . . . . . . 162.4 Sample Prototypes . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Rondo . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.4.2 Cupid . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.4.3 COMA . . . . . . . . . . . . . . . . . . . . . . . . . . 182.4.4 iMAP . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . 213.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 22
iii
Table of Contents
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 263.3 Semantic Resources . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 Internal Resources . . . . . . . . . . . . . . . . . . . . 273.3.2 External Resources . . . . . . . . . . . . . . . . . . . . 29
3.4 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . 303.4.1 Schema Matcher . . . . . . . . . . . . . . . . . . . . . 313.4.2 Match Selector . . . . . . . . . . . . . . . . . . . . . . 333.4.3 Mapping Assembler . . . . . . . . . . . . . . . . . . . 35
4 SeMap System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374.1 Schema Matcher . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.1 Base Matcher . . . . . . . . . . . . . . . . . . . . . . . 384.1.2 Similarity Score and Lineage Information . . . . . . . 394.1.3 Element-Level Matcher . . . . . . . . . . . . . . . . . 394.1.4 Structure-Level Matcher . . . . . . . . . . . . . . . . . 464.1.5 Architecture of Schema Matcher . . . . . . . . . . . . 48
4.2 Match Selector . . . . . . . . . . . . . . . . . . . . . . . . . . 504.2.1 Representation . . . . . . . . . . . . . . . . . . . . . . 514.2.2 Bidirectional search . . . . . . . . . . . . . . . . . . . 534.2.3 Modeling user interaction . . . . . . . . . . . . . . . . 55
4.3 Mapping Assembler . . . . . . . . . . . . . . . . . . . . . . . 574.3.1 Combining Maps and Mapt . . . . . . . . . . . . . . . 584.3.2 Identifying relationships . . . . . . . . . . . . . . . . . 584.3.3 Assembling mapping . . . . . . . . . . . . . . . . . . . 62
5 Experimental Analysis . . . . . . . . . . . . . . . . . . . . . . . 645.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . 65
5.1.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . 655.1.2 Expert Mapping . . . . . . . . . . . . . . . . . . . . . 665.1.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . 675.1.4 Experimental Methodology . . . . . . . . . . . . . . . 68
5.2 Experimental Result . . . . . . . . . . . . . . . . . . . . . . . 695.2.1 Matching Accuracy . . . . . . . . . . . . . . . . . . . . 705.2.2 Component Contribution . . . . . . . . . . . . . . . . 745.2.3 Incorporating User Feedback . . . . . . . . . . . . . . 745.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 76
6 Conclusion & Future Work . . . . . . . . . . . . . . . . . . . . 78
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
iv
List of Tables
5.1 Characteristics of the input schemas. . . . . . . . . . . . . . 665.2 Characteristics of the expert mappings. . . . . . . . . . . . . 67
v
List of Figures
1.1 An example of input schemas and output mapping. . . . . . . 2
2.1 A classification of current schema matching techniques. Cour-tesy of [22]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 Representation of model. The left plot shows a graphicalrepresentation of a model, comprised of nodes (elements) andedges (relationships). The right table shows the tuple repre-sentation of edges. . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Illustration of four relationship types handled by SeMap. . . . 253.3 An example of complex mapping handled by SeMap. . . . . 263.4 Illustration of the matching process. . . . . . . . . . . . . . . 263.5 The basic system architecture of SeMap. It takes two models
and external resources as input, and produces generic seman-tic mapping. It consists of three main parts: the schemamatcher, the match selector and the mapping assembler. . . 31
4.1 Architecture of schema matcher. It consists of three layers,base matcher, combining layer and structure matcher. . . . . 48
4.2 Partial match assignments from the perspectives of sourceand target schemas respectively. . . . . . . . . . . . . . . . . . 55
4.3 Mapping assembling for matches of different types. Each 1-1 equivalence match corresponds to one mapping element,while each element of complex match is associated with onemapping element. . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1 Matching accuracy of SeMap. The three plots show the recall,precision and F-measure of the matching results for the threerelationship types Equivalent, Has-a, Is-a and total correctmatches respectively. . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Error analysis of the resulting mappings. . . . . . . . . . . . 73
vi
List of Figures
5.3 The precision of SeMap after pruning incorrect matches. Thebars from left to right shows the matching results for the threerelationship types Equivalent, Has-a, Is-a respectively. . . . 73
5.4 Relative contribution of different types of semantic evidencesto the matching results of SeMap. The two plots (from up todown) show the F-measure of identified matches (correspon-dences) and identified relationships respectively. . . . . . . . 75
5.5 F-measure of correct correspondences versus the amount ofuser interaction (percentage of expert matches provided overthe total number of matches). The curves for four datasets(Real Estate 1/2, Course Info 1/2) are shown. . . . . . . . . 77
vii
Acknowledgements
I would like to express my gratitude to all those who have offered me help incompleting this thesis. Especially, I owe the greatest thanks to my supervisorRachel Pottinger, who provided me with excellent guidance and support inthe entire process of this thesis project. I want to thank Dr. Tsiknis forgiving me insightful comments on this work, and being my second reader.
I would like also to thank all the members at database management lab,especially Jian Xu for their constructive suggestions. Without their help,this work would not be possible. Finally, I thank all my friends at theUniversity of British Columbia. It has been a wonderful experience to growup with them.
viii
Chapter 1
Introduction
1.1 Motivation
Spurred by the growth of data sources on the web, information systems are
witnessing a paradigm shift from monolithic databases to heterogeneous,
interacting data sources. The fundamental problem in sharing data from
multiple sources is to deal with the semantic heterogeneity inherent in their
autonomous nature, and the key is to identifying the semantic correspon-
dences between them. The operation of finding such correspondences is
called Match, which takes two schemas as input and produces a semantic
mapping, specifying the relationships between elements of the two schemas.
Such semantic mappings play a crucial role in numerous data sharing ap-
plications, including web data integration, schema evolution and migration,
component-based development, etc.
Currently, the creation of semantic mappings, especially complex ones is
still mostly done manually, possibly supported by a graphical user interac-
tion interface. Manually creating semantic mappings is a tedious, error-
prone process. The labor-intensity grows linearly as the matches to be
performed. Hence the rapidly increasing number of web data sources ne-
cessitates automatic support for schema matching.
1
Chapter 1. Introduction
The problem of semi-automatically creating mappings has attracted on
intensive research in both the database and AI communities [2, 4, 10, 15, 28].
The procedure is comprised of two phases, schema matching and mapping
construction. In schema matching, equivalence correspondences between el-
ements of both schemas are identified. The equivalence correspondences can
be one-to-one (1-1) matches, e.g., ‘class’ corresponds to ‘course’, or complex
matches containing more than one element in each schema, e.g., ‘TA’ maps
to some combination of ‘grad TA’ and ‘ugrad TA’. Note that the focus of
schema matching is to find such potential correspondences, rather than giv-
ing a final mapping to the users. Finding this mapping is done in mapping
construction, where the identified correspondences are built on by adding
more specific semantic information to generate a semantically rich mapping.
professorfaculty
ugrad TA
dept
m2
(=)
m3
(=)
m4
(=)
m1
(=)
m5
(=)instructor
m8
(=)
m6
(=)
m7
(=)
m9
(=)
grad TA
college
class
schema S
Associates
schema T
course
Associates
Is-a
Is-a
Associates
Is-a
Map S T
TA
Has-a
Has-a
Has-a Has-a
Has-a
Has-a
Is-a
Figure 1.1: An example of input schemas and output mapping.
As a typical example of mapping construction, Clio [32] includes a set
of user-interaction techniques to create SQL-style mappings, based on the
output of an initial schema match. Such semantic mappings are necessary
2
Chapter 1. Introduction
to transform data. Clio however, like most other previous work on mapping
construction, is restricted to relational and XML-style schemas; it does not
capture the general richness of the possible relationships between elements
in a data-model-independent fashion. Thus, although many common rela-
tionship types exist across SQL and XML (e.g., specialization), this work
cannot be used to create the XML-style mapping.
Data sources on the web however are, of various data models, e.g., XML,
HTML, RDF, ontologies, text, etc. Hence exploring how to create richer,
general relationships between schema elements, rather than concentrating
on the specific data model under consideration, allows us to understand
the general space of the possibilities. It also allows better reuse of ideas,
since one does not have to create a separate algorithm for each ensuing data
model. After a mapping with such general relationships is constructed, the
transformations into a specific data model can be made more concretely. For
example, it can be easily transformed into specific forms, e.g., SQL views
or XSLT transformations, thus excluding the need of maintaining specific
mappings separately. Also, a generic mapping can create a uniform in-
terface between domain knowledge (ontologies) and web interface (database
schemas), which is helpful for semantic web applications. Furthermore it can
be fed into a model management system [17], which aims to solve meta-data
problems in a data model neutral fashion, or used for knowledge inference
when applied to ontology domain.
An example of a generic semantic mapping is shown in Figure 1.1, where
two schemas S and T represent the concepts of ‘class’ and ‘course’ respec-
tively. A generic mapping S T is constructed, specifying a rich collection of
3
Chapter 1. Introduction
semantic relationships between the elements of S and T , e.g., ‘college’ of T
‘Has-A’ ‘dept’ of S, while ‘instructor’ of S ‘Is-A’ ‘faculty’ of T . The relation-
ship types adopted in this thesis follow the relationship classification of [21].
Compared with the equivalence relationships (1-1 or complex) considered in
previous literature, this relationship classification is semantically richer and
more expressive. Equipped with such generic mappings, one can envision a
number of applications. For example, one problem facing current semantic
web applications is the lack of domain specific knowledge (e.g., ontologies).
If domain knowledge in different representations can be mutually converted,
the collection of knowledge will be significantly enriched.
1.2 Contribution
In this thesis we explore constructing such generic semantic mappings, based
on initial match information that shows correspondences between the ele-
ments of both schemas. This initial match information can be produced by
current schema matching techniques.
Mapping construction takes as input a set of initial matches produced
by a set of schema matching algorithms, and generates a semantically richer
mapping, such as the one in Figure 1.1, which describes complex relation-
ships between elements of both schemas. Specifically, mapping construction
is responsible for searching for a global optimal match assignment from
the pool of possible assignments, solving the conflicts among the selected
matches, and identifying the complex relationships between the schema el-
ements, e.g., the ‘Has-A’ relationships in Figure 1.1. However, constructing
4
Chapter 1. Introduction
a generic semantic mapping is fundamentally difficult for several reasons:
• Finding correspondences with generic semantic relationships is sub-
stantially harder than simple equivalence, since the space of possibil-
ity under consideration is much larger, and more semantic evidence is
needed;
• The pool of initial matches is possibly quite large. This search space
is large enough in considering n:1 equivalence matches to make most
matching algorithms only consider 1:1 matches, but when relationships
other than simple equivalence are considered, it is infeasible to try all
possible combinations to find the optimal assignment;
• Various semantic constraints can be imposed, rendering match selec-
tion a complicated constrained optimization problem;
• Identifying the relationships implicit in matches is a hard problem,
and one that is made more difficult by attempting to make our output
data model independent.
As in schema matching, mapping construction inherently can not be
fully automatic. The importance of user feedback is recognized in schema
matching research [4, 31], however no systematic modeling of user interaction
for mapping construction is available to date. One of the goals of our work
is to limit interaction to critical points to help focus user attention and
minimize user effort.
Aiming at overcoming the problems listed above, in this thesis we de-
scribe a prototype system SeMap to create a generic, semantic mapping. We
5
Chapter 1. Introduction
choose a graph-based representation that is similar to that used in model
management [17], which is expressive enough to accommodate both schemas
of many types and other meta-data, such as ontologies. Specifically, we make
the following contributions:
• An architecture for semi-automatically constructing generic semantic
mappings based on initial correspondence information;
• A novel probabilistic framework that incorporates match uncertainty
and semantic constraints in a uniform way, and expresses match selec-
tion to a mathematical optimization problem;
• Effective modeling of user interaction to help focus user attention and
minimize user effort, by detecting critical points where feedback is
maximally useful;
• Effective solution to extracting implicit relationship of initial match
based on various types of semantic evidences;
• A prototype system embodying the innovations above and a set of
experiments to illustrate the correctness and effectiveness of our ap-
proach.
1.3 Organization
This thesis is a specification of our schema matching system SeMap. The
goal is to present the technical details in implementing the system. Specifi-
cally, we intend to make clear mainly the following three aspects:
6
Chapter 1. Introduction
1. The formulation of the problem, including the exact representation of
the input/output of the system, the resources we use and the assump-
tions we have made;
2. The specification of the system, including the system architecture, the
exact input/output and interior structure of each component and their
interaction;
3. The experimental analysis, including the dataset we can use, the met-
ric we use to evaluate our approach, the experimental result and its
explanation.
The remainder of the thesis will be organized as follows: Chapter 2 presents a
survey of related work; Chapter 3 formally defines the problem of mapping
construction and gives an overview of the architecture of our system. In
Chapter 4, we describe our mapping construction approach in more details;
Chapter 5 presents the experimental analysis of our approach; and Chapter 6
concludes this thesis and presents future work.
7
Chapter 2
Related Work
Semi-automatically creating semantic mappings has attracted upon inten-
sive research in both the database (schema matching) and AI (ontology
alignment) communities. The key differences and similarities of schema
matching and ontology alignment include:
• Differences. Ontologies are logical systems, which obey some formal
semantics, i.e., they can be interpreted as a set of logical axioms;
however database schemas often provide no explicit semantics for their
data.
• Similarities. Schemas and ontologies are quite similar in the sense that
(1) they both provide a vocabulary of terms that describe a domain
of interest and (2) they both constrain the meaning of terms used in
the vocabulary [30].
Due to their differences, schema matching is usually performed with the
techniques to guess the semantics implicit in the schemas, while ontology
alignment is designed to exploit the knowledge explicitly encoded in the
ontologies. Their similarities however make the solutions from these two
problems mutually beneficial. Following, we will discuss the problems of
schema matching and ontology alignment as a whole.
8
Chapter 2. Related Work
In this chapter, we present a survey of related work in three parts: first
in Section 2.1 we classify the current schema matching/ontology alignment
techniques based on the relationships they can handle; we then discuss some
typical techniques used in these approaches, specifically, schema matching in
Section 2.2 and ontology alignment in Section 2.3; finally, we present several
example prototype matching systems in Section 2.4.
2.1 Relationship Classification
The relationship types created by matching techniques can be roughly di-
vided into three categories: equivalent relationship, set-theoretic relation-
ships and generic relationships. Specifically, two schema elements having
the equivalent relationship means they are semantically equivalent, and the
techniques to identify equivalent relationship is described in Section 2.1.1;
the set-theoretic relationship classification regards each schema element as a
set, and specifies their relationship as one of equivalence, subsumption, inter-
section, disjointness and incompatibility, which is discussed in Section 2.1.2;
the generic relationships refer to those non-equivalent relationships, such
as Has-a and Is-a relationships discussed in this thesis. Two typical clas-
sification of generic relationships can be found in ontology modelling [18]
and meta-data management [21]. The techniques developed so far to handle
generic relationships is presented in Section 2.1.3.
9
Chapter 2. Related Work
2.1.1 Equivalence Relationships
With the main goal of data transformation in specific data models, most
schema matching/ontology alignment algorithms to date aim at discover-
ing the equivalence relationships [2, 3, 4, 10, 13, 14, 16, 31]. The found
equivalence correspondence can be the case of a 1-to-1 match (e.g., ‘course’
= ‘class’), or a complex match (e.g., ‘name’ = concat(‘first-name’ +‘last-
name’)).
The complexity of creating multi-arity (1-to-n or even n-to-m) matches
is significantly harder than 1-to-1 matches for several reasons: (1) while the
number of candidate matches is bounded for 1-to-1 match (the product of
the sizes of two schemas), the number of match candidates to be considered
in complex case is much larger. (2) it is inherently difficult to generate a
match to start with in the case of multi-arity matches. That is in the case
of n-to-m match, it is difficult to determine n and m in order to generate a
set of candidate matches. Hence to date most the work on schema matching
has been focused on discovering 1-to-1 equivalence correspondences between
schema elements [3, 4, 10, 13, 14, 16, 31]. R. Dhamankar et al. [2] proposed
iMAP, a prototype of identifying 1-to-n correspondence matches, which re-
formulates schema matching as a search in an often very large match space.
To search effectively, it employs a set of searchers, each discovering specific
types of complex matches.
However, while attempting to discover semantically equivalent corre-
spondences, it is possible that the matches identified by these techniques
may not be exactly of equivalence relationships; they may instead be the
10
Chapter 2. Related Work
semantically richer relationships we are endeavoring to find, such as the re-
lationship between ‘TA’ and {‘grad TA’, ‘ugrad TA’} as shown in Figure 1.1.
2.1.2 Set-Theoretic Relationships
The equivalence relationship can be considered as a special case of the set-
theoretic relationships, which can specify the relative containment relation-
ship between two sets. In [26], an effective solution is proposed to identify
inter-set relationships by bidirectionally comparing the containment of data
instances and meta-data, of different schema elements. The problem with
this approach is that the data instances associated with the two schemas
should be in the same universe, otherwise the comparison of containment
relationship is not meaningful. However, in many applications, especially
web data integration, the data sources do not overlap.
2.1.3 Generic Relationships
There have been very few works on finding generic relationships between
schema elements. The solution proposed by D. Embley et al [5] relies heav-
ily on a domain-specific ontology to find the relationships of Merge/Split
(e.g., ‘Address’ consists of ‘Street’, ‘City’ and ‘State’), Superset/Subset (e.g.,
‘Phone’ contains both ‘Phone day’ and ‘Phone evening’), and Set-Name as
Value (e.g., the attribute ‘Water-front’ in one schema appears as a value of
the attribute ‘House-description’ in the other schema).
The basic idea is to first map the schema elements to a comprehensive
domain-specific ontology, and the relationships between schema elements
can then be determined by that of their counterparts in the ontology. This
11
Chapter 2. Related Work
approach requires (1) a comprehensive ontology that covers all possible con-
cepts that may appear in schemas in that domain; (2) a domain-specific the-
saurus that can map schema elements to their alternative representations in
the ontology. Such ontology and thesaurus are usually hard to obtain in real
scenarios. Our work has fairly simple requirement for the needed semantic
information, available in most schemas, and does not assume any compre-
hensive ontology. Nevertheless the existence of such ontology can improve
the quality of the matching results of our system.
F. Giunchiglia et al. [7] proposed the concept of semantic matching,
a pure schema-based approach. The basic idea is to first populate each
element name with their meanings in some domain-specific dictionaries, and
computes the specialization relationship of schema elements based on the
containment relationship of their meanings. Their approach however works
only for identifying Is-a relationship and tree-structured schemas.
2.2 Schema Matching Techniques
The research on schema matching/ontology alignment provides a wealth of
techniques to semiautomatically find semantic matches. The techniques can
be classified by the information they exploit [22] as shown in Figure 2.1: the
matches can be found by exploiting one type of semantic evidence (schema-
level, data instance-level, etc), or combining multiple types of evidences (i.e.,
hybrid matchers, which integrate multiple matching criteria, and composite
matchers, which combine results of independently executed matchers [22]).
Matching techniques can also be classified by their methodologies into rule-
12
Chapter 2. Related Work
based and learning-based solutions, which will be discussed in Sections 2.2.1
and 2.2.2 respectively.
schema matching techniques
individual matcher combining matcher
hybrid matcherschema-level combiningindividual matcher
instance-level
element-level structure-level element-level
user feedback
iterative matcher selection
result combination
manually: automatic:
· name similarity · graph matching · value patterns
· data distribution
· frequent term
. . .
. . .
. . .
· type similarity
Figure 2.1: A classification of current schema matching techniques. Cour-tesy of [22].
2.2.1 Rule-Based Solutions
Rule-based matching [7, 14, 16] techniques constitute a wealthy collection
of schema matching solutions, which have been used in both early and cur-
rent matching applications. Rule-based techniques discover similar schema
elements by exploiting schema-level information using hand-crafted rules. A
broad variety of rules have been devised to exploit all possible information,
including element name (label), data types, structures, number of subele-
ments, and integrity constraints. For example, F. Giunchiglia et al. [7]
proposed to exploit the semantic meanings of element names to discover
similar elements; Cupid [14] employs rules that categorize elements based
on names, data types and domains; Similarity flooding measures pairwise
similarity by propagating similarity from some fixed points according to the
schema structures.
13
Chapter 2. Related Work
The rule-based techniques have some desirable features: (1) they are
usually inexpensive in computation and require no training process as in
learning-based approaches; (2) they usually require only schema-level infor-
mation, which is available in most matching scenarios; (3) if some domain
knowledge is available, one can specify domain-specific rules, which can work
very well in certain types of applications. For example, users can write reg-
ular expressions that encode times or phone numbers, or quickly compile a
collection of zip codes to help recognize these types of entities. The learning
methods however can hardly deal with these scenarios. They either can not
learn some complex rules, or require a large amount of training data with
the correct representation for desired result, which is usually hard to obtain.
However the rule-based techniques have several drawbacks: (1) they can
not effectively exploit data-instance level information, even though the data
instances provide valuable information, e.g., precise data format, data dis-
tribution, statistical values, etc. It is possible in some cases that the schema-
level information is opaque or very difficult to interpret, e.g., the element
names like A or B1 are too abstract to be interpreted. In contrast, learn-
ing methods such as Naive Bayes can easily construct some probabilistic
rules that find similarity in such scenarios, based on the distribution of data
instances [11]; (2) moreover, rule-based techniques can not exploit previ-
ous matching results to improve the current matching process. Hence in a
matching application for a specific domain, the rule-based techniques are
usually insufficient.
14
Chapter 2. Related Work
2.2.2 Learning-Based Solutions
Motivated by the drawbacks of rule-based matching methods, a collection of
learning-based solutions have been proposed: these methods have considered
a variety of learning techniques, and exploited both schema-level and data
instance level information. For example, Doan et al. proposed the LSD
system, which employs Naive Bayes learning method over data instances,
and also exploited the structure information of XML data format; The iMAP
system [2] pays attention to the description of elements, in addition to other
schema information.
In developing learning techniques for schema matching, it has been real-
ized that considering only schema-level or data instance-level evidence in the
schemas being matched is often insufficient for a purpose of more accurate
matching. Hence, several types of external resources have been considered
to improve the matching quality. For example, assuming a domain-specific
ontology is available, one technique is to first maps schemas/ontologies into
the ontology, then constructing the matches based on the relationships inher-
ent in that ontology [5]. For example, it is hard to identify the relationship
between ‘direct’ and ‘free toll’ by using regular approaches such as string
comparison. However, by mapping them to a domain-specific ontology, one
can find that they are both specializations of the concept ‘phone’, so that it
can be concluded that ‘direct’ is highly similar to ‘free toll’.
Some recent works advocate exploiting past matching results to improve
current ones [3, 4], with the basic idea of learning from past matches to pre-
dict unseen matching scenarios. An alternative solution considers learning
15
Chapter 2. Related Work
from a corpus of schemas and matches [14]. Such corpus provides alternative
representations of concepts in the domain, i.e., functions in the same way
as ontology, thus can be leveraged to discover similarity between schema el-
ements. However, it is not always practical to have such external resources,
particularly since such these resources must be domain-specific to be effec-
tive.
2.3 Ontology Alignment Techniques
Ontology alignment deals with finding corresponding concepts in different
ontologies. In this section, we present some typical work on ontology align-
ment, and a comprehensive survey is referred to [10].
OntoMorph [1] focused on the problem of translation of symbolically
represented knowledge between different knowledge representations. It used
a description logic based approach, and offers syntactic rewriting to support
the translation between two different knowledge languages, and semantic
rewriting to support inference-based transformation. OntoMorph requires
users to provide transformation rules, thus can be regarded as one type of
rule-based technique.
Prompt [20] proposed an ontology alignment mechanism that finds cor-
responding concepts by refining an initial mapping (pairs of anchors) given
by users or some simple linguistic matching approaches. Specifically, it ana-
lyzes the paths in sub-graphs limited by the anchors and determines which
concepts frequently appear in similar positions on similar paths. The phi-
losophy followed by Prompt is similar to that of Similarity Flooding [16].
16
Chapter 2. Related Work
FCA-Merge [28] is an example of alignment technique depending on
external resource. The resources used in FCA-Merge are domain-specific
documents, which cover the concepts in the ontologies. Through natural
language analysis techniques, it generates a formal context for each docu-
ment, which tells which documents contain which concepts. Based on these
formal contexts, the Is-a relationships between concepts are inferred. How-
ever since the formal context is built upon the generalization/specialization
hierarchy of the concepts, this approach could not be extended to other re-
lationships, such as Has-a. Moreover, the requirement of domain-specific
documents is not always feasible.
MAFRA [15] proposed a framework for sharing distributed ontologies via
mapping. A multi-strategy process is employed to calculate the similarities
between ontology entities, including lexical similarity, property similarity
(attributes or relations). Both top-down and bottom-up similarity propaga-
tions are employed. This can be considered as a counterpart of the hybrid
matching techniques in schema matching.
To our best knowledge, though the ontology itself can have complex re-
lationships, e.g., Has-a or Is-a, the focus of most previous work on ontology
alignment is finding semantically equivalent concepts or one specific type
of relationships, e.g., Is-a in FCA-Merge, in different ontologies [10], rather
than discovering corresponding concepts with more types of generic rela-
tionships, and the rich relationships in ontologies are used only as one type
of semantic evidence.
17
Chapter 2. Related Work
2.4 Sample Prototypes
In this section, we consider some recent prototype of schema matching sys-
tems.
2.4.1 Rondo
Rondo [17] is a complete prototype of generic model-management system,
in which high-level operators are used to manipulate models and mappings
between models. As one of its main operators, match is implemented using
the Similarity Flooding (SF) algorithm [16]. SF utilizes a hybrid matching
approach based on the idea of similarity propagation. It starts from a string-
based comparison, e.g., common prefixes, suffixes, of the schema elements
names to get an initial mapping, which is further refined using a fix-point
computation. The matching process is well formulated as a mathematical
optimization problem in SF.
2.4.2 Cupid
Cupid [14] implements a hybrid matching algorithm that analyzes syntactic
information at elements (e.g., string prefixes, suffixes), and structure infor-
mation of schemas (e.g., tree matching weighted by leaves). Moreover, it
exploits external resources, i.e., a pre-compiled thesaurus.
2.4.3 COMA
COMA [3] is a composite schema matching system. It provides a matcher
library composed of different matching algorithms. Its framework allows
18
Chapter 2. Related Work
the combination of partial results. The matcher library can be extended
by adding new matching algorithms. Specifically, it contains 6 elementary
matchers, 5 hybrid matchers and one reuse-oriented matcher. Compared
with Cupid, this reuse-oriented matcher is a novel algorithm, which tries to
leverage previously obtained results for new schemas.
2.4.4 iMAP
iMAP [2] is a matching system that considers 1-to-n equivalence matches.
The authors regard the problem of matching as a search in a usually infi-
nite match space. The overall goal is achieved in three steps: (1) a set of
basic matchers, called searchers are employed to detect similar elements ac-
cording to different criteria (e.g., linguistic similarity, numerical equivalence,
etc). Specifically, for each element in the target schema, a set of similar el-
ements are found in the source schema by the searchers, including 1-to-1
and n-to-1 matches. (2) the match candidates generated in the first step are
evaluated by a similarity evaluator module in this step, and the result is a
similarity matrix which indicates the similarity between the target element
and different match candidates. (3) a match selector module selects the
best match candidate as the final result. iMAP also provides a explanation
module which can provide explanation for the generated matches, e.g., the
reason the match is selected, and the implicit equivalence relationship, etc.
To the best of our knowledge, most previous work on schema matching focus
on one-to-one equivalence relationships in finding semantic correspondences
between two schemas. Little work is done in identifying multiple types of
19
Chapter 2. Related Work
complex relationships. In the following chapters, we present SeMap, a pro-
totype schema matching system which is designed to find generic semantic
correspondences.
20
Chapter 3
Problem Formulation
As discussed in the related work (Chapter 2), most work on schema match-
ing so far focuses on finding one-to-one equivalence relationships between
schema elements. The overall goal of our schema matching system, , is to
identify generic semantic mapping between two schemas. And “generic se-
mantic mapping” means (1) the matches may be non one-to-one, e.g., one
element is mapped to multiple elements of the other schema, a.k.a, 1-to-n
matches; (2) the relationship types may be non-equivalence, e.g., Has-a,
Is-a, etc, as classified in Vanilla meta-meta model [21].
An example of a generic semantic mapping is shown in Figure 1.1, where
two schemas represent the concept of ‘class’/‘course’ in different ways. The
mapping contains complex correspondences, such as ‘TA’ of schema S is
mapped to ‘undergrad TA’ and ‘grad TA’ of schema T. Instead of the equiv-
alence relation considered in most schema matching approaches, the rela-
tionship types involved are also complex, e.g., the ‘department’ of schema S
is considered as a member of the ‘college’ of schema T.
21
Chapter 3. Problem Formulation
grad TA
faculty
ugrad TA
college
Associates
course
Associates
Has-aHas-a
model
(a)
s p o
‘course’ Associates ‘course’‘course’ Associates ‘faculty’‘course’ Has-a ‘ugrad TA’‘course’ Has-a ‘grad TA’
(b)
Figure 3.1: Representation of model. The left plot shows a graphical rep-resentation of a model, comprised of nodes (elements) and edges (relation-ships). The right table shows the tuple representation of edges.
3.1 Representation
In this thesis we consider how to form a generic semantic mapping. Because
we are attempting to solve this problem in a data model neutral fashion that
could be applied equally well to relational or XML schemas or an ontology,
we adopt the terminology from Model Management [17], and say that we
take as input two models1.
A model is a complex design artifact, such as a relational schema, XML
schema, XML DTD, or an ontology, etc. Technically, a model can be rep-
resented as a directed labelled graph (V, E). Specifically, V is the set of
nodes, each denoting an element of the schema, e.g., attributes in relational
database table, type definition in XML schemas, clauses of SQL statement,
etc. E is the set of binary, directed typed edges over V . Formally, each
edge is a tuple < s, p, o >, where s is the source node, p is the type of edge,
and o the target node2, and p denotes the relationship between s and o. An1In what follows, we will use schema and model exchangeably.2The notation < s, p, o > follows the notation of <subject, predict, object> in ontolo-
gies.
22
Chapter 3. Problem Formulation
example of model representation is depicted in Figure 3.1, which illustrates
the concept of ‘course’.
As indicated in [19, 24], in addition to Equivalent relationship, the con-
cepts of generalization/specialization and part-of/whole have been long rec-
ognized as ubiquitous and essential mechanisms in object-oriented modeling
techniques, which have a large scope of applications, such as CAD, man-
ufacturing, software development and computer graphics. In this thesis,
we follow the relationship classification in Vanilla meta-meta model [21],
which embeds the concepts of generalization/specialization and part-of/whole.
Specifically, in the Vanilla meta-meta model, there are five relationship
types, namely Associates, Is-a, Has-a, Contains, and Type-of, In this the-
sis, we concentrate on the first four Associates, Is-a, Has-a and Contains,
where Is-a represents the concept of generalization/specialization, Contains
and Has-a represent the concept of part/whole, and Associates represents
all other weak semantic relationships.
Strictly speaking, though both Has-a and Contains embed the concept
of part-of/whole, they are different in semantics. As indicated in [19], part-
of relationships can be categorized in two dimensions, that is (1) the degree
of sharing of parts among whole objects and (2) the degree of dependence
between some part objects and some whole object(s). Contains and Has-a
are different in the second dimension in that part objects are highly depen-
dent on whole object(s) in Contains, while this dependence is not so strong
in Has-a. This difference brings the rule that in a Contains relationship,
the containee is a part of its container element, and cannot exist on its
own (delete propagation). Moreover, Contains is a transitive relationship
23
Chapter 3. Problem Formulation
and must be acyclic; while Has-a is weaker than Contains in that it does
not propagate deletion and can be cyclic. Since we focus on the high-level
part-of/whole relationship, we treat Has-a and Contains as the same in our
framework.
In addition, we also consider the equivalence relationship, which is the
main focus of previous schema matching approaches. So totally, in our
framework, we consider four relationship types: Equivalent, Has-a, Is-a,
and Associates. Their formal definitions are specified as follows, and their
graphical representation is shown in Figure 3.2:
• Equivalent : E(x, y) means that x is equivalent to y semantically. This
is a symmetric relationship type, i.e., E(x, y) ⇔ E(y, x);
• Has-a: H(x, y) indicates that x has a sub-component/member of y.
This is an asymmetric relationship, i.e., H(x, y) can not infer H(y, x);
• Is-a: I(x, y) means that x is a specialization of y. This is an asym-
metric relationship;
• Associates: A(x, y) indicates that x is associated with y. It is the
weakest relationship that can be expressed. It has no constraints or
special semantics. This is a symmetric relationship type.
This representation is complex enough to capture many of the semantic
relationships that appear in models, and yet is simple enough for a reason-
able initial foray into the problem.
A mapping , MapS−T is a formal description of the semantic relationships
between two schemas, S and T . A mapping itself is a model consisting of a
24
Chapter 3. Problem Formulation
college facultyclassclass
course dept professordept
Associates Equivalent Has-a Is-a
Figure 3.2: Illustration of four relationship types handled by SeMap.
set of mapping elements E , and a set of relationships R on E .
The elements of the two schemas are related through the mapping ele-
ments. Each mapping element e ∈ E is like any other element in schemas S
and T . In addition to being the origin or destination of any kind of relation-
ship found in a model, i.e., R, each e ∈ E can be the origin of one or more
mapping relationships, M(e, s), where s ∈ S ∪ T , which specifies that the
origin element e, corresponds to the destination element s. The semantics
of a mapping relationship is such that for all s1, s2 ∈ S ∪ T s.t. M(e, s1)
and M(e, s2), s1 = s2, and s1 corresponds to s2.
Given this rich mapping structure, the generic semantic relationship,
not just simple correspondences between the elements of S and T can be
expressed in this way: two semantically equivalent elements is represented
by one mapping element; while the relationship of two mapping elements
indicate that between their corresponding schema elements. For example,
in Figure 3.3, the mapping element m1 corresponds to the elements ‘class’
and ‘course’ representing the same concept; the relationship between m4 and
m5 indicates ‘instructor’ ‘is-a’ ‘faculty’.
25
Chapter 3. Problem Formulation
professorfaculty
ugrad TA
dept
m2
(=)
m3
(=)
m4
(=)
m1
(=)
m5
(=)instructor
m8
(=)
m6
(=)
m7
(=)
m9
(=)
grad TA
college
class
schema S
Associates
schema T
course
Associates
Is-a
Is-a
Associates
Is-a
Map S T
TA
Has-a
Has-a
Has-a Has-a
Has-a
Has-a
Is-a
Figure 3.3: An example of complex mapping handled by SeMap.
3.2 Problem Statement
Given the definition of model and mapping, we are now ready to formally
define the goal of SeMap: given two models, S and T , find generic semantic
relationships required to create the mapping S T between S and T .
There may be some optional inputs to the matching process, specifically
(1) an initial mapping S T ′ which provides an initial set of correspondences,
and needs to be refined by the process; (2) external semantic resources r
used by the matching process, e.g., domain-specific thesauri, ontologies, etc.
The matching process is illustrated in Figure 3.4.
Schema T
resource r
Matching Mapping S Tinitial mapping S T′
Schema S
Figure 3.4: Illustration of the matching process.
26
Chapter 3. Problem Formulation
3.3 Semantic Resources
The semantic resources used by matching techniques can be categorized as
internal resources, which is contained in the input schemas or associated
data instances, and external resources, which is the semantic information
not presented in the schemas or data instances.
3.3.1 Internal Resources
The semantic resources of the input schemas include both element-level in-
formation, which refers to the information stored at each schema element
(e.g., element name, data type, structure, etc) and structure-level informa-
tion, which refers to the information contained in the relationships between
schema elements (e.g., relationship type, constraints, etc). In Section 3.3.1.1
and 3.3.1.2, we introduce the element-level and structure-level resources con-
sidered in our SeMap system respectively.
3.3.1.1 Element-Level Information
We consider the following element-level information:
• Element name (label). Each element name is of String type. The name
(label) provides a first layer semantic evidence of the possible meaning
of this schema element.
• Element type. If an element contains data, it is usually associated
with a type indicating the storing format of the data. Note that in
many representations, the data type of an model element is consid-
ered as a separate element, which is linked to the element itself by a
27
Chapter 3. Problem Formulation
Type-of relationship. In our system, we consider data type as an at-
tribute of the model element, e.g., String is an attribute of the element
‘professor’, rather a separate element. The element type can provide
hints in the sense that similar schema elements usually have the same
or compatible data types.
• Element description. It is a short description of the semantic mean-
ing of the element, which usually contains more information than the
element name only. For web interface where only schema-level infor-
mation is available, the element description is especially valuable in
determining the exact semantics of the elements. For example, it is
hard to tell the semantics of an element only by its name ‘people’, in
a flight ticket booking website. However, with the help of its descrip-
tion of ‘total passengers’, one can conclude that ‘people’ stands for the
overall number of tickets bought.
• Data instances. As discussed in Chapter 2, data instances can provide
valuable information that could not be found in schemas, e.g., precise
data format, data distribution, statistical values, etc. Specifically, the
data type of an element may not be exactly how its data is stored,
which can only be found in data instances. For example, the element
‘phone’ may be of an Integer type. However, if looking at its data
instances, one may notice that its exact format is of ‘xxx-xxx-xxxx’,
which is not reflected in its data type. Meanwhile, the distribution
of data instances is also useful in identifying similar schema elements,
especially when the element names are obscure, e.g., A1 and B2 [11].
28
Chapter 3. Problem Formulation
3.3.1.2 Structure-Level Information
In addition to the element-level information discussed above, we also con-
sider structure-level information. In our system SeMap, we mainly consider
two types of structure-level evidence:
• Relationship Type. Each edge between two schema elements is of
certain relationship, which can be leveraged in matching process. The
basic intuition is that if two elements are semantically similar, the
elements having the same relationship with them are also highly likely
to be semantically related.
• Constraints. Each edge can have constraints, including (1) cardinality
in relational database table, e.g., 1-n, 1-1, etc, (2) key properties of
elements, e.g., unique, primary, etc.
3.3.2 External Resources
Previous work on matching techniques has shown that internal semantic evi-
dence is usually insufficient for achieving high quality matching results; some
additional external resources should be leveraged to improve the matching
quality.
In SeMap, we consider two types of external resources:
• Thesaurus. It is a dictionary which provides the different represen-
tations of the same concept. Hence the element names can be first
populated with their synonyms, so that one has a better chance to
find similar elements. Specifically, SeMap uses WordNet as the the-
saurus. WordNet is a comprehensive English lexical reference system,
29
Chapter 3. Problem Formulation
which organize more than 60000 nouns, 11000 verbs, 6000 adjectives
and 3000 adverbs into synonym sets (synsets). It is considered one of
the most powerful tools for computational linguistics, and has been
used in several matching applications [7].
• Ontology. Ontologies, especially domain-specific ontologies are pow-
erful tools in discovering similar elements, even in identifying their
implicit relationships. However they are not always obtainable. The
collection of ontologies we employed in SeMap is provided from Onto-
Builder [6].
3.4 Approach Overview
In this section, we present an overview of our generic matching system
SeMap. As an implementation of the match operator, SeMap takes as in-
put two schemas (models) S and T , and produces their generic semantic
mapping S T . In addition, SeMap has additional input of external semantic
resource r.
In order to identify the generic semantic relationships between schema
elements, SeMap not only has to identify the correspondences of complex
relationships, but also extract the implicit relationship types. Figure 3.5
shows the basic architecture of this mapping construction system. SeMap
implements this goal mainly in three phases: In the first phase, schema
matching , the candidate matches (correspondences) of generic semantic re-
lationships are identified. Note that most previous work focus on finding
correspondences of Equivalent relationships, while in our work we also have
30
Chapter 3. Problem Formulation
similarityscore
schemassource candidate
matches
Mapt
lineage information
Maps
genericmapping
domain knowledge
matcherschema match mapping
assemblerselector
Figure 3.5: The basic system architecture of SeMap. It takes two modelsand external resources as input, and produces generic semantic mapping. Itconsists of three main parts: the schema matcher, the match selector andthe mapping assembler.
to consider the correspondences of other relationships, which significantly
increases the difficulties; In the second step, match selection, a subset of can-
didate matches are selected to form the complete mapping. In this phase,
we develop a novel probabilistic framework that incorporates both match
uncertainty and domain constraints, and implements match selection as a
constrained optimization problem; Finally, the implicit relationship types of
the selected matches are determined in the third phase, mapping assembling.
To the best of our knowledge, most previous work focuses either on finding
Equivalent relationships or one specific type of relationship, i.e., Is-a. Not
much work has been done on the problem of exploiting different types of
semantic information to identify the generic semantic relationship,
3.4.1 Schema Matcher
The schema matcher takes as input two schemas S and T , and generates
a set of initial matches showing the correspondence between the elements
31
Chapter 3. Problem Formulation
of both schemas, i.e., it constructs the input to the mapping construction
problem. For each initial match, it also produces its similarity score indi-
cating its uncertainty, and lineage information recording how it is identified;
this lineage information is retained to help in mapping construction. While
most matching techniques developed so far support the estimation of simi-
larity score, recording the lineage information for each match is novel, which
is designed specifically to support identifying generic semantic relationships
in the mapping assembler. More detailted discussion of schema matcher is
referred to Section 4.1.
Functionality: The schema matcher consists of a set of base matchers,
as in many composite approaches (see Chapter 2). A base matcher is an
algorithm that looks at some aspect of the model/schema, and generates a
series of candidate matches. Each candidate match shows a 1-to-1 or 1-to-n
correspondence. The schema matcher is responsible for ensuring that (1)
the two schemas/models are input to each basic matcher (where applicable)
and (2) that the output results from each basic matcher are combined to
generate a similarity score and lineage information for each candidate match.
Notations: Candidate match: the schema matcher generates a list of
candidate matches, of the form e → e′, where e and e′ are single elements
in either schema.
Tactics: Schema matching can be considered as a searching problem in a
huge space, i.e., the number of candidate matches is quite large. There are
an unbounded number of functions for combining attributes in a schema,
and each one of them could be a candidate match. To search the space
effectively, we employ a set of base matchers. Each base matcher considers
32
Chapter 3. Problem Formulation
a meaningful subset of the space, corresponding to specific types of semantic
information, e.g., element name, type, etc.
The generic semantic match can be either of the form 1-to-1 or the
form 1-to-n (one source element corresponds to multiple target elements).
However, since the focus of our work is to identify the relationship between
schema elements, rather than finding the exact transformation rules, we
consider the case of 1-to-n matches as n 1-to-1 matches, and identify the
relationship between the source element and each target element, which
significantly simplifies the searching complexity.
There are a number of different kinds of matchers that we can consider.
In the current implementation of SeMap, the base matchers include the
label matcher, the sense matcher, the type matcher, the structure matcher.
SeMap’s architecture is flexible enough that it would be easy to extend it
to include other base matchers.
The partial results produced by the basic matchers are combined to
generate the similarity score and lineage information for each candidate
match. Currently, the similarity score is calculated as a weighted sum of the
similarity estimation in all base matchers. And the lineage information is
the union of all the semantic evidences found by the base matchers.
3.4.2 Match Selector
The match selector is responsible for assigning each schema element to a
match, which as whole has the minimum uncertainty, and outputs two best
match assignments Maps and Mapt for the elements of S and T respectively.
For example, as shown in Figure 3.3, from the perspective of source schema,
33
Chapter 3. Problem Formulation
the Maps may be this set of matches: ‘class’ : ‘course’, ‘professor’ : ‘faculty’,
‘dept’ : ‘college’, ‘instructor’ : ‘faculty’, and ‘TA’ : ‘grad TA’, while the Mapt
may consist of this set of matches: ‘course’ : ‘class’, ‘college’ : ‘dept’, ‘ugrad
TA’ : ‘TA’, ‘grad TA’ : ‘TA’ and ‘faculty’ : ‘professor’. These two mappings
are then merged to form the final mapping as shown in Figure 3.3. More
detailed discussion is referred to Section 4.2.2.
Functionality: Match selector takes as input the list of candidate
matches associated with their similarity scores, and search for the best
global assignments from the set of candidate matches. Technically, from
the pool of candidate matches, it selects a proper candidate match for each
source/target element. Here ‘proper’ means that the candidate match has
high probability (similarity score) and it satisfies user-defined domain con-
straints to the maximum extent.
Notations: Constraint: Constraint refers to the regularities imposed
on the generated mapping. Roughly, in our system, we consider two kinds
of constraints. (1) Schema-independent constraints are those imposed by
the meta-meta data model languages, which express general rules that each
model need to obey. For example, the Is-a relationship must be acyclic. (2)
Schema-dependent constraints refer to those regularities imposed on specific
schemas and data of the sources in the domain.
Tactics: The functionality of the match selector is similar in spirit to the
constraint handler as described in iMAP [4], which applies a set of domain
constraints to select a subset of candidate matches. However, we propose a
novel statistical model to incorporate both match uncertainty and domain
constraints in the same framework, and express the match selection as a
34
Chapter 3. Problem Formulation
constrained optimization problem, for which effective solution is available.
We believe that if user interaction is involved, the accuracy of prediction
can be greatly improved. In SeMap, we apply the technique of active learning
to identify the critical points in selecting the matches where user interaction
is maximally useful, so that the user effort can be significantly reduced.
3.4.3 Mapping Assembler
The mapping assembler combines Maps and Mapt, identifies the relation-
ship embedded in the selected matches, and assembles them into a generic
semantic mapping that includes richer semantic relationships. For example,
by consulting the associated lineage information, specifically, that the labels
of the element ‘TA’ is a substring of that of ‘grad TA’, the two elements are
detected of Is-a relationship. A detailed discussion of this part is referred
to Section 4.3.
Functionality: The best match assignments Maps and Mapt are com-
bined to form a final mapping. Then for each match in the final mapping,
the mapping assembler extracts related semantic evidences from lineage in-
formation to identify the implicit relationship type.
Tactics: The combining of the two match assignments Maps and Mapt
is achieved by ranking each match according to their contribution to the
final mapping, that is (1) the likelyhood of this match (2) the violation
of domain constraints by including this match in the final mapping. The
highest subset of matches are selected to form the final mapping.
To identify the implicit relationship of each selected match, all types of
semantic evidences generated by schema matcher are considered. For each
35
Chapter 3. Problem Formulation
type of semantic evidence, a set of heuristic rules are specified. The results
are then combined to vote for the final decision of the relationship type.
In this chapter, we formalize the problem of creating generic seman-
tic mapping, and give an overview of the schema matching system SeMap.
We propose the techniques that (1) identify the correspondences (candidate
matches) of generic semantic relationships; (2) select the matches from the
pool to form the final mapping; (3) determine the relationship types implicit
in the selected matches. In the next chapter, we present the SeMap system
in detail.
36
Chapter 4
SeMap System
In this chapter, we present the technique details of our schema matching
system SeMap. As shown in Chapter 3, the overall architecture of SeMap
consists of three main parts, the schema matcher, the match selector, and
the mapping assembler, which are responsible for finding correspondences
(candidate matches), selecting a subset of candidate matches to form a map-
ping and identifying the implicit relationships respectively. The three parts
are discussed in detail in Section 4.1 4.2 and 4.3 respectively. To illustrate
how SeMap can produce the generic semantic matches, we show the match-
ing process of the example in Figure 3.3 along introducing the components
of our system.
4.1 Schema Matcher
One problem facing schema matching/ontology alignment is the lack of suf-
ficient semantic evidence to discover matches, which is even more severe if
one intends to identify generic relationships between schema elements. A
key conclusion from previous research is that an effective schema match-
ing tool requires a combination of base techniques, e.g., linguistic matching,
structure matching, detecting overlapping of data instance, etc [22]. Hence
37
Chapter 4. SeMap System
we follow a composite approach in detecting potential correspondences.
4.1.1 Base Matcher
The components of a schema matcher are a set of pre-existing matchers
that exploit any available information, and incorporate base techniques in
a uniform framework, such like COMA [3] and iMAP [2], as discussed in
Chapter 2. In these frameworks, a set of base matching approaches (aka
matchers) are organized as a matcher library. They discover initial corre-
spondences based on different types of semantic evidence, e.g., schema level,
data-instance level, corpus, and ontology.
With the equipment of this set of base matchers, the problem of dis-
covering initial matches can be modeled as a search in the possible space
using various matchers, each of which exploits a meaningful subset of the
space [2].
Based on the level of granularity on which matching is performed, base
matchers can be classified as either element-level matchers, or structure-
level matchers. The former computes mapping elements between individ-
ual nodes, and the latter computes mapping elements between subgraphs.
Named with the semantic resource they exploit, the base matchers we imple-
ment in SeMap include sense matcher, label matcher, type matcher, ontology
matcher, data instance matcher and structure matcher. Among them, the
sense matcher, the label matcher, the type matcher, the ontology matcher
and the data instance matcher are on element-level, while the structure
matcher is on structure-level.
38
Chapter 4. SeMap System
4.1.2 Similarity Score and Lineage Information
In addition to the set of initial matches (correspondences), we also expect the
schema matcher to provide the following information: (1) similarity score.
Each candidate match m is associated with a similarity score Sim(m) ∈[0, 1], indicating the belief about its uncertainty, with 1 meaning perfectly
certain; (2) lineage information. For each initial match, one records the
flow of information in and out of the system, such as assumptions, domain
knowledge, etc. From this lineage information, one can trace how this match
is generated. Lineage information will be valuable in discovering the generic
relationships.
The task of similarity evaluation has been discussed in previous work [2,
3, 4], which exploits various types of similarity information, and employs
learning, statistical or heuristic techniques. The module of recording lin-
eage information is similar in spirit to the explanation module in [2], which
provides explanation for user questions posed on the generated matches,
such as explaining existing match, absent match, or match ranking. How-
ever our work focuses on distinguishing different semantic relationships. The
implementation of similarity evaluation and lineage information recording
in SeMap will be specified along presenting each base matcher.
4.1.3 Element-Level Matcher
Element-level matching techniques analyze the information at individual
elements. At the element level we exploit all the techniques discussed in the
literature [7, 15, 22]. However these techniques can not be directly applied,
39
Chapter 4. SeMap System
instead they must be extended to support the similarity score and lineage
information.
4.1.3.1 Label Matcher
The label-based matcher finds semantically related elements by evaluating
the syntactic similarity of their labels (names). Typically it will find as
similar names ‘Match’ and ‘match’, but not ‘match’ and ‘alignment’.
Before applying strict string comparison, we employ a number of stan-
dard natural language preprocessing procedures that can help greatly im-
prove the results of comparison:
• Case normalization. It converts each alphabetic character in a string
to its lower-case counterpart;
• Soundex elimination. Soundex is an encodings of names based on their
pronunciation instead of their spelling, e.g., ‘4U’ and ‘for you’;
• Digit suppression. It suppresses digits and leaves characters only, how-
ever it needs to be used with care, since there are cases digits are
semantically meaningful, e.g., soundex, or chemical names;
• Canonicalization. This is the procedure of converting names to its
standard form by stemming and some other techniques. This is impor-
tant to deal with symbols with special prefix/suffix [22], e.g., ‘cname’
→ ‘customer name’, and ‘empno’ → ‘employee number’;
• Tokenization. The label may be comprised of a set of tokens, e.g.,
‘french-course’ is segmented as ‘french’ and ‘course’;
40
Chapter 4. SeMap System
• Stopword elimination. It eliminates those frequently-used words that
can be found in a list (usually like, ‘to’, ‘a’, . . .).
The preprocessing steps take as input element labels, and produce for
each label a set of tokens whose conjunction carries its meanings. For ex-
ample, the label of ‘grad-TA’ is converted to the tokens of ‘graduate’ and
‘TA’.
The similarity of two strings can be defined in various ways, e.g., Ham-
ming distance, substring similarity, N-gram distance, edit distance, etc [10].
In our implementation, we adopt the N-gram distance, which is proved
effective in information retrieval research, e.g., Rondo [17]. N-gram distance
works as follows: let ngram(s, n) be the set of substrings of string s of length
n, the N-gram distance of strings s1 and s2 is defined as
δ(s1, s2) =|ngram(s1, n) ∩ ngram(s2, n)|
n ∗min(|s1|, |s2|)
The normalization guarantees the N-gram distance within the range [0, 1].
Let T1 and T2 be the sets of tokens of the two elements respectively, the
similarity score is defined as
Sim(T1, T2) = 1− mint1∈T1,t2∈T2
δ(t1, t2)
i.e., it corresponds to the minimum distance between the tokens of two
elements.
The lineage information recorded by the label matcher includes (1) if
the token sets of the two elements are equivalent; (2) if any tokens of the
41
Chapter 4. SeMap System
two labels overlap; (3) if any tokens of the two elements are prefix/suffix
or substring/string. As an example, in Figure 3.3, the tokens of ‘TA’, ‘grad
TA’ and ‘ugrad TA’ overlap, which is recorded as the lineage information of
matches ‘TA’:‘grad TA’ and ‘TA’:‘ugrad TA’. From such semantic evidence,
one can infer the possible relationships existing between the two elements,
which will be discussed in Section 4.3.
4.1.3.2 Type Matcher
The type of schema elements carries the information of data type (e.g, string,
integer, float, etc), value domains (e.g., integer in the range of [1, 12]), and
key characteristics (e.g., unique, primary, foreign). The type matcher deter-
mines the similarity of schema elements based on such semantic information.
In our implementation, we consider six commonly used basic data types,
namely String, Integer, Float, Date, Nodata, Enum. The first four data types
are self-explanatory. The Nodata type means that the element has no asso-
ciated data instance, which usually appears at non-leaf node in the schema,
e.g., XML tree. Enum is short for Enumerate, e.g., an element may have the
instance from a set of color {‘red’, ‘yellow’, ‘blue’}. Note that all these basic
data types can be combined sequentially to from a complex data type, i.e.,
Composition data type. For example, the address format ‘Street + P.O.Box
+ City’ can be considered as a composition of String, Integer and String.
In implementing the type matcher, we define the possible relationship
between two data types as one of equivalent, compatible and incompatible. We
determine the relationship between two types using the following heuristic
rules:
42
Chapter 4. SeMap System
• The same basic data types are equivalent;
• All the basic data types are mutually incompatible, except that the
Integer type is compatible with that of Float;
• Two composite data types are compatible if one appears as a subse-
quence of the other, e.g., {String, Integer, String} is compatible with
{Float, String, Integer, String}, while {Integer, String, String} is not;
• Two value domains are compatible if one is a subset of the other, e.g.,
positive integer number is compatible with positive float number.
In SeMap, the similarity score produced by the type matcher is defined on
three metrics, data type, value domain (if there is any), and key character-
istics (if there is any). For equivalent data types, it produces 1, compatible
0.5, and incompatible 0.
The lineage information recorded by the type matcher includes (1) if the
data types of the two elements are equivalent, compatible or incompatible;
(2) if the data types are composite, and the number of basic data types
involved; (3) if the value domains of the two elements overlap; (3) if their
key properties (if any) are the same. For example in Figure 3.3, the elements
‘TA’, ‘grad TA’, ‘ugrad TA’, ‘professor’, ‘instructor’ and ‘faculty’ all have the
same type String, which means they are possibly related. While ‘class id’
and ‘course no’ not only have the same data type, but also the same key
property since they are both unique in the two schemas.
43
Chapter 4. SeMap System
4.1.3.3 Sense Matcher
It is common that the same concept has different semantically equivalent rep-
resentations in different schemas. Such implicit similarity can hardly be de-
tected by syntactic analysis techniques. For example, ‘car’ and ‘automobile’
are synonyms, and refer to the same concept; while ‘book’ and ‘article’ have
the same hypernym ‘publication’, and they are considered similar. Such
semantic-based matching is especially valuable for schemas with relatively
flat structure, and other scarce affiliated syntactic information (e.g., data
instances). Moreover, since we are attempting to deal with generic rela-
tionships, this semantic matching technique provides important hints for
detecting the embedded relationships.
Exploiting synonyms and hypernyms however requires the use of thesauri
or dictionaries. In the implementation of SeMap, we employ WordNet as
the dictionary, which is an English lexical database. In WordNet, the senses
(atomic meanings of a word or expression) of words are organized into sets
of synonyms (synsets), and synsets are in turn organized into hierarchical
based on their semantic relationships.
The basic idea of the sense matcher is to first populate the tokens of
each element with their senses in WordNet. The senses of an element is thus
the union of the senses of all its tokens. By comparing their senses, one can
evaluate the similarity of the corresponding two elements.
The possible relationship between two synsets X and Y considered in
our system include:
• Holonym. Y is a holonym of X if X is a part of Y ;
44
Chapter 4. SeMap System
• Hypernym. Y is a hypernym of X if X is a (kind of) Y ;
• Hyponym. X is a hyponym of Y if X is a (kind of) Y ;
• Meronym. X is a meronym of Y if X is a part of Y .
We measure the similarity score of two elements in terms of the rela-
tionship of their senses, as defined as follows: (1) 1 if there is at least one
sense of the first label which is the same or a synonym of the second, i.e.,
they share the same synset; (2) 0.5 if there exists at least one sense of one
label that has a sense of the other as a hypernym, a holonym, a hyponym
or a meronym; (3) 0 if two labels share no sense in common. The lineage
information records the number of relationships detected for each kind of
holonym, hypernym, hyponym and meronym. For example, in Figure 3.3,
‘course’ and ‘class’ are synonyms; one sense of ‘department’ is a meronym of
that of ‘college’; while both ‘professor’ and ‘instructor’ have senses which are
hyponyms of that of ‘faculty’.
Note that within the context of the schema, the token may have only a
small subset of its possible senses. The pruning of those fake senses depends
on other matchers. This pruning will be explored in our future work.
4.1.3.4 Ontology Matcher
Domain-specific ontology provides a powerful tool to identify related schema
elements. The basic idea is to first map the schema elements to their counter-
parts in the domain-specific ontology, (which may requires enhancing schema
elements by populating with their semantically equivalent representations),
45
Chapter 4. SeMap System
and then identify the relationships of schema elements by that between their
counterparts in the ontology.
4.1.3.5 Data-Instance Matcher
Instance-level data provides important insight into the semantic meaning
of the schema elements. This is especially valuable when the schema-level
information is limited. Even when substantial schema information is avail-
able, the use of data instance-level matching can also be used to uncover
incorrect interpretations of schema information.
Specifically, some useful matching techniques that can be applied to data
instances include: (1) For text data type, information retrieval techniques,
such as linguistic characterization evaluates the similarity of two schema
elements by comparing the relative frequencies of words and combinations
of words in their data instances; (2) For numerical data type, statistical
characterization such as numerical value ranges and averages, can provide
insight into the similarity of the corresponding schema elements.
The main benefit of evaluating instances is a precise characterization of
the actual contents of schema elements, which can moreover, help enhance
those schema-level matchers. For example, the value ranges found by the
data-instance matcher can improve the effectiveness of the type matcher.
4.1.4 Structure-Level Matcher
While element-level matcher provides background or context information
for each pair of elements, one still needs to consider the matching problem
within the picture of the whole schema, which is achieved by a structure-level
46
Chapter 4. SeMap System
matcher.
Within the structure-level matching techniques, each schema is viewed as
a labeled graph, representing schema elements and their inter-relationships,
and the similarity comparison between a pair of nodes is based on their po-
sitions in their graph and their relationship with neighboring nodes. The
intuition behind structure-level matcher is that if two schema elements from
two schemas are highly related, their neighbors having same relationship to
them might also be related to some extent. Matching graphs is a combina-
torial problem that is computationally expensive. It can usually only solved
by approximate approaches.
The structure-level matcher employed in SeMap is Similarity Flood-
ing [15]. It solves the optimization problem by a fixed-point algorithm.
Based on an initial mapping generated by other matching techniques, simi-
larity flooding propagates the similarity of mapped elements to adjacent ones
which have similar relationship to the mapped elements. The algorithm ter-
minates after a fixed point has been reached. The output of the similarity
flooding is a similarity score for each pair of schema elements from the two
schemas. For example, in Figure 3.3, if the elements ‘class’ and ‘course’ are
detected as highly similar, then the elements ‘TA’ (which has Has-a relation-
ship to ‘class’), and ‘ugrad TA’ (which has Has-a relationship to ‘course’) are
also likely to be similar. Note that structure-level matcher is based only on
initial mapping information and structure of graph, and does not consider
any other semantic information. Its output does not contain any lineage
information.
47
Chapter 4. SeMap System
siminitial
label
sense
type
token
sense
type
com
bin
ing matcher
simfinal
lineage infopreprocessing
WordNet matcher
labelmatcher
matcher
structure
...
Figure 4.1: Architecture of schema matcher. It consists of three layers, basematcher, combining layer and structure matcher.
4.1.5 Architecture of Schema Matcher
Instead of applying the matchers discussed above individually, we explore
their interaction in order to achieve a best matching quality. The basic archi-
tecture of the schema matcher is illustrated in Figure 4.1. It mainly consists
of three layers, element-level matcher, combining layer and structure-level
matcher.
Element-level matchers include the label matcher, the sense matcher,
the type matcher, etc, which find the similarity between schema elements
by exploiting their element-level information. Note that some element-level
matchers need preprocessing step to facilitate matching. For example, the
preprocessing step converts the labels of elements into meaningful tokens,
which can then be processed by the label matcher; the tokens of each element
should be first populated with their senses in WordNet, in order to be used by
the sense matcher. The output of each element-level matcher is (1) similarity
score and (2) lineage information, as discussed in previous section.
In the combining phase, the similarity scores produced by the element-
level matchers are integrated to form a unified score. The combining scheme
48
Chapter 4. SeMap System
can be of various forms, as long as it guarantee that the final result is well
normalized, i.e., in the range of [0, 1]. In the implementation of SeMap, we
follow a simple linear combining scheme, that is, each element-level matcher
em is associated with a weight wem, and the similarity score Simem produced
by that matcher is damped by that weight when computing the unified score.
Formally, the similarity score Sim =∑
em wemSimem, where∑
em wem = 1
guarantees that the result is well normalized. This set of weight parame-
ters must be tuned carefully in order to achieve the optimal result. And
the tuning of the parameters is problem-specific, i.e., specific for different
input schemas, which is not the focus of this thesis. An automatic learning
approach is referred to [27].
As an example, in Figure 3.3, the elements ‘instructor’ and ‘faculty’ are
detected as equivalent by the type matcher, similar by the sense matcher,
and unsimilar by the label matcher. If all the mathers have the same weight,
and the equivalent, similar and unsimilar matches have similarity scores 1,
0.5, and 0 respectively. Then ‘instructor’ and ‘faculty’ have the unified score
(1 + 0.5 + 0)/3 = 0.5.
The unified similarity scores generated in the second phase are fed into
the structure-level matcher as initial mapping. The structure matching is
then performed to produce the final similarity score for each pair of elements.
In conclusion, for each potential match, the schema matcher produces
(1) similarity score, indicating the uncertainty about this match, and (2)
lineage information, recording how this match is detected.
49
Chapter 4. SeMap System
4.2 Match Selector
As shown in the SeMap architecture, Figure 3.5, the matcher selector is the
second main part of the architecture, which is responsible for selecting a sub-
set from the pool of initial matches produced by the schema matcher. It also
contains a user interaction module which exploits user feedback to improve
the mapping quality. In this chapter, we introduce a novel probabilistic
framework, which allows us to express the match uncertainty and domain
constraints in a uniform way, and match selection can then be transformed
as an optimization problem, shown in Section 4.2.2. Within this framework,
we reduce the need of user interaction and focus user attention by identify-
ing critical points where user feedback is maximally helpful, as discussed in
Section 4.2.3.
Given the pool of initial matches and associated similarity scores, the
match selector searches for a global optimal match assignment that satisfies
a set of domain constraints, e.g., in the case of Figure 1.1, user may specify
that each class has only one instructor, then mapping ‘instructor’ to two
elements will violate this constraint.
Most prior work studied this problem in the case of 1-1 correspondences.
They either first apply the constraints to narrow down the pool of possible
mappings, and then transform match selection to a stable marriage problem
in a bipartite graph [16], or assume that the initial matches are mutually
independent, and seek a trade-off between the uncertainty and constraints
satisfaction [4].
As indicated above, the similarity scores represent our belief about the
50
Chapter 4. SeMap System
initial matches, and the semantic constraints represent our preference for
certain matches, i.e., they are both probabilistic in nature. Hence it is
natural to adopt a probabilistic model to express match uncertainty and
constraints in a uniform way. In this thesis, we present such a framework,
which allows us to in turn model match selection as an optimization prob-
lem. In Section 4.2.1, we introduce the representation of this probabilistic
framework, and discuss the match selection problem and its solution within
this framework in Section 4.2.2.
4.2.1 Representation
In this section we show how to incorporate match uncertainty and semantic
constraints in a probabilistic framework. This approach is inspired by the
work of [12]. As a research proposal, [12] indicates the need of a probabilis-
tic model to express this uncertainty. Our work can be considered as an
implementation of this idea, moreover, we extend this framework to support
domain constraints. Hence it is a contribution to both schema matching and
mapping construction.
Formally, each schema element e is associated with a set of initial matches
Me, and can be assigned to a match m ∈Me. The probability of assigning
e to match m∗ ∈Me is defined as:
P (e = m∗) =Sim(m∗)∑
m∈MeSim(m)
where Sim(m) is the similarity score of match m provided by the schema
matcher. Intuitively, this represents the preference for matches with lower
51
Chapter 4. SeMap System
uncertainty. It is easy to verify that this model is well normalized:
∑
m∈Me
P (e = m) = 1
Each mapping consists of a set of matches, of which some may violate
the domain constraints specified by the users. To achieve the best quality
mapping, we take consideration of domain constraints in selecting matches.
Specifically, we associate each constraint c with a penalty function Πc(M),
checking whether a match assignmentM violates c and returning the degree
of violation. Describing Πc(M) is beyond the scope of this thesis, but it
must increase as the severeness of the violation. Each constraint c is also
assigned a weight αc, indicating its strictness. The weights can be hard
coded or learned from known mappings [14]. Given a set of constraints C,the probability of a set of elements E taking match assignment M (where
ei ∈ E is assigned to mi ∈M) is defined as:
P (E = M|C) =1Z
[∏
ei∈EP (ei = mi)](
Pc∈C −αcΠc(M))
where Z is a normalization constant, in order to guarantee that P (E =
M|C) ∈ [0, 1].3 In this equation, the total likelihood of the match assignment
for E is captured in∏
ei∈E P (ei = mi), while the exponential part represents
the penalty of violating the constraints. The more constraints that are
violated, the lower the probability is.3Formally, Z =
PMQ
ei∈E P (ei = mi)(P
c∈C −αcΠc(M))
52
Chapter 4. SeMap System
4.2.2 Bidirectional search
We intend to express the match selection as a well-defined mathematical
problem, so that existing formal approaches can be used. Within the prob-
abilistic framework introduced in Section 4.2.1, one can model match selec-
tion as an optimization problem, and a rich set of tools can be applied to
the problem. Since in this framework, the joint probability of the selected
matches measures the uncertainty of the mapping, for a set of schema ele-
ments E , match selection amounts to finding the match assignment M that
maximizes the probability P (E = M|C) under constraints C.It is infeasible to consider all possible combinations from the domain of E
to find the optimal match assignment. For example if each e ∈ E is associated
with k initial matches, one potentially has to consider |E|k combinations. It
is highly likely that E comprises a series of disjoint subset E1, E2, . . ., which
are mutually independent, i.e., in graphical model, the graph of E consists
of a set of separated sub-graphs. Hence we can optimize these independent
parts separately, that is
maxM
P (E = M|C) ≡ maxM1,M2,...
∏
i
P (Ei = Mi|C)
However it is possible that the size of Ei ∈ E can still be quite large,
leading us to considering more efficient solutions, including several effective
graphical model optimization algorithms proposed in machine learning com-
munity [9], e.g., graph-cuts, and several heuristic searching methods, e.g., A∗
algorithm. In the implementation of SeMap, we apply the A∗ algorithm [8]
to this optimization problem. A∗ algorithm is a graph search algorithm that
53
Chapter 4. SeMap System
finds a path from a given initial node to a given goal node (or one passing a
given goal test). It employs a heuristic estimate that ranks each node by an
estimate of the best route that goes through that node. It visits the nodes
in order of this heuristic estimate. In our implementation, each schema el-
ement is considered as a node, and the candidate matches are the possible
paths to the schema elements on the other side. The value of each path is
heuristically estimated, and a globally optimal assignment is selected.
The discussion above focuses on match selection for an arbitrary set
of elements E . In the context of complex match, the perspectives from
source and target schemas could be significantly different. For example, in
Figure 3.3, starting from the side of Schema T , one may discover that ‘grad
TA’ is a specialization of ‘TA’, but miss the fact that ‘TA’ consists of both
‘grad TA’ and ‘ugrad TA’.
To simplify the problem, one is willing to distinguish the perspectives
of source and target schemas, i.e., treat the match selection for source and
target elements separately. Thus instead of searching the best matches for
source elements Es or target elements Et separately, as in previous work, e.g.,
iMAP [2], one runs the optimization algorithms for both source elements Es
and target elements Et, i.e., bidirectional search. The result of bidirectional
search are two sets of matches Maps and Mapt. An example of Maps and
Mapt is shown in Figure 4.2, where from the side of S, the element ‘TA’
is assigned the correct match, however ‘instructor’ is not; while from the
perspective of T , the case is just opposite. Maps and Mapt will then be
merged to form a final complete mapping. The merge operation will be
discussed in Section 4.3.
54
Chapter 4. SeMap System
ugrad TA
grad TA
professor
facultyinstructor instructor
faculty
grad TA
schema T
TA
TA
schema Sschema S schema T
Figure 4.2: Partial match assignments from the perspectives of source andtarget schemas respectively.
4.2.3 Modeling user interaction
Capturing user feedback is crucial for improving mapping quality. Though
user feedback is used for error correction, mapping refinement, etc. [32],
modeling user interaction is a comparatively under-studied problem. Fol-
lowing we show that the probabilistic framework discussed above can be
extended to modeling user interaction.
The key to modeling user interaction is to identify those critical points
where feedback is maximally helpful, so that user attention can be focused on
important problems, and user workload is minimized. We present an active
learning solution to simulating the effect of user’s selection of candidate
match for schema elements. This is an extension of the approach discussed
in [31], where active learning is applied to learning the optimal parameters
in matching web interface. In our approach, the model elements are ranked
based on their potential information value, and user feedback is asked for
on those most informative ones.
A natural measurement of information value is entropy, which represents
the uncertainty about a signal or random invents. Formally, for a variable
55
Chapter 4. SeMap System
x with distribution P (x), its entropy H(x) is defined as
H(x) = −∑
x
P (x) log P (x)
Intuitively, entropy is a measure of randomness or uncertainty. The higher
the uncertainty, the larger the entropy value. In the context of match se-
lection, the entropy corresponding to each schema element measures the
uncertainty about its match assignment, which however does not take ac-
count of its influence on match assignment of other schema elements. This
deficiency of entropy leads to another metric, mutual information (IM):
I(x, y) = H(x)−H(x|y) = H(y)−H(y|x)
Intuitively, the mutual information of two random variables x and y, I(x, y)
expresses the reduction in the uncertainty about x by virtue of being in-
formed the value of y (or vise versa). Hence one seeks the element having the
maximum mutual information with other elements, i.e., the one whose match
assignment can maximally help identify the best match for others. Formally,
one seeks the most informative element e that maximizes∑
e′∈E I(e, e′). The
joint probability of P (e, e′|C) can be calculated according to Equation 4.2.1,
which will yield the most informative element.
Once the most informative element has been selected, this element should
be disambiguated by the user, which is the key tenant of active learning: the
system should prompt the user to disambiguate cases where the user’s input
will provide the most information to the system. Once the user’s selection on
56
Chapter 4. SeMap System
the most informative element is obtained, the setting of the match selection
problem needs to be updated accordingly. Specifically, assuming that the
match assignment of an element e is fixed, e.g., e = m∗, then P (e = m∗) = 1,
and P (e = m) = 0 (m 6= m∗). Based on the new belief about e, the joint
probability of variables is updated accordingly. Note that only the joint
probability of the subset of E involving e needs to be updated.
In this section, we present a novel probabilistic framework, which incor-
porates both match uncertainty and domain constraints in a uniform way.
The match selection is solved as a constrained optimization problem, and
user interaction is significantly reduced by identifying those most critical
points.
4.3 Mapping Assembler
The bidirectional search (Section 4.2.2) produces two sets of matches Maps
and Mapt, representing the correspondences from the perspectives of source
and target schemas respectively. The following component of the mapping
construction process is the mapping assembler, which is the final phase of
SeMap system, as shown in the architecture diagram, Figure 3.5. In the
mapping assembler, Maps and Mapt are combined to form a final generic
semantic mapping. Specifically, in this process, we aim to solve the fol-
lowing problems: select an optimal set of matches from both mappings
(Section 4.3.1); identify the relationship implicit in the selected matches
(Section 4.3.2); assemble these matches together to form a final, generic
semantic mapping (Section 4.3.3).
57
Chapter 4. SeMap System
4.3.1 Combining Maps and Mapt
By using our novel bidirectional searching to create Maps and Mapt, we
have considerably narrowed the number of matches that must be considered
in order to build our generic semantic mapping. Thus, it results in a much
smaller search space to be examined in determining semantic relationships
(such as the ones in Figure 3.3). To merge the Maps and Mapt to form a
final mapping, we present a heuristic approach, which our preliminary tests
have shown to work effectively in practice.
Let M denote the set of matches in Maps and Mapt. For each match
m ∈M, one calculates the reward of including it in the final mapping:
R(m) = Sim(m)(P
c∈CP
m′∈M−αcΠc(m,m′))
Intuitively, this reward function takes account of the similarity score of the
match, and the constraints it violates together with other matches in M.
One then filters those matches with reward R(m) lower than certain thresh-
old ε.
4.3.2 Identifying relationships
One key step of SeMap system is identifying the relationship implicit in
the selected matches. Since we intend to create generic semantic relation-
ships, there is little previous work done on this problem. In SeMap, we
propose a rule-based method to identify the implicit relationships. Because
the matches are identified by various schema matching techniques, based on
different semantic evidence, a uniform solution is hard to obtain. Instead,
58
Chapter 4. SeMap System
for each type of semantic evidence, we define specific rules to extract the
generic relationship embedded inside. Following, we show how to identify
the four specific relationships, Equivalent, Has-a, Is-a and Associates with
the help of semantic evidence.
We classify the semantic evidence into three categories, schema-level
(e.g., label, type, structure), semantic-level (e.g., sense), instance-level(i.e.,
data-level), and ontology. Usually a match is associated with semantic ev-
idence from multiple categories. We describe each category in more detail
below.
Schema-level evidence includes label, type, and structure information,
etc. Generally speaking, schema-level information alone is insufficient to
determine the embedded generic relationship. However, it provides support
for the results claimed by other semantic evidence, from which it can further
help deduct implicit relationships. In our implementation, corresponding to
each kind of lineage information, we applied the following heuristic rules:
• Label. Two elements with Equivalent relationship are likely to share
similar names; Two elements with Is-a or Has-a relationships tend
to have labels with prefix/suffix relationships; e.g., For example, from
the fact that ‘grad-TA’ has a substring/suffix of ‘TA’, we know that
probably that ‘grad-TA’ is a part or a specification of ‘TA’, which
depends on other type of semantic information;
• Type. Two elements with Equivalent or Is-a relationship probably
share the same data type, while if one element has a data type as a
subcomponent of that of the other, it is likely that they have Has-a
59
Chapter 4. SeMap System
relationship;
• Structure. Two non-leaf nodes, or two leaf nodes possibly have the
relationship of Equivalent or Is-a, while one non-leaf node and one
leaf node are likely to be of Has-a relation.
Semantic-level evidence is the sense of the label (name) of the schema
element. Based on the lineage information produced by the sense matcher,
one can infer the embedded semantic relationships. The heuristic rules are
as follows:
• If two elements share some same sense, it is highly possible that they
are semantically similar or equivalent;
• If some senses of one element act as the hypernym of that of the other,
then they may be in a Is-a relationship;
• If some senses of one element are the hyponyms of that of the other,
the two elements probably are of Is-a relation;
• If some senses of one element appear as the holonym or meronym of
that of the other, they can be in a Has-a relationship;
Instances (i.e., data) give the entity-level clues about the relationship
between schema elements. Hence instance-level evidence usually precisely
characterizes the actual content of schema elements. By studying the sub-
sumption, or distribution similarity of the data instances of schema elements,
one can discover the relationship that is difficult to identify on the schema-
level, due to the differences of schema designs.
60
Chapter 4. SeMap System
• If two elements have similar data distribution, they are likely to be
‘equivalent’;
• If the instance of one element x subsumes that of another element y,
it is likely that x ‘has-a’ y as its member;
• If the instances of x and y intersects, it is possible that x ‘associates’
with y.
A domain-specific ontology provides alternative representations of con-
cepts in the domain, and their possible relationships. Since they directly
provide the information about the relationships of the identified matches, it
is trivial to refine these relationships to generic ones. For example, one can
simply follow certain mapping between the two relationship classification
systems, e.g., that between ontology modeling and meta-data management,
and convert the relationship from ontology/corpus to our generic relation-
ship types.
Note that a match is usually associated with various semantic evidence,
which possibly indicate however different types of generic relationships. In
combining the results suggested by various semantic evidence, we follow a
voting scheme: each type of evidence is associated with a weight, and a
unified conclusion is obtained by linearly combining these results as in the
case of schema matcher (Section 4.1.5). Generally speaking, the weights
are set according to the relative plausibility of semantic evidences. In our
system, we regard the types of semantic evidence has the following rank:
ontology > instance > sense > schema information, and set the weights
accordingly.
61
Chapter 4. SeMap System
schema T schema S
ma
(=) aa
(=)
(=)
(=)
b
b1
b2
mb
mb1
mb2Is-a
Is-a
Figure 4.3: Mapping assembling for matches of different types. Each 1-1equivalence match corresponds to one mapping element, while each elementof complex match is associated with one mapping element.
As an example in Figure 3.3, for the match of ‘professor’ and ‘faculty’, the
sense evidence suggests Is-a relationship, the type evidence indicates Equiva-
lent or Is-aequally, while the label and structure evidence provide no advice.
Assuming the four types of evidence have weights 0.4, 0.3, 0.2 and 0.1 re-
spectively, the voting result correctly identifies the match ‘professor’:‘faculty’
as Is-a relationship.
4.3.3 Assembling mapping
Finally, given that we have discovered the semantic relationships as in Sec-
tion 4.3.2, we must create the final mapping. In order to ensure that the final
mapping meets the specifications in the problem definition (Chapter 3), we
apply the following rules to create mapping elements: (1) For 1-1 equivalent
match, one mapping element is sufficient to represent them in the mapping;
(2) For 1-1 non-equivalent match or complex match, one mapping element
is created for each element of the match; (3) For each of other input element
not involving in match, one corresponding mapping element is created. All
62
Chapter 4. SeMap System
these cases are illustrated in Figure 4.3. Note that it is possible to iden-
tify the relationships between the mapping elements belonging to different
matches, in order to form a model-like complete mapping, using approach
similar to the merge operator [21], though it is not the focus of this thesis.
63
Chapter 5
Experimental Analysis
To evaluate the effectiveness of our schema matching system, we applied
SeMap to several real-world domains. Specifically, the experiments are per-
formed with two main goals:
• To evaluate the matching accuracy of our schema matching system.
Since our goal is to detect the complex relationship existing between
schema elements, the matching accuracy includes not only the corre-
spondences as measured in most previous schema matching systems,
but also the detected relationship types.
• To measure the relative contribution of different system components to
the result. Specifically, we are interested in measuring the performance
gain from (1) different base matchers (2) match selector including user
interaction.
This chapter presents the empirical analysis of the SeMap system. We
first describe the experimental setting in Section 5.1. This includes (1) the
dataset used in experiments (Section 5.1.1); (2) the expert mappings (Sec-
tion 5.1.2); (3) the metrics to evaluate matching results (Section 5.1.3) and
(4) the experimental methodology (Section 5.1.4). The evaluation results
of matching accuracy are then presented in Section 5.2.1. Next we show
64
Chapter 5. Experimental Analysis
the relative contribution of each system component to the final matching
results in Section 5.2.2. Finally, based on the evaluation results, we analyze
the strengths and weaknesses of our approach in Section 5.2.4.
5.1 Experimental Setting
5.1.1 Data Set
We evaluated SeMap on both synthetic and real datasets. The synthetic
dataset is the example shown in Figure 3.3. The real-life datasets are from
two domains, namely Real Estate, and Course Information. All the real
datasets are imported from the Illinois Semantic Integration Archive [29].
Specifically, the Real Estate dataset is a set of schemas describing the infor-
mation of houses for sale; And the Course dataset is a set of schemas on the
information of courses offered across different universities. All the schemas
used in evaluation are included in the Appendix A of this thesis. Some of
these schemas are associated with data instances (Real Estate and Course
Info). In our experiments, for the schemas with data instances, we exploit
the exact data format of the elements by looking at their data instances
in addition to the schema-level information. For schemas without data in-
stances (Synthetic), we evaluated our approach on schema-level information
only.
Since all the original real-life schemas are in DTD format, we performed
a converting operation to import the schemas into our model representation.
We faithfully mirror the structure and terms from source schemas. Because
in the model representation, every edge between two elements is of certain
65
Chapter 5. Experimental Analysis
domain schema # leaf/non-leaf # relationshipHomeseekers 25/3 27
Real Estate Texas 31/3 33Yahoo 23/2 24Reed 12/3 14
Course Info Rice 11/4 14UWM 15/4 18
Synthetic course 5 /1 5class 5/1 5
Table 5.1: Characteristics of the input schemas.
relationship type, while it is not the case in DTD, we set all the relationships
as Has-a by default. For example, if the schema in Figure 3.1 is represented
as DTD, which has no relationship type between the element ’course’ and
any other elements, SeMap by default regards all these links are of Has-a
relationship.
In preparing the data, it is necessary to apply some trivial data cleaning
operation, such like splitting ‘custno’ into ‘cust’ and ‘no’. This is not the
focus of our SeMap system, so we will not discuss it further in the thesis.
More detailed discussion of data cleaning can be found in [23].
5.1.2 Expert Mapping
In preparing the real-life datasets, we extracted three schemas from each
domain. We chose the schemas with complex structure, and among which
complex relationships can exist. The characteristics of these schemas after
preprocessing step are shown in Table 5.1, including the number of elements
(leaf and non-leaf), the number of relationships, the maximum depth of the
tree, etc.
66
Chapter 5. Experimental Analysis
# relationshipdomain schema S/T Equivalent Is-a Has-a total
Real Estate Homeseekers/Texas 18 6 12 36Homeseekers/Yahoo 20 0 11 31
Course Info Reed/Washington 11 0 7 18Reed/WSU 18 1 8 27
Synthetic course/class 2 4 3 9
Table 5.2: Characteristics of the expert mappings.
For each pair of schemas in the same domain, we created an expert
mapping, which act as the ‘standard answer’ to the matching problem. The
characteristics of the expert mappings are shown in Table 5.2. The table
shows the total number of matches, the number of matches of each specific
relationship type, also the percentage of elements involved in mapping from
both source and target schemas. Note that in expert mapping, we do not
consider the matches with the relationship Associates, because of its weak
semantics, though in matching, SeMap may find the Associates relationship,
when it lacks sufficient semantic evidence.
5.1.3 Evaluation Metrics
Following previous work on schema matching, we evaluated the performance
of our approach on three metrics: recall, precision, and F-measure [25]. Pre-
cision P represents the percentage of correctly identified matches over all
matches identified by the system; Recall R is the percentage of correctly
identified matches over the all correct matches in the given expert map-
ping. Formally, let #detected be the number of correct matches detected,
#mapping the total number of correct matches in the expert mapping, and
67
Chapter 5. Experimental Analysis
#result the total number of matches in the results, the recall R and preci-
sion P is defined as:
R =#detected
#mappingP =
#detected
#result
Recall and precision are inversely related, hence it is desirable to have one
measurement to incorporate both recall and precision. F-measure F (pre-
cisely F1 measure) and equally weights recall R and precision P :
F =2PR
P + R
As discussed above, we deal with not only correspondences, but also the
implicit relationships. Thus a correct match means (1) the found corre-
spondence is correct, (2) the identified relationship is exactly as that in the
expert mapping. So we measured the matching accuracy in two-fold: (1) the
total number of correct correspondences detected; (2) the number of correct
matches for each type of relationships, e.g., Equivalent, Is-a, etc.
5.1.4 Experimental Methodology
For each domain, we performed three sets of experiments, i.e., between
each possible pair of schemas. We first evaluated the matching accuracy
of SeMapfor each sets of data, and investigated how sensitive it is against
the setting of parameters; We then evaluated the relative contribution of
each component and user interaction to the final mapping.
Currently, the parameters needed to be tuned in our system are the
68
Chapter 5. Experimental Analysis
weights of different semantic evidences as produced by the various match-
ers, i.e., label (λl), type (λt), sense (λs), etc. The influence of these weights
parameters are referred to Section 4.1.5. The specific setting of parameters
for different datasets are as follows: (1) For the Real Estate dataset, the
parameters are set as λl = 0.3, λt = 0.5, and λs = 0.2, reflecting the ob-
servation that in the Real Estate dataset, the name, type and sense are all
important to identifying the implicit relationship, however due to a number
of abbreviation, e.g., ‘ac’ for ‘air conditioner’, the type information is espe-
cially important to finding the hidden correspondences; (2) For the Course
Info dataset, λl = 0.4, λt = 0.4, and λs = 0.2, which means the name and
type convey equally significant semantic information, and the sense carries
less importance weight due to a small quantity of synonyms, hyponyms, etc
in this dataset; (3) For the synthetic dataset, λl = 0.4, λt = 0.3, and λs
= 0.3. That is all types of semantic information carry equally importance
weights.
5.2 Experimental Result
We present in the following the matching performance of the SeMap system
on the synthetic and real datasets. We first show the matching accuracy
of the system exploiting all the semantic information, and then analyze the
contribution from different types of semantic evidences. Finally we measure
the performance gain from incorporating user interaction into the match
selector phase.
69
Chapter 5. Experimental Analysis
Real Estate 1 Real Estate 2 Synthetic Course Info 1 Course Info 20
20
40
60
80
100Matching Accuracy
Rec
all (
%)
Real Estate 1 Real Estate 2 Synthetic Course Info 1 Course Info 20
20
40
60
80
100
Pre
cisi
on (
%)
Real Estate 1 Real Estate 2 Synthetic Course Info 1 Course Info 20
20
40
60
80
100
F−
mea
sure
EquivalentIs−AHas−AOverall
Figure 5.1: Matching accuracy of SeMap. The three plots show the recall,precision and F-measure of the matching results for the three relationshiptypes Equivalent, Has-a, Is-a and total correct matches respectively.
5.2.1 Matching Accuracy
Figure 5.1 shows the matching results of SeMap over the five datasets, mea-
sured using the metrics recall, precision and F-measure. As discussed above,
we measured both the accuracy of identified matches, and identified rela-
tionships. The four bars (from left to right) show the matching accuracy
for the three relationship types Equivalent, Has-a, Is-a (correct match and
correct relationship) and total number of correct matches (not including the
relationship) respectively.
The result shows that SeMap achieved a high average matching accuracy
not only in detecting the correct correspondences (total number of correct
70
Chapter 5. Experimental Analysis
matches), but also in detecting the implicit relationships. Take F-measure
as an example, the percentage of correct correspondences range from 70%
to 100%, and the average accuracy of detected relationships are 79%, 69%
and 64% for the three relationships respectively. The accuracy of correspon-
dences detected is comparable to the results claimed in iMAP [2] (Tested
on Real Estate dataset, 77-100% for 1-to-1 matches and 50-86% for 1-to-
n matches), and that produced by [5] (Tested on Real Estate dataset and
Course Info dataset, 73% recall and 67% precision without domain ontology,
and 94% recall and 90% precision with domain ontology).
Note that our system did not resort to any domain ontology or data
frames as used in [5]. Moreover, the problem we are tackling is more dif-
ficult than these works in the sense that we have to not only identify the
correct correspondences, but also extract the generic semantic relationships
implicit in the correspondences. In Section 5.2.4, we will identify the reasons
that prevent SeMap from achieving higher accuracy in identifying correspon-
dences and extracting implicit relationships.
Looking at these results in further detail, we note that in the Real Es-
tate 1 and Course Info 2 datasets however, SeMap has low precision (about
40% for Real Estate 1 and 20% for Course Info 2) in identifying the Has-a
relationship. To further explore why this is the case, we provide more de-
tailed analysis of detailed composition of the matches identified by SeMap,
i.e., of the matches identified as each type, how many are (1) correct corre-
spondence and correct relationship (2) correct correspondence but incorrect
relationship (3) incorrect correspondence (non-match). The result is shown
in Figure 5.2, where for each type of relationship, we look into the compo-
71
Chapter 5. Experimental Analysis
sition of matches as identified by SeMap. It can be noticed in both case
of Real Estate 1 and Course Info 2, incorrect correspondences are the main
cause of the low precision in identifying matches of Has-a relationship. By
barring those incorrect correspondences, the precision of SeMap in identi-
fying semantic relationships is much more higher. The accuracy of SeMap
after barring incorrect correspondences is shown in Figure fig:impaccuracy,
where the precision reaches near 100% over most datasets. Note that the
synthetic dataset is very small, despite its low precision in this case, SeMap
misclassified only one match. We would like to argue that finding correspon-
dences is the focus of schema matching, and our SeMap system is built on
a moderate set of current schema matching techniques, it can be expected
that with more powerful schema matching techniques or domain knowledge,
the number of incorrect matches can be significantly reduced. In the de-
tailed error analysis, one can also notice that on average SeMap has higher
accuracy in identifying Is-a relationship than Has-a relationship. This can
be explained by the fact that the thesaurus employed in SeMap (WordNet)
returns comprehensive information of meronym/holonym relationships be-
tween two words, even though the senses may not be exactly that appearing
in the context of schemas. The way to improve the accuracy of identifying
Has-a relationship can be (1) prune the noise senses in using the thesaurus,
which will be explored in our future work; (2) lowering the weight of the
sense matcher, which however may affect finding other types of relationships.
72
Chapter 5. Experimental Analysis
compositionschemas relationship Equivalent Is-a Has-a non-match
Real Equivalent 13 0 0 4Estate 1 Is-a 0 2 0 0
Has-a 4 2 8 7Real Equivalent 9 0 0 3
Estate 2 Is-a 0 5 0 3Has-a 3 0 7 4
Synthetic Equivalent 2 1 0 0Is-a 0 1 0 0
Has-a 0 0 4 0Course Equivalent 11 1 0 1Info 1 Is-a 0 2 0 0
Has-a 0 0 0 0Course Equivalent 12 0 0 0Info 2 Is-a 0 4 0 1
Has-a 0 0 1 4
Figure 5.2: Error analysis of the resulting mappings.
Real Estate 1 Real Estate 2 Synthetic Course Info 1 Course Info 20
20
40
60
80
100
Pre
cisi
on (
%)
EquivalentIs−AHas−A
Figure 5.3: The precision of SeMap after pruning incorrect matches. Thebars from left to right shows the matching results for the three relationshiptypes Equivalent, Has-a, Is-a respectively.
73
Chapter 5. Experimental Analysis
5.2.2 Component Contribution
In this section, we studied the relative contribution of different types of
semantic evidences to the matching results. Specifically, we tested three
types of semantic evidences, element label (name), element type and element
sense, which are available in most of schemas.
In each test, we left out one type of semantic evidence, and used the rest
two to identify the correspondences and the associated relationship types.
The weights of the two types of semantic evidences were both set as 0.5,
reflecting a neutral assumptions that both evidences are equally important.
The two plots of Figure 5.4 show the F-measure of identified matches
(correspondences) and identified relationships respectively. Within each
plot, the bars from left to right represent the results produced by SeMap
without label evidence, without type evidence, without sense evidence and
complete SeMap respectively. It is observed that in almost all the cases,
each type of semantic evidence contributes to the overall performance, ex-
cept for the dataset of Real Estate 2, where the F-measure of relationships
identified by a complete system is worse than that by a system without type
evidence. This can be explained by the fact that in some cases, different
semantic evidences are in conflict in determining the implicit relationships,
and some kinds of evidence may dilute the correct diagnosis.
5.2.3 Incorporating User Feedback
We also studied the performance gain by incorporating user feedback into
SeMap system. We applied the user interaction mechanism as discussed in
74
Chapter 5. Experimental Analysis
Real Estate 1 Real Estate 2 Synthetic Course Info 1 Course Info 20
20
40
60
80
100Matching Accuracy
F−
mea
sure
(%
)id
entif
ied
mat
ches
Real Estate 1 Real Estate 2 Synthetic Course Info 1 Course Info 20
20
40
60
80
100
F−
mea
sure
(%
)id
entif
ied
rela
tions
hips
semap w/o labelsemap w/o typesemap w/o sensecomplete semap
Figure 5.4: Relative contribution of different types of semantic evidences tothe matching results of SeMap. The two plots (from up to down) show the F-measure of identified matches (correspondences) and identified relationshipsrespectively.
Chapter 4: Each candidate match of a schema elements is associated with an
uncertainty estimate, based on its similarity score and domain constraints.
At each iteration, the schema element whose candidate matches have the
largest mutual entropy with other elements is selected, and user is asked to
provide the correct answer for this element. It then updates the uncertainty
estimation. The procedure repeats until a threshold is reached.
In this test, we measured the number of correct matches needed to be
provided before a perfect set of matches is reached. Note that we only
tested on the accuracy of identifying the correspondences, but not extracting
the implicit relationships. F-measure of correct correspondences versus the
75
Chapter 5. Experimental Analysis
amount of user interaction needed (percentage of expert matches provided
over the total number of matches) is shown in Figure 5.5 (the synthetic
dataset is skipped, since 100% correct correspondences is reached without
any help of user interaction). It is clear that over the datasets Real Estate
1, Real Estate 2 and Course Info 2, F-measure reaches its maximum value
when about 20% of expert matches is provided; While over the datasets
Course Info 1, about 10% of expert matches lead to a perfect set of matches.
This result suggests that SeMap can effectively incorporate user interaction,
that is it needs only a few equality constraints provided by users to achieve
high-accuracy matches. Note that the maximum possible F-measure may
not necessarily be 100%, this can be explained by the fact that F-measure
incorporates both precision and recall, and it is not always possible to achieve
100% recall (all correct matches are detected), due to the limit of the current
matching techniques that SeMap takes as input.
5.2.4 Discussion
In this section, we discuss the difficulties that prevent SeMap from achiev-
ing better performance in identifying both correspondences and implicit re-
lationships. Specifically, there are three reasons facing the current SeMap
system:
• More types of base matchers are needed in order to fully exploit the
possible schema-level information. In our current implementation,
only four types of base matchers are employed, which sometimes leads
to low precision, as discussed in Section 5.2.1, and it can be expected
76
Chapter 5. Experimental Analysis
0 20 40 60 80 10080
85
90
95
100Real Estate 1
provided matches (%)
F−
mea
sure
(%
)
0 20 40 60 80 10065
70
75
80
85
90
95
100Real Estate 2
provided matches (%)
F−
mea
sure
(%
)
0 20 40 60 80 10093
94
95
96
97Course Info 1
provided matches (%)
F−
mea
sure
(%
)
0 20 40 60 80 10080
85
90
95
100
Course Info 2
provided matches (%)
F−
mea
sure
(%
)
Figure 5.5: F-measure of correct correspondences versus the amount of userinteraction (percentage of expert matches provided over the total number ofmatches). The curves for four datasets (Real Estate 1/2, Course Info 1/2)are shown.
that more types of evidences can lead to finding more hidden informa-
tion;
• Data instances should be taken into consideration. It is only on the
data-instance level that one can understand the exact format and se-
mantics of schema elements.
• Parameter tuning is critical for schema matching. In SeMap system,
there are five parameters (three for weights of evidence, two thresh-
olds), which interact in a complicated way. A systematic (e.g., learning
approach [27]) tuning scheme can significantly improve the results.
77
Chapter 6
Conclusion & Future Work
In this thesis we presented an approach of identifying generic, semantic re-
lationships between the elements of two models (e.g., database schemas,
ontologies, web interfaces, etc) based on initial match information provided
by current schema matching techniques. Our main contributions include (1)
we point out the importance of the problem of identifying generic semantic
relationships between schema elements; (2) we designed an architecture for
semi-automatically constructing generic semantic mappings based on initial
correspondence information; (3) we created a novel probabilistic framework
that transforms match selection to a well-defined mathematical optimization
problem; (4) we effectively modeled the user interaction to help focus user at-
tention and minimize user effort, by detecting critical points where feedback
is maximally useful; (5) we proposed effective solution to extracting rela-
tionship implicit in matches based on various types of semantic evidences;
(6) we implemented a prototype system embodying the innovations above
and a set of experiments to illustrate the effectiveness of our approach.
We envision several future directions. The first is to incorporate our
system into a model management system [17], and explore the new possi-
bility in meta-data management brought by generic, semantically rich map-
78
Chapter 6. Conclusion & Future Work
pings. Second, we would like to enhance our current prototype by adding in
more matching techniques, and considering more types of semantic evidence.
Third, more domain constraints (e.g., frequency, contiguity, nesting [4], etc)
would be taken in consideration to enhance the current implementation of
match selector. Finally, in the process of judging which relationship is best
represented by the input correspondences, SeMap takes into account addi-
tional information (e.g., lineage information) that the matchers themselves
do not consider. As a result of this, if SeMapis unable to suggest an appro-
priate relationship, it may indicate that the input correspondence is wrong.
One future direction is to redirect such concerns to improve the quality of
the input match.
79
Bibliography
[1] Hans Chalupsky. Ontomorph: A translation system for symbolic knowl-edge. In Principles of Knowledge Representation and Reasoning, 2000.
[2] Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy, andPedro Domingos. imap: discovering complex semantic matches betweendatabase schemas. In SIGMOD ’04: Proceedings of the 2004 ACMSIGMOD international conference on Management of data, pages 383–394, New York, NY, USA, 2004. ACM Press.
[3] Hong-Hai Do and Erhard Rahm. Coma - a system for flexible combi-nation of schema matching approaches. In VLDB ’02: Proceedings ofthe 28TH international conference on Very large data bases, 2002.
[4] AnHai Doan, Pedro Domingos, and Alon Y. Halevy. Reconcilingschemas of disparate data sources: a machine-learning approach. InSIGMOD ’01: Proceedings of the 2001 ACM SIGMOD internationalconference on Management of data, pages 509–520, New York, NY,USA, 2001. ACM Press.
[5] David W. Embley, Li Xu, and Yihong Ding. Automatic direct andindirect schema mapping: experiences and lessons learned. SIGMODRecord, 33(4):14–19, 2004.
[6] A. Gal, G.A. Modica, and H.M. Jamil. Ontobuilder: Fully automaticextraction and consolidation of ontologies from web sources, 2005.
[7] Fausto Giunchiglia and Mikalai Yatskevich. Semantic matching. Knowl-edge Engineering Review, 18(3):265–280, 2004.
[8] P.E. Hart, N.J. Nilsson, and B. Raphael. A formal basis for the heuristicdetermination of minimum cost paths. IEEE Transactions on SystemsScience and Cybernetics (SSC), pages 100–108, 1968.
[9] Finn V. Jensen. Bayesian Networks and Decision Graphs. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2001.
80
Chapter 6. Conclusion & Future Work
[10] Yannis Kalfoglou and Marco Schorlemmer. Ontology mapping: thestate of the art. Knowledge Engineering Review, 18(1):1–31, 2003.
[11] Jaewoo Kang and Jeffrey F. Naughton. On schema matching withopaque column names and data values. In SIGMOD ’03: Proceedingsof the 2003 ACM SIGMOD international conference on Managementof data, pages 205–216, New York, NY, USA, 2003. ACM Press.
[12] Jayant Madhavan. Learning mappings between models of data. Re-search Proposal, 1999.
[13] Jayant Madhavan, Philip A. Bernstein, AnHai Doan, and Alon Halevy.Corpus-based schema matching. In ICDE ’05: Proceedings of the 21stInternational Conference on Data Engineering (ICDE’05), pages 57–68, Washington, DC, USA, 2005. IEEE Computer Society.
[14] Jayant Madhavan, Philip A. Bernstein, and Erhard Rahm. Genericschema matching with cupid. In VLDB ’01: Proceedings of the 27thInternational Conference on Very Large Data Bases, pages 49–58, SanFrancisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[15] Alexander Maedche, Boris Motik, Nuno Silva, and Raphael Volz. Mafra- a mapping framework for distributed ontologies. In EKAW ’02: Pro-ceedings of the 13th International Conference on Knowledge Engineer-ing and Knowledge Management Ontologies and the Semantic Web,pages 235–250, London, UK, 2002. Springer-Verlag.
[16] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Similarityflooding: A versatile graph matching algorithm and its application toschema matching. In ICDE ’02: Proceedings of the 18th InternationalConference on Data Engineering, page 117, Washington, DC, USA,2002. IEEE Computer Society.
[17] Sergey Melnik, Erhard Rahm, and Philip A. Bernstein. Rondo: a pro-gramming platform for generic model management. In SIGMOD ’03:Proceedings of the 2003 ACM SIGMOD international conference onManagement of data, pages 193–204, New York, NY, USA, 2003. ACMPress.
[18] Boris Motik, Alexander Maedche, and Raphael Volz. A con-ceptual modeling approach for semantics-driven enterprise applica-tions. In On the Move to Meaningful Internet Systems, 2002 -DOA/CoopIS/ODBASE 2002 Confederated International Conferences
81
Chapter 6. Conclusion & Future Work
DOA, CoopIS and ODBASE 2002, pages 1082–1099, London, UK,2002. Springer-Verlag.
[19] Renate Motschnig-Pitrik and Jens Kaasboll. Part-whole relationshipcategories and their application in object-oriented analysis. IEEETransactions on Knowledge and Data Engineering, 11(5):779–797,1999.
[20] Natalya Fridman Noy and Mark A. Musen. Prompt: Algorithm andtool for automated ontology merging and alignment. In Proceedings ofAAAI/IAAI, 2000.
[21] Rachel Pottinger and Phil Bernstein. Merging models based on givencorrespondences. U of Washington. Technical Report UW-CSE-03-02-03, 2003., 2003.
[22] Erhard Rahm and Philip A. Bernstein. A survey of approaches toautomatic schema matching. The VLDB Journal, 10(4):334–350, 2001.
[23] Erhard Rahm and Honghai Do. Data cleaning: Problems and currentapproaches. IEEE Bulletin of the Technical Committee on Data Engi-neering, 23(4), 2000.
[24] Xiao Renguo, Tharam S. Dillon, J. Wenny Rahayu, Elizabeth Chang,and Narasimhaiah Gorla. An indexing structure for aggregation rela-tionship in oodb. In DEXA ’00: Proceedings of the 11th InternationalConference on Database and Expert Systems Applications, pages 21–30,London, UK, 2000. Springer-Verlag.
[25] C. J. Van Rijsbergen. Information Retrieval, 2nd edition. Dept. ofComputer Science, University of Glasgow, 1979.
[26] Nikos Rizopoulos. Automatic discovery of semantic relationships be-tween schema elements. In ICEIS ’04: Proceedings of the 6th Interna-tional Conference on Enterprise Information Systems, 2004.
[27] Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan, and Arnon S.Rosenthal. Tuning schema matching software using synthetic scenarios.In VLDB ’05: Proceedings of the 31st international conference on Verylarge data bases, pages 994–1005. VLDB Endowment, 2005.
[28] Gerd Stumme and Alexander Maedche. FCA-MERGE: Bottom-upmerging of ontologies. In IJCAI ’01: Proceedings of the 7th inter-
82
Chapter 6. Conclusion & Future Work
national conference on Artificial Intelligence, pages 225–230, Seattle,WA, USA, 2001.
[29] Department of Computer Science University of Illinois. Illinois semanticintegration archive. http://anhai.cs.uiuc.edu/archive, 2002.
[30] Michael Uschold and Michael Gruninger. Ontologies and semantics forseamless connectivity. SIGMOD Record, 33(4):58–64, 2004.
[31] Wensheng Wu, Clement Yu, AnHai Doan, and Weiyi Meng. An inter-active clustering-based approach to integrating source query interfaceson the deep web. In SIGMOD ’04: Proceedings of the 2004 ACM SIG-MOD international conference on Management of data, pages 95–106,New York, NY, USA, 2004. ACM Press.
[32] Ling Ling Yan, Renee J. Miller, Laura M. Haas, and Ronald Fagin.Data-driven understanding and refinement of schema mappings. InSIGMOD ’01: Proceedings of the 2001 ACM SIGMOD internationalconference on Management of data, pages 485–496, New York, NY,USA, 2001. ACM Press.
83
Top Related