Dynamic Integration and Query Processing with Ranked Role Sets

Peter Scheuermann*
Dept. of EECS, Northwestern University
2145 Sheridan Road
Evanston, Illinois 60208-3118
Email: [email protected]

Wen-Syan Li†
CIMIC, Rutgers University
180 University Ave
Newark, NJ 07102
Email: [email protected]

Chris Clifton‡
The MITRE Corporation
K308, 202 Burlington Road
Bedford, MA 01730
Email: [email protected]
Abstract

The role-set approach is a new conceptual framework for data integration in multidatabase systems
that maintains the materialization autonomy of local database systems and provides users with more
accurate information. The role-set approach presents the answer to a query as a set of relations
representing the distinct intersections between the relations that correspond to the various roles
played by an entity. In this paper we show how the basic role-based approach can be extended in the
absence of information about the multidatabase keys (global IDs). We propose a strategy based on
ranked role-sets that makes use of a semantic integration procedure based on neural networks to
determine candidate global IDs. The data integration and query processing steps then produce a
number of role-sets, ranked by the similarity of the candidate IDs.
1 Introduction
The capability to make a large number of databases interoperable has become a crucial element in the
development of new information systems. The number of databases that may potentially cooperate in
a given organization can be very large; on the other hand a particular application is most likely to use
only a small subset of these databases. Multidatabase or federated database systems provide for the
interoperability of autonomous database systems without requiring their global integration [MH80, LMR90,
SL90, SC94, PS95].
In order to answer queries in multidatabase systems three distinct processes need to be performed by
the user, database administrator, and/or system, as shown in Figure 1. Schema Integration includes a
possible schema transformation step, followed by correspondence identification and an object integration

* The author's work is supported by NSF grant IRI-9303583 and NASA grant NAG2-846.
† This material is based upon work supported by the National Science Foundation under Grant No. CCR-9210704. The
work described in this paper was performed when the author was obtaining his Ph.D. at Northwestern University, Dept. of
EECS.
‡ This material is based upon work supported by the National Science Foundation under Grant No. CCR-9210704. The
views and opinions in this paper are those of the author and do not reflect MITRE's position.
[Figure 1 shows the three processes: users issue a multidatabase query; Query Processing sends
sub-queries to the component databases 1..N and collects the returned tuples; Data Integration merges
the results using the Global ID; DBAs perform Schema Integration to produce the integrated
schema/mapping.]

Figure 1: Multidatabase Query Processing
and mapping construction step [PS95]. A major subproblem in the correspondence identification step is
semantic integration: determining which attributes are equivalent between the databases [LC94]. In Query
Processing, global queries are reformulated into sub-queries, the sub-queries are executed at the local sites,
and their results are assembled at a final site. Data Integration is complementary to Query Processing,
i.e., it determines how the values from different local databases should be merged or presented at the final
site. In fact, Query Processing is impacted by the approaches chosen for both Schema Integration and
Data Integration. For example, if structural differences are resolved via generalization [HDG84] and data
integration is performed via aggregate operators, local selections in query processing become
very expensive.
In [SC94] we introduced a new conceptual framework for data integration in multidatabase systems
that maintains the materialization autonomy of the local database systems involved and provides users
with more accurate information if they so desire. The role-set approach is based on the observation that
many conflicting data values are not actually inconsistencies (as assumed in [AKWS95]), but values that
correspond to the same real-world object appearing in multiple roles. The role-set method presents the
answer to a query as a set of relations representing the distinct intersections between the relations corre-
sponding to the various roles. A basic assumption of the role-set approach is that a multidatabase key
(global ID) is known that can serve as a global object identifier to relate the object instances corresponding
to a real-world entity. In this paper we show how the basic role-based approach can be extended in the
absence of information about the multidatabase keys (global IDs). We propose a strategy based on ranked
role-sets that makes use of a semantic integration procedure based on neural networks (Semint [LC95])
to determine candidate global IDs with different degrees of similarity. The Data Integration and Query
Processing steps then produce multiple role-sets, ranked by the similarity of the candidate global IDs.
1.1 The Problem
We illustrate the problem and our approach using the following example: Assume that our multidatabase
integrates two local schemas, namely FACULTY (SS#, Faculty Name, Salary) and STUDENT(Stud ID,
Student: (Stud_ID, Stud_Name, Stipend, Tel#)
Faculty: (SS#, Facu_Name, Salary)

[Stud_ID matches SS# with similarity 0.98 and is the likely Global ID; Tel# matches with similarity
only 0.21.]

Figure 2: Possible Global IDs in Faculty and Student
Stud Name, Stipend, Tel#), as shown in Figure 2. Suppose that we want to find people with salaries
greater than $30,000. The salary of a student instructor may come from two sources: faculty salary and
student stipend. Since a student instructor appears in both the FACULTY and STUDENT roles, the role-set
approach presents the answer to the query as three distinct intersections between the two schemas: the set
of those qualifying persons who appear only in the role FACULTY; the set of qualifying persons only in the
role STUDENT; and the set of persons who play both roles, FACULTY ∩ STUDENT. However, to create
the intersection FACULTY ∩ STUDENT, the user has to know the global ID that defines which items in
FACULTY and STUDENT refer to the same person. If a complete integration schema is not available and
the user doesn't have this knowledge, he may try to do this intersection on the global ID (Faculty.SS#,
Student.Tel#), since Tel# looks similar to SS#. The user then receives the surprising result of an empty
set.
Instead, using the ranked role-set approach the query first invokes an automated semantic
integration tool (Semint) [LC95] that produces a list of attributes in the STUDENT database that are
likely to correspond to FACULTY.SS#: STUD ID (with a similarity of 0.98) and Tel# (with a similarity
of 0.21). The Query Processing process will then issue two multidatabase queries that use (Faculty.SS#,
Student.Stud ID) and (Faculty.SS#, Student.Tel#) as global IDs, respectively. The user will be presented
with two role sets, ranked by their degrees of confidence. Here we rank the degree of confidence of a
result based on the similarity of the corresponding attributes used for the global ID. The first role set,
constructed using (Faculty.SS#, Student.Stud ID) as the global ID, consists of three intersections: FACULTY
only, STUDENT only, and FACULTY ∩ STUDENT, with a high degree of confidence (0.98). The second role
set, constructed using (Faculty.SS#, Student.Tel#) as the global ID, consists of only two intersections: FACULTY
only and STUDENT only, as the FACULTY ∩ STUDENT set is empty. This result has a low degree of
confidence (0.21). The second query, using (Faculty.SS#, Student.Tel#), could have been avoided if the
user had specified a threshold for the similarity of attribute pairs.
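The ranking step above can be sketched in a few lines. This is a minimal, assumption-laden sketch
(in-memory relations as Python dicts, illustrative function names, made-up SS# and Tel# values), not
the paper's actual query processor:

```python
# Build one role set per candidate global ID and rank by Semint similarity.
def ranked_role_sets(faculty, student, candidates):
    """candidates: [(student_attr, similarity), ...] matching Faculty.SS#."""
    results = []
    for attr, sim in sorted(candidates, key=lambda c: -c[1]):
        stu_ids = {t[attr] for t in student}
        fac_ids = {t["SS#"] for t in faculty}
        results.append({
            "global_id": ("Faculty.SS#", "Student." + attr),
            "confidence": sim,
            "FACULTY only": fac_ids - stu_ids,
            "STUDENT only": stu_ids - fac_ids,
            "FACULTY ∩ STUDENT": fac_ids & stu_ids,
        })
    return results

# A student instructor appearing in both roles (hypothetical values):
faculty = [{"SS#": 111, "Salary": 60000}]
student = [{"Stud_ID": 111, "Tel#": 5551234, "Stipend": 31000}]
candidates = [("Stud_ID", 0.98), ("Tel#", 0.21)]

ranked = ranked_role_sets(faculty, student, candidates)
# The 0.98-confidence role set has a non-empty FACULTY ∩ STUDENT; the
# 0.21-confidence one has an empty intersection, as in the example above.
```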
The end user is able to distinguish between unreasonable and reasonable answers by examining the role
sets ranked with the appropriate degrees of confidence. As a consequence, he will most likely come to the
correct conclusion that (Faculty.SS#, Student.Stud ID) is the global ID. In subsequent queries the user
uses only GID(Faculty.SS#, Student.Stud ID) and no semantic integration needs to be carried out, as the
GID is known. As in [ON94], the user query provides the context for semantic integration. The user effort
involved in the integration is limited to identifying the databases and relations; the rest of the semantic
integration is automated.
1.2 Paper Organization
The rest of this paper is organized as follows. In Section 2 we review related work in multidatabase
and federated database systems. Section 3 outlines our semantic integration tool (Semint). Section 4
describes the basic role set approach to data integration and query processing. In Section 5 we provide
a new framework for dynamic integration and query processing based on ranked role sets. In Section 6
we provide example scenarios to illustrate our approach. In Section 7 we discuss the scalability of our
approach and extensions for dealing with partial global IDs. Finally we give our concluding remarks in
Section 8.
2 Related Work
Early work in multidatabase architectures focused on procedures to merge individual schemas into a single
global conceptual schema. The global schema multidatabase approach requires complete integration; the
global schema must map all local schemas to a single global view. The amount of knowledge required
about local schemas, and how to identify and resolve heterogeneity among the local schemas, is a major
problem with this approach. The global schema must be developed before any queries can be issued.
Changes to local schemas must be reflected by corresponding changes in the global schema. This causes
major difficulties in maintaining the global schema. Because of the complexity of a global schema, a small
change to a local schema (e.g., adding or deleting an attribute) may require huge changes to the global
schema. [GMS94] argues that existing a-priori or static integration strategies might provide satisfactory
support for small or static systems, but not for large-scale interoperable database systems operating in a
dynamic environment.
Federated databases [MH80, PS95] require only partial integration; they integrate a collection of loosely
coupled local database systems by supporting interoperability rather than through a complete global schema.
However, although the impact of a change to a local schema may be smaller, any change to the local
schema may require some change to the federated schema. Maintaining these mappings is still difficult.
Multidatabase systems [HDG84, LAZ+89, SC94, BHP94] attempt to resolve the problems described
above by dispensing entirely with the need for a global or partial schema. This approach puts the
integration responsibility on the user by providing him with functionality beyond standard SQL in order
to specify some integration information as part of the query. MSQL [LAZ+89] was the first multidatabase
language proposed as an extension of SQL. The answer to a query in MSQL is a multi-relation: a set of
relations dynamically created by the query that come from different local databases. Extensions to SQL
were also provided in the federated database context [ZSC95]. The universal relation model was used in
[ZSC95] in order to express the metadata in the federated schema, and thus reduce the difficulty of writing
queries in the federated SQL.
Although multidatabase system languages eliminate the need for a global or partial schema, some
important issues in the schema integration process remain to be solved. The three-step schema integration
process described in [PS95] includes, as mentioned before, a pre-integration procedure for putting the local
schemas in a more homogeneous format, correspondence identification, and the actual integration procedure.
Multidatabase languages [LAZ+89, SC94] eliminate the need to perform the last step, but the crucial
correspondence identification is still required in order to specify Global IDs.
The process of identifying corresponding attributes is also referred to as semantic integration and has
been recognized as one of the fundamental problems in database systems interoperability [SK92]. In
semantic integration, attributes (classes of data items) need to be compared pairwise to determine their
equivalence. [LC94] points out that three levels of metadata can be automatically extracted from
the local databases and used subsequently to aid the semantic integration process: attribute names (the
dictionary level), schema information (the field specification level), and data contents and statistics (the
data content level). As GM's efforts have shown, the attribute name metadata was not sufficient for
semantic integration [PB94]; using only this level of metadata, only a few obvious matches were found.
However, similarities in schema information were found to be useful. For example, it was discovered in one
case that attributes of type char(14) were equivalent to those of char(15) (with an appended blank). A
further distinction can be made between manual methods for semantic integration and automated or
semi-automated methods [LC94]. The semantic integration procedure based on neural networks (Semint
[LC95]) is an example of the latter.
The methods used for resolving structural differences during schema integration impact the data in-
tegration process. Data integration is concerned with combining the data values that reflect the same
information for a given entity whose components may come from multiple local databases. Structural dif-
ferences have been resolved by one of the following methods: outerjoins [RR84], generalizations [HDG84],
multiple relations [LAZ+89], role-sets [SC94], flexible relations [AKWS95], or universal relations [ZSC95].
The universal relation approach implies that expensive outerjoins are required for data integration. Gener-
alizations require that aggregate operations are used to resolve inconsistencies. We argue that this is a
violation of materialization autonomy, namely that the views presented by the local database systems are
not preserved in the answers given by the federated or multidatabase system. The approaches based on
multiple relations, role-sets, and flexible relations all extend the concepts of the classical relational model
in order to deal with data inconsistencies. In the role-set approach real-world objects may belong to
different intersections depending upon the number of roles for which they have materializations. In
comparison, in flexible relations all real-world objects are represented uniformly as clusters of tuples that
may appear in one (flexible) relation only.
It is important to further distinguish between static and dynamic methods of integration. Static
schema integration requires that the integration is completed before any queries can be issued against the
multidatabase system. Most of the methods for semantic integration reviewed above are static. Some of
them are highly automated, but the semantic integration must be finished before any queries are written.
We use the term static data integration to refer to those methods that determine in advance the method
to resolve data inconsistencies; hence all inconsistencies are resolved in the same way. Aggregation
applied in the case of generalizations is an example of a static data integration method [HDG84]. On the
other hand, role-sets [SC94] and multiple relations [LAZ+89] qualify as dynamic data integration methods.
The result of a query consists of multiple relations or sets; furthermore, the number of relations in which
a real-world object appears, or the particular intersection of the role-set to which it belongs, is determined
dynamically by the number of distinct materializations (roles) that it possesses.
In this paper we combine role-sets with Semint [LC95] in order to perform semantic and data integration
dynamically. This reduces the a-priori or static effort required for schema integration in multidatabase
systems. The semantic integration is performed incrementally, only when needed, so no global
integration is required. This process is performed as a by-product of query processing, with the user
query providing the context for the semantic integration. Specifically, the query result presented to the
user is a sequence of ranked role-sets, one for each candidate global ID. The user examines these role-sets
in order to determine which makes most sense and hence determines the global IDs dynamically.
We will now give more details on Semint and the role-set approach and then discuss our dynamic
integration method.
3 Semantic Integration with Neural Networks
Neural networks have emerged as a powerful pattern recognition technique. They are useful in a wide
range of applications. Unlike traditional programs, neural networks are trained, not programmed. Neural
networks act on data by detecting underlying organizations or clusters. For example, input characters
can be grouped by detecting how they resemble one another. The networks learn the similarities among
patterns directly from instances of them. This means neural networks can infer classifications without
prior knowledge of regularities.
Traditional algorithmic approaches are best for tasks where exact rules are easy to define and perfect
accuracy is critical. Neural networks, on the other hand, have advantages when dealing with imperfect
data, classifying data without obvious rules, and discovering relationships between data. We feel that
neural networks are more suitable than traditional algorithms for determining the semantic equivalence
between a pair of attributes since:

- The availability of metadata and the semantics of terms may vary, and the relationship between two
attributes is usually fuzzy;

- It is hard to define and assign probabilities to rules for comparing aspects of two attributes. The
knowledge of how to match attributes needs to be discovered directly from data, not pre-programmed;
and

- Pre-defined rules and probabilities that work for one pair of databases may not work for other pairs
of databases. They need to be adjusted dynamically.
Semint (SEMantic INTegrator) [LC95] is a system for semantic integration based on neural network
techniques. Figure 3 outlines the semantic integration procedure in Semint. In this procedure, DBMS-
specific parsers extract metadata: schema information (such as data types, length, scale, precision, and
the existence of constraints such as primary keys, foreign keys, candidate keys, value and range constraints,
disallowed null values, and access restrictions) and data content statistics from a small portion of sample
data (such as maximum, minimum, average (mean), variance, coefficient of variance, existence of null
[Figure 3 shows the pipeline: DBMS-specific parsers extract database information (data contents and
schema); a classifier clusters the attributes and generates training data (cluster centers); networks are
trained to recognize these patterns; the trained networks determine the similarity between attributes;
and users check and confirm the resulting equivalent attributes.]

Figure 3: Overview of the Semantic Integration Procedure in Semint
[Figure 4(A) depicts the classifier: N input nodes (characteristics such as length, data type, value
constraint, key, average) feed M output category nodes; "Employee.id#", "Dept.employee", and
"Payroll.SSN" fall into one category. Figure 4(B) depicts the back-propagation network, with N input
nodes, (N+M)/2 hidden nodes, and M output nodes; for the attribute Health_Plan.Insured#, category 3
("Employee.id#", "Dept.employee", "Payroll.SSN") scores similarity 0.92, while the other categories
(e.g., Name, Address, Telephone#) score only 0.08, 0.05, and 0.03.]

Figure 4: (A) Classifier in Semint (B) Back-Propagation Neural Network Result
values, and existence of decimals). The metadata is then transformed into a single format (so it can
be compared). If a database contains 10 attributes and we extract 20 characteristics to describe the
"signatures" of these attributes, the parser output has 10 vectors, where each vector has 20 values in the
range [0..1]. The details of the normalization process are described in [LC94]. Then, a classifier is used
to cluster attributes into categories within a single database. The classifier output is used to train a neural
network to recognize these categories; this trained network can then determine similar attributes between
databases.
The only human input is to specify DBMS types and database connection information and to examine
and confirm the output results. The other processes can be fully automated. Users are shown corresponding
attributes with similarity greater than a threshold set by the users. The users can further specify the
maximum number of similar attributes to be retrieved (e.g., the top 10 pairs with similarity > 0.8). Note
that the training of a neural network needs to be done only once; the actual use of the network to determine
attribute correspondences is very efficient and can be done almost instantly.
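A toy signature for a single attribute can make the "vector of values in [0..1]" concrete. This sketch
uses a handful of the metadata fields named in the text; real Semint extracts on the order of 20
characteristics, and the normalization choices here (e.g., the 255-character length cap) are illustrative
assumptions, not the system's actual formulas:

```python
# Map one attribute's metadata to a normalized feature vector in [0, 1].
def signature(meta, max_len=255):
    return (
        1.0 if meta["type"] == "numeric" else 0.0,   # data type flag
        min(meta["length"] / max_len, 1.0),          # normalized field length
        1.0 if meta["primary_key"] else 0.0,         # key constraint flag
        1.0 if meta["nullable"] else 0.0,            # nullability flag
        meta["null_fraction"],                       # data-content statistic
    )

# Hypothetical metadata for an SSN-like attribute:
ssn = signature({"type": "numeric", "length": 9, "primary_key": True,
                 "nullable": False, "null_fraction": 0.0})
# Comparable vectors let attributes from different DBMSs be clustered.
```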
Semint uses the Self-Organizing Map algorithm, an unsupervised learning algorithm, as the classifier to
categorize attributes within a single database. We have adapted this algorithm so that users can determine
how fine these categories are by setting the radius of clusters rather than the number of categories; if
desired, users may examine the output and adjust this radius to cluster like attributes together. The
output of the classifier is the set of vectors of cluster center weights. As shown in Figure 4(A), "Employee.id#",
"Dept.employee", and "Payroll.SSN" are clustered into one category since their input characteristics (and
real-world meanings) are close to each other.
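The effect of the radius parameter can be illustrated with a toy stand-in for the classifier. A real
Self-Organizing Map is considerably more involved; this greedy radius-based clustering (with made-up
3-dimensional signatures) only shows how a cluster radius, rather than a cluster count, controls the
granularity of the categories:

```python
import math

def cluster_by_radius(signatures, radius):
    """Assign each (name, vector) to the first cluster whose center lies
    within `radius` (Euclidean distance); otherwise start a new cluster."""
    clusters = []  # each cluster: {"center": vector, "members": [names]}
    for name, vec in signatures.items():
        for c in clusters:
            if math.dist(vec, c["center"]) <= radius:
                c["members"].append(name)
                break
        else:
            clusters.append({"center": vec, "members": [name]})
    return clusters

# Hypothetical normalized signatures; the ID-like attributes are close:
signatures = {
    "Employee.id#":  (0.90, 0.10, 1.0),
    "Dept.employee": (0.88, 0.12, 1.0),
    "Payroll.SSN":   (0.91, 0.09, 1.0),
    "Name":          (0.20, 0.80, 0.0),
}
clusters = cluster_by_radius(signatures, radius=0.1)
# The three ID-like attributes fall into one category, "Name" into another;
# a larger radius would merge them, a smaller one would split them.
```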
The output of the classifier (M vectors) is then used as training data for a back-propagation network,
a supervised learning algorithm. The "supervision" is that target results are provided; however, as these
target results are the output of the classifier, no user supervision is needed. The similarities are a measure
of how close the vector describing an input attribute is to each of the vectors of the training data (cluster
centers). After the back-propagation network is trained, we present it with a new attribute (a vector) such
as "Health_Plan.Insured#" which is not in the training data. The trained neural network determines the
similarity between "Health_Plan.Insured#" and each of the M categories. In Figure 4(B) the network shows
that the input pattern "Insured#" is closest to category 3 ("Employee.id#", "Dept.employee", and
"Payroll.SSN") with similarity 0.92, and is not similar to the other categories, whose similarities are low.
The "distance function" for closeness is not pre-defined, but is learned directly from the database semantics
during the training process, and will vary depending on the information contained in the database (allowing
Semint to adjust itself to different database domains). Therefore similarity does not correspond to a
percentage or fixed distance function, but is a domain-specific value that can be used to rank the likelihood
that two attributes reflect the same real-world information.
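In effect, the trained network scores a new attribute vector against each cluster center. As a rough
stand-in with no actual back-propagation training, the sketch below converts Euclidean distance into a
[0, 1] score with an arbitrary 1/(1+d) formula; as the text stresses, Semint's learned similarities are
domain-specific and need not match any such fixed function:

```python
import math

def similarities(vector, centers):
    """Return {category: score} with score = 1 / (1 + Euclidean distance)."""
    return {name: 1.0 / (1.0 + math.dist(vector, center))
            for name, center in centers.items()}

# Hypothetical cluster centers from the classifier step:
centers = {
    "category 3 (ID-like)": (0.90, 0.10, 1.0),
    "category 1 (names)":   (0.20, 0.80, 0.0),
}
insured = (0.89, 0.11, 1.0)  # assumed signature of Health_Plan.Insured#
scores = similarities(insured, centers)
best = max(scores, key=scores.get)
# "category 3 (ID-like)" scores highest, mirroring Figure 4(B).
```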
The performance of Semint on databases with fewer than 40 attributes is satisfactory (on
an IBM RS/6000 Model 220): less than 0.1 second of CPU time to classify attributes, less than 7 seconds
to train the neural networks, and less than 0.1 second to determine the similarity. Recall is excellent
(100%), with precision ranging from 90% to 100%. For detailed experimental results, please see [LC94]. In
Section 7.1 we further discuss the scalability of using neural networks on large databases with hundreds of
attributes (an environment where manual integration is almost an impossible task).
4 Role-Set Based Query Processing in Multidatabase Systems
One challenge in multidatabase query processing is merging intermediate results from heterogeneous local
databases. Because of local database autonomy, we are not able to change the data structures of local
databases or force local databases to prepare their results according to a certain uniform format or
structure. Therefore, data integration is necessary and needs to be carried out at the global site before
presenting final results to users. We argue that many of the so-called inconsistencies that appear in data
integration are not real inconsistencies, but reflect the fact that we are dealing with different values that an
attribute can take for distinct roles of the same real-world entity. A new concept of data integration based
on role sets was first proposed in [SC94]. Using the role-set approach a user has the option to specify how
the system should resolve inconsistencies, e.g., whether users want to see aggregate values for an attribute
or all the values derived by the individual systems.
4.1 Data Integration with Role Sets
In order to illustrate these concepts we consider the FACULTY and STUDENT example as shown in
Figure 5. The object with Global ID Z appears in both roles, while the other objects appear in only
[Figure 5 shows the two roles: FACULTY contains the objects (X, 50K), (Y, 35K), and (Z, 60K);
STUDENT contains (W, 25K) and (Z, 30K).]

Figure 5: FACULTY and STUDENT Roles
[Figure 6 contrasts three presentations of the answer. (A) Generalization approach: a single relation
with aggregated salaries; Z appears once with the averaged value 45K. (B) Multiple relations approach:
one relation per role, FACULTY = {(X, 50K), (Y, 35K), (Z, 60K)} and STUDENT = {(Z, 30K), (W, 25K)}.
(C) Role-set approach: three intersections, FACULTY ONLY = {(X, 50K), (Y, 35K)}, STUDENT ONLY =
{(W, 25K)}, and FAC ∩ STU = {(Z, 60K), (Z, 30K)}.]

Figure 6: Approaches for Resolving Structural Differences
one role. Let us now consider the query "retrieve all persons with a salary greater than $30,000." When
schema/data integration is performed according to the generalization approach described in [HDG84], the
federated database is viewed as consisting of the generalized entity set (e.g., PERSON(ID, Salary)). In
addition, the inconsistency in data values is resolved by defining an aggregate function, such as average,
over the overlapping data values, as shown in Figure 6(A). An alternative approach has been taken in
[LAZ+89], where in response to the above query the user is presented with multiple relations, one for each
role, as shown in Figure 6(B). The problem with this approach is that tuples corresponding to real-world
objects that appear in both FACULTY and STUDENT roles, as happens with the object with GID=Z,
are scattered through both tables and the answer is hard to visualize.
Using the role-set approach, the user has the option to specify that the answer should be presented as
a set of sets (role-set) representing the distinct intersections between the various roles, as shown in Figure
6(C). Notice that an intersection such as FACULTY ∩ STUDENT in Figure 6(C) contains, for each real-
world object, all the tuples representing the distinct materializations where that object appears. Hence, this
set does not correspond directly to a relation in the traditional sense, since it contains tuples corresponding
to different schemas that are not union-compatible. We denote these entity-based intersections between
relations R1, R2, ..., Rn, corresponding to distinct roles, as a role-set, Role-set(R1, R2, ..., Rn). Figure 7
illustrates conceptually the different elements of Role-set(R1, R2, R3).
[Figure 7 depicts the elements of the role-set as the seven distinct regions of the Venn diagram of R1,
R2, and R3: the part of each Ri outside the other two relations, the three pairwise intersections minus
the third relation, and R1 ∩ R2 ∩ R3.]

Figure 7: Role-set of (R1, R2, R3)
4.2 Role-Set Based Query Formulation

We have defined extensions to MSQL that allow us to specify and manipulate role-sets. A role-set is created
via the MULTISELECT statement as illustrated in the following example:

MULTISELECT * FROM ROLE-SET(X,Y) USING GID

where GID is a global identifier and X and Y are the roles according to which the various entity-based
intersections are created. The result of this statement is conceptually equivalent to executing the following
pseudo-SQL code:

(X−Y) : SELECT * FROM X WHERE GID NOT IN (SELECT GID FROM Y)
(Y−X) : SELECT * FROM Y WHERE GID NOT IN (SELECT GID FROM X)
(X∩Y) : SELECT * FROM X,Y WHERE (X.GID = Y.GID) =
        SELECT * FROM X WHERE GID IN (SELECT GID FROM Y) "+"
        SELECT * FROM Y WHERE GID IN (SELECT GID FROM X)
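The three component queries can be run as ordinary SQL against two local tables to see the partition;
in this sketch sqlite3 stands in for the local databases (the distributed machinery is out of scope), and
the table contents echo the FACULTY/STUDENT example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE X (GID TEXT, Salary  TEXT);
    CREATE TABLE Y (GID TEXT, Stipend TEXT);
    INSERT INTO X VALUES ('X1','50K'), ('Y1','35K'), ('Z','60K');
    INSERT INTO Y VALUES ('W','25K'), ('Z','30K');
""")
x_only = con.execute(
    "SELECT * FROM X WHERE GID NOT IN (SELECT GID FROM Y)").fetchall()
y_only = con.execute(
    "SELECT * FROM Y WHERE GID NOT IN (SELECT GID FROM X)").fetchall()
# The pseudo-union "+" keeps both materializations of each shared object:
x_and_y = (con.execute(
    "SELECT * FROM X WHERE GID IN (SELECT GID FROM Y)").fetchall() +
    con.execute(
    "SELECT * FROM Y WHERE GID IN (SELECT GID FROM X)").fetchall())
# x_only holds X1 and Y1, y_only holds W, and x_and_y holds Z twice
# (once per role), matching the role-set partition of Figure 6(C).
```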
Note that "+" stands for a pseudo-union that contains all the materializations of a real-world object. To this
role-set we can now apply modified select, project, and aggregate operations, as well as the r-join operation
that performs joins between elements of the role-set. At the end, after all operations are performed, the
result is transformed for presentation to the user in standard relational form, i.e., in the case of (X∩Y),
it consists of tuples of the form (GID, X-attributes, Y-attributes). The modified existential and universal
quantifiers allow us to perform selection from the role-set in the following fashion. A WHERE clause with
an ∃ condition will select all materializations of a real-world object in an intersection R1 ∩ R2 ∩ ... ∩ Rn as long
as one of them satisfies the selection criteria. On the other hand, a clause with the ∀ condition implies
that the select condition must be satisfied by all the materializations of an object.
Example. Assume that relations R and S represent two different roles of an object and they contain tuples
with attributes (GID, Salary). The current extensions are R = {<1, 50K>, <2, 25K>, <3, 12K>} and
S = {<1, 25K>, <3, 30K>}. The intersection R ∩ S is performed with respect to GIDs: (R ∩ S)ENT
= {<1, 50K>, <1, 25K>, <3, 12K>, <3, 30K>}. A select(∃) with respect to the qualification
Salary > 20K will return the set (R ∩ S)ENT, since for each of the two objects at least one materialization
qualifies. On the other hand, a select(∀) with respect to the same qualification will return the set (R ∩ S)'
= {<1, 50K>, <1, 25K>}, since one of the materializations of the object with GID=3 does not qualify. □
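The quantified selections of the example can be sketched as a small helper that groups materializations
by GID (the function and its mode names are ours, for illustration only):

```python
def select_quantified(intersection, pred, mode):
    """Group materializations by GID; keep a group if `pred` holds for
    any member (mode="exists", ∃) or for all members (mode="forall", ∀)."""
    groups = {}
    for gid, val in intersection:
        groups.setdefault(gid, []).append((gid, val))
    test = any if mode == "exists" else all
    return [t for g in groups.values()
            if test(pred(v) for _, v in g) for t in g]

r_and_s = [(1, 50), (1, 25), (3, 12), (3, 30)]  # (GID, Salary in $K)
exists_sel = select_quantified(r_and_s, lambda s: s > 20, "exists")
forall_sel = select_quantified(r_and_s, lambda s: s > 20, "forall")
# exists_sel keeps all four tuples; forall_sel keeps only GID 1's tuples,
# since GID 3's 12K materialization fails the predicate.
```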
4.3 Role-Set Based Query Processing
A major problem in query optimization in multidatabase systems is the lack of auxiliary access paths at
the global site, where most relations handled are intermediate results. Hence it is important to devise
an efficient query processing strategy for dealing with intermediate data. The intermediate data to be
generated at the global site consists of a private part and overlap parts, containing the identifiers of the
corresponding tuples in the intersections defined by the role-set (plus optional aggregation attributes). In
the generalization approach of Hwang et al. [HDG84], n semi-outerjoin operations need to be performed
at the global site to find the private and overlap parts of n relations. In addition, their strategy suffers
from high communication costs, since local selection is possible only for the private parts of the query and,
in the case of aggregate queries, no local selection is possible at all.
We have developed a new strategy for query processing based on our role-based model [SC94] that aims
at minimizing the amount of data to be transmitted between the local sites and the global site, as well as
reducing the processing costs required at the global site for dealing with intermediate data. Our strategy
makes effective use of merge-sort/scan to produce in one iteration the private part and the various overlap
parts of the query. The private part of the query consists of the set of GIDs that appear in only one role
and satisfy the query, while the overlap parts correspond to intersections containing GIDs in multiple roles.
Thus for Role-set(R1, R2, R3) the private part consists of the GIDs in (R1 − R2 − R3) ∪ (R2 − R1 − R3)
∪ (R3 − R1 − R2) that satisfy the query, while the overlap parts are the GIDs in (R1 ∩ R2 − R3), ...,
(R1 ∩ R2 ∩ R3) that satisfy the query.
The basic role-based query processing algorithm is outlined below:
Step 1: Local sites send set of GIDs and subset of GIDs selected.
Step 2: The global site (GS) performs a merge-sort for each role Ri and one merge-scan to produce the
private part and overlap parts simultaneously. GS sends the private part and overlap parts to each
local site.
Step 3: Local sites perform projection for queries without aggregation (projection and selection for
aggregation) and send target attributes (and optional r-join attributes) to GS.
Step 4: GS merges data from various sites into corresponding intersections (and merge-sorts it for an
r-join).
Step 5: Any optional r-joins are executed.
Note: By global site we mean that site in the federation that receives the query from the user. Any site
in the federation can act as the global site for a particular query.
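The partition computed in Step 2 can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: given the GID sets sent by the local sites, it produces the private part and one overlap part per intersection.

```python
def partition_gids(roles):
    """Partition GIDs into a private part (GIDs appearing in exactly one
    role) and overlap parts (one per combination of two or more roles).
    `roles` maps a role name to its set of GIDs; the paper's
    merge-sort/scan computes the same partition over sorted GID lists."""
    membership = {}
    for name, gids in roles.items():
        for gid in gids:
            membership.setdefault(gid, set()).add(name)
    private = {gid for gid, m in membership.items() if len(m) == 1}
    overlaps = {}
    for gid, m in membership.items():
        if len(m) > 1:
            overlaps.setdefault(frozenset(m), set()).add(gid)
    return private, overlaps

roles = {"R1": {1, 2, 5}, "R2": {2, 3, 5}, "R3": {4, 5}}
private, overlaps = partition_gids(roles)
# private == {1, 3, 4}; GID 2 falls in the R1/R2 overlap, GID 5 in R1/R2/R3
```

For Role-set(R1, R2, R3) this yields one private part plus up to four overlap parts; the merge-scan of Step 2 produces all of them in a single iteration.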
5 Dynamic Integration with Ranked Role Sets
Dynamic data integration methods, such as role-sets, still require that a Global ID be known to serve
as the local and global object identifier. We present an approach to dynamic data integration and query
processing with ranked role sets in which we drop the assumption that a global ID is known in advance.
The basic role-set approach can be extended in the absence of information about the multidatabase keys
(global IDs). We see three scenarios for dynamic integration, depending on the level of knowledge of global
IDs that a user may have when issuing a query:
1. All databases are well understood and a global ID is known if it exists.
2. A subset of the databases is well understood. A partial global ID is known across these databases;
however, the corresponding attributes in the other databases need to be determined to come up with a
global ID.
3. No database is well understood; global IDs need to be determined "from scratch".
5.1 Query Language
Our approach works in all three scenarios, so users can submit a multidatabase query whether or not a
Global ID is known. This requires extensions to the role-set MSQL to allow specifying a Global ID, or
requesting that one be determined automatically. The extensions are as follows:
MULTISELECT attribute_names
FROM ROLE-SET ( relation_1, relation_2, ..., relation_n )
USING GID ( relation_1.(attribute | *) = relation_2.(attribute | *) = ... =
            relation_n.(attribute | *) [ WITH SIMILARITY > threshold ]
            [ UP TO m SETS ] )
The "MULTISELECT" clause specifies which attributes to retrieve. The "FROM ROLE-SET" clause
specifies the roles according to which the various entity-based intersections are created (as described in
Section 4.2). The "USING GID" clause specifies what is known about Global IDs, and what needs to be
determined. If the GID is known in advance (the first scenario), all attributes are specified.
Data integration and query processing when the GID is known are as described in Section 4. In the
second scenario (one attribute is specified as "*") and the third scenario (both attributes are specified as
"*"), semantic integration needs to be carried out to generate attribute correspondences as candidate
global IDs. The clauses "WITH SIMILARITY > threshold" and "UP TO m SETS" are optional, and
are used to restrict the number of potential GID candidates (e.g., "WITH SIMILARITY > 0.9 UP TO 5
SETS").
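A minimal sketch of how the extended USING GID clause might be parsed. The function name, the regular expressions, and the 0.8 default threshold are our assumptions for illustration, not part of the paper:

```python
import re

def parse_using_gid(clause):
    """Split a USING GID clause into (relation, attribute) terms, the
    similarity threshold, and the set limit. An attribute of '*' marks
    a term whose corresponding attribute Semint must determine."""
    terms = re.findall(r"([A-Za-z]\w*)\.([\w#]+|\*)", clause)
    m = re.search(r"WITH SIMILARITY\s*>\s*([\d.]+)", clause)
    threshold = float(m.group(1)) if m else 0.8   # assumed default
    m = re.search(r"UP TO\s+(\d+)\s+SETS", clause)
    limit = int(m.group(1)) if m else None
    return terms, threshold, limit

terms, threshold, limit = parse_using_gid(
    "Faculty.SS# = Student.* WITH SIMILARITY > 0.8 UP TO 3 SETS")
# terms == [('Faculty', 'SS#'), ('Student', '*')], threshold 0.8, limit 3
```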
5.2 Dynamic Integration and Query Processing Procedures
The overview of our framework architecture is shown in Figure 8. The procedure is outlined below:
Pre-multidatabase-query process (semantic integration)
Step 1: The user submits a multidatabase query to retrieve semantically similar data items. The "USING
GID" clause specifies the type of global ID assumption.
[Figure: users submit a multidatabase query; the Semantic Integration, Query Processing, and Data
Integration components exchange attribute correspondences and ranked role sets with component
databases 1 through N]
Figure 8: Dynamic Integration in Query Processing with Ranked Role Sets
Step 2: If the global ID is unknown, the Semantic Integration Process (Semint) at the global site extracts
metadata from the local component databases.
Step 3: Semint uses the metadata extracted in Step 2 to generate attribute correspondences as candidate
GIDs according to the user-specified or default similarity threshold.
Multidatabase Query Processing and Data Integration
Step 4: The Multidatabase Query Processor re-formulates the original multidatabase query into multiple
multidatabase queries according to the attribute correspondences. One multidatabase query is generated for
each candidate GID from the attribute correspondences whose similarity is greater than the threshold.
Step 5: The multidatabase query processor generates sub-queries for each multidatabase query generated
at Step 4 and then submits sub-queries to local component databases.
Step 6: The local component databases return the result tuples of the sub-queries executed at the local
sites to the originating site.
Step 7: The Data Integration process merges the intermediate results from the various local sites by consulting
the attribute correspondences. The results are presented to the users as role sets with degrees of
confidence (ranked role sets). The degree of confidence of a ranked role set is based on the similarity of
the attribute correspondence used as a GID. One set of role sets is generated for each pair of corresponding
attributes.
Note: Steps 2 and 3 can be done in anticipation of possible queries to improve query response time and
only need to be done once.
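Step 4 of the procedure above can be sketched as follows, using the candidate correspondences of the example in Section 6. `reformulate` is a hypothetical helper for illustration, not part of the system:

```python
def reformulate(candidates, threshold=0.8, max_sets=None):
    """Step 4: one multidatabase query per candidate GID whose similarity
    exceeds the threshold, ordered by similarity so that the resulting
    role sets can later be presented ranked by degree of confidence."""
    usable = sorted((c for c in candidates if c[2] > threshold),
                    key=lambda c: -c[2])
    if max_sets is not None:
        usable = usable[:max_sets]
    return [("MULTISELECT *\n"
             "FROM ROLE-SET(Faculty,Student)\n"
             f"USING GID ({a} = {b})", sim)
            for a, b, sim in usable]

candidates = [("Faculty.SS#", "Student.Stud_ID", 0.98),
              ("Faculty.Facu_Name", "Student.Stud_Name", 0.91),
              ("Faculty.Salary", "Student.Stipend", 0.85)]
queries = reformulate(candidates, threshold=0.8, max_sets=3)
# three queries, with confidences 0.98, 0.91, and 0.85
```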
[ Faculty ] [ Student ]
SS# Facu_Name Salary Stud_ID Stud_Name Stipend Tel#
=========== ========== ======== =========== ========== ======= ========
493-45-8735 John $100,000 476-34-5748 Jason $11,000 674-7456
956-45-0456 Robert $80,000 958-46-3256 Steven $0 765-0945
849-45-0500 Mary $60,000 485-75-2374 Mary $0 767-5134
485-95-6784 Larry $20,000 485-95-6784 Larry $12,000 767-0900
------------------------------- ----------------------------------------
Figure 9: Faculty and Student Databases
6 Example Scenario
In this section, we use some sample queries to demonstrate the dynamic integration and query processing
of our approach. Imagine that we are planning a university budget. We want to know
What are the salaries of student instructors?
The salary of a student instructor may come from two sources: Faculty salary from the University and
student stipend from the Graduate School. The faculty salary information is stored in the Faculty database
and student stipend information is stored in the Student database, as shown in Figure 9. Here we only
list relevant attributes for ease of illustration.
As we discussed in Section 5, we see three scenarios based on database knowledge: all databases are well
understood, so that a global ID is known; only a subset of the databases is well understood (only a partial global
ID is known); and all databases are unknown (global IDs need to be determined by semantic integration).
6.1 Global ID is Known
In the first scenario, the user knows that Faculty.SS# and Student.Stud_ID form a valid global identifier.
The query can be posed as:
MULTISELECT *
FROM ROLE-SET(Faculty,Student)
USING GID (Faculty.SS# = Student.Stud_ID)
The result of this query is shown in Figure 10. Semantic integration does not need to be carried out because
the GID is known. The data integration and query processing are dynamic because users can specify how to
generate the role set to resolve structural differences.
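The Figure 10 answer can be reproduced on the Figure 9 sample data with a few lines of Python. This is a toy sketch; the actual system distributes this work between the local sites and the global site:

```python
def role_sets(faculty, student):
    """Split the two tables into Faculty-only, Student-only, and
    Faculty-and-Student role-sets using (SS#, Stud_ID) as the GID."""
    shared = ({r["SS#"] for r in faculty} &
              {r["Stud_ID"] for r in student})
    return ([r for r in faculty if r["SS#"] not in shared],
            [r for r in student if r["Stud_ID"] not in shared],
            [(f, s) for f in faculty for s in student
             if f["SS#"] == s["Stud_ID"]])

faculty = [{"SS#": "493-45-8735", "Facu_Name": "John",   "Salary": 100000},
           {"SS#": "956-45-0456", "Facu_Name": "Robert", "Salary": 80000},
           {"SS#": "849-45-0500", "Facu_Name": "Mary",   "Salary": 60000},
           {"SS#": "485-95-6784", "Facu_Name": "Larry",  "Salary": 20000}]
student = [{"Stud_ID": "476-34-5748", "Stud_Name": "Jason",  "Stipend": 11000},
           {"Stud_ID": "958-46-3256", "Stud_Name": "Steven", "Stipend": 0},
           {"Stud_ID": "485-75-2374", "Stud_Name": "Mary",   "Stipend": 0},
           {"Stud_ID": "485-95-6784", "Stud_Name": "Larry",  "Stipend": 12000}]
faculty_only, student_only, both = role_sets(faculty, student)
# only Larry (485-95-6784) appears in the Faculty-and-Student role-set
```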
6.2 ID is known for Faculty Database
In this section we discuss how our approach works in the second scenario: only a subset of the databases is
well understood. A partial global ID is known; however, the global ID, if one exists, needs to be determined by
semantic integration before the query can be executed. We are familiar with Faculty; however, we have
little knowledge about Student. We know the two databases should contain some similar data items, such
as salary, social security number, and name. We can submit the following multidatabase query to retrieve the
salaries of student instructors, using Faculty.SS# as part of the global ID. Because the Student
[ Faculty only role-set ] [ Student only role-set ]
SS# Facu_Name Salary Stud_ID Stud_Name Stipend Tel#
=========== ========== ======== =========== ========== ======= ========
493-45-8735 John $100,000 476-34-5748 Jason $11,000 674-7456
------------------------------- ---------------------------------------
956-45-0456 Robert $80,000 958-46-3256 Steven $0 765-0945
------------------------------- ---------------------------------------
849-45-0500 Mary $60,000 485-75-2374 Mary $0 767-5134
------------------------------- ---------------------------------------
[ Faculty and Student role-set ]
Facu_Name Salary
(SS#=Stud_ID) Stud_Name Stipend Tel#
============= ========== ======= ========
485-95-6784 Larry $20,000
Larry $12,000 767-0900
-----------------------------------------
Figure 10: Answer Using (SS#, Stud_ID) as GID with Degree of Confidence = 0.98
database is not well understood, we specify the corresponding attribute in Student as "*". Semint will
then determine the possible corresponding attributes in Student to use with Faculty.SS# as the GID.
The query follows:
MULTISELECT *
FROM ROLE-SET(Faculty,Student)
USING GID (Faculty.SS# = Student.* WITH SIMILARITY > 0.8 UP TO 3 SETS)
The clause "USING GID (Faculty.SS# = Student.*)" causes the Semantic Integration Process to find
candidate corresponding attributes in the Student database to be combined with Faculty.SS# as a global
ID. The clause "WITH SIMILARITY > 0.8" restricts candidate attributes to those that have a degree of
similarity greater than 0.8. The query is processed in the following steps:
Step 1: Semantic Integration. Semint recommends the attribute correspondence as:
(Faculty.SS#, Student.Stud_ID, similarity = 0.98)
Step 2: Query Re-formulation. The "*" is replaced by the corresponding attribute (Student.Stud_ID)
found in the previous step. If multiple corresponding attributes are recommended by
Semint, one multidatabase query is generated for each corresponding attribute.
MULTISELECT *
FROM ROLE-SET(Faculty,Student)
USING GID (Faculty.SS# = Student.Stud_ID)
Step 3: Multidatabase Query Processing and Data Integration with Ranked Role Sets. Because Semint
only recommends one candidate GID, only one set of role sets is presented to the user. The result is
shown in Figure 10.
[ Faculty only role-set ] [ Student only role-set ]
SS# Facu_Name Salary Stud_ID Stud_Name Stipend Tel#
=========== ========== ======== =========== ========== ======= ========
493-45-8735 John $100,000 476-34-5748 Jason $11,000 674-7456
------------------------------- ---------------------------------------
956-45-0456 Robert $80,000 958-46-3256 Steven $0 765-0945
------------------------------- ---------------------------------------
[ Faculty and Student role-set ]
SS# Salary
(Facu_Name=Stud_Name) Stud_ID Stipend Tel#
===================== =========== ======= ========
Larry 485-95-6784 $20,000
485-95-6784 $12,000 767-0900
--------------------------------------------------
Mary 849-45-0500 $60,000
485-75-2374 $0 767-5134
--------------------------------------------------
Figure 11: Result of Using (Facu_Name, Stud_Name) as GID with Degree of Confidence = 0.91
[ Faculty only role-set ] [ Student only role-set ]
SS# Facu_Name Salary Stud_ID Stud_Name Stipend Tel#
=========== ========== ======== =========== ========== ======= ========
493-45-8735 John $100,000 476-34-5748 Jason $11,000 674-7456
------------------------------- ----------------------------------------
956-45-0456 Robert $80,000 958-46-3256 Steven $0 765-0945
------------------------------- ----------------------------------------
849-45-0500 Mary $60,000 485-75-2374 Mary $0 767-5134
------------------------------- ----------------------------------------
485-95-6784 Larry $20,000 485-95-6784 Larry $12,000 767-0900
------------------------------- ----------------------------------------
[ Faculty and Student role-set ]
SS# Facu_Name
Salary=Stipend Stud_ID Stud_Name Tel#
============== ========== ========= ========
--------------------------------------------
Figure 12: Result of Using (Salary, Stipend) as GID with Degree of Confidence = 0.85
6.3 Nothing is known about the Global ID
The query in the previous section would not give a correct result if there were no attribute in Student
corresponding to Faculty.SS#. Other attributes, such as Name, may also be eligible as a global ID. The query
is shown below:
MULTISELECT *
FROM ROLE-SET(Faculty,Student)
USING GID (Faculty.* = Student.*) WITH SIMILARITY > 0.8 UP TO 3 SETS
The clause "USING GID (Faculty.* = Student.*) WITH SIMILARITY > 0.8 UP TO 3 SETS" specifies
that the Semantic Integration Process should use the top 3 sets of corresponding attributes with similarity
greater than 0.8 as candidate global IDs. The query is processed in the following steps:
Step 1: Semantic Integration. Semint recommends the following attribute correspondences:
(Faculty.SS#, Student.Stud_ID, similarity = 0.98)
(Faculty.Facu_Name, Student.Stud_Name, similarity = 0.91)
(Faculty.Salary, Student.Stipend, similarity = 0.85)
Step 2: Query Re-formulation. The "*"s are replaced by the attribute correspondences generated in
Step 1. Semint recommends three pairs of corresponding attributes; therefore, three multidatabase
queries are generated as follows:
MULTISELECT *
FROM ROLE-SET(Faculty,Student)
USING GID (Faculty.SS# = Student.Stud_ID)
MULTISELECT *
FROM ROLE-SET(Faculty,Student)
USING GID (Faculty.Facu_Name = Student.Stud_Name)
MULTISELECT *
FROM ROLE-SET(Faculty,Student)
USING GID (Faculty.Salary = Student.Stipend)
Step 3: Multidatabase Query Processing and Data Integration with Ranked Role Sets. The query results
using the three candidate GIDs, three sets of role sets (shown in Figures 10-12), are then presented to
the user as "ranked role sets". The degree of confidence of a role set is based on the similarity of the
corresponding attributes used as the GID.
In this example, the Data Integration Process generates three sets of results (role-sets) that comprise the
possible answers to the above query: the result using (SS#, Stud_ID) as GID with Confidence = 0.98, the
result using (Facu_Name, Stud_Name) as GID with Confidence = 0.91, and the result using (Salary, Stipend)
as GID with Confidence = 0.85. We can see that the result using Salary and Stipend as the global ID is
clearly incorrect, as no tuples are in the Faculty/Student role-set. We can also tell that Mary is probably not
a student instructor, because she has two different SSNs and her salary is much higher than we would expect
for a student instructor. Therefore (SS#, Stud_ID) is a more likely global ID (the fact that this gives only
tuples where BOTH the name and ID are the same supports this conjecture).
7 Scalability and Extensions
7.1 Neural Network Scalability
We have built neural networks that recognize thousands of attributes in very large databases.
The techniques we use to support the scalability of neural networks are as follows:
- Classification: The classification process is very efficient, requiring less than a second for a thousand
attributes (on a moderately powerful workstation). We use a classifier to categorize similar attributes
into clusters and use cluster centers rather than individual attributes to train the neural networks. This can
reduce the training time by around 80% in our experiments. The reason is that networks cannot be trained
with training data in which there are two different expected outputs for the same input pattern (two
different attributes with the same characteristics). This is also why training takes much longer using
very similar attributes rather than a few well-separated clusters as training data.
- Neural networks are highly efficient at recognizing patterns; it is only training them to recognize
patterns that takes time. For example, on a SUN Sparc HS11, it takes 1 second to test a single
item against a 281-output-node network (281 separate attributes in training). Much of this is setup
time; testing 50 items takes 1.9 seconds. Thus the only substantial computational requirement is
training. However, training needs to be done only once per database, and note that finding attribute
correspondences manually in a database of this size is an almost impossible task.
- We have also designed the neural network to improve training times. Some of the principles of our
neural network architecture design are as follows:
1. A three-layer neural network architecture (one input layer, one output layer, and one hidden layer
in the middle), which is capable of solving all non-linear problems while limiting the computation
time, which increases as more layers are added.
2. The hidden layer consists of (N+M)/2 nodes, where N is the number of nodes in the input
layer (the number of discriminators) and M is the number of nodes in the output layer (the number
of categories in the training data). The number of nodes in the hidden layer can be arbitrary;
however, (N+M)/2 nodes tend to give the shortest training time in our experiments.
We can further reduce the neural network training time by building several smaller neural networks
rather than one large neural network. For example, to train neural networks to recognize
1000 distinct attributes with 20 discriminators, we can build 20 small neural networks (each of which
can recognize 50 distinct attributes). The number of connections within the neural networks is then
substantially reduced, from 20*(20+1000)/2*1000 (10.2 million) to 20*(20+50)/2*50*20 (0.7 million).
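The connection counts above follow the text's own counting rule, N * (N+M)/2 * M (discriminators times hidden nodes times output categories, taken as a product); a quick check:

```python
def connections(n_input, n_output):
    """Connections counted as in the text: N discriminators, a hidden
    layer of (N + M) / 2 nodes, and M output categories, taken as the
    product N * (N + M) / 2 * M."""
    return n_input * (n_input + n_output) // 2 * n_output

one_large = connections(20, 1000)          # 10,200,000 connections
twenty_small = 20 * connections(20, 50)    # 700,000 connections
```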
7.2 Ranked Role-Set Scalability
The following techniques can be used to support the scalability of our approach, reducing the
potentially large number of role-sets that need to be presented to the user when many databases need to
be integrated.
1. The clauses "WITH SIMILARITY > threshold UP TO m SETS" can be used to reduce the number
of ranked role-sets generated. In Section 6.3, we specified the constraint "WITH SIMILARITY > 0.8
UP TO 3 SETS". However, we could also specify "WITH SIMILARITY > 0.9 UP TO 3 SETS",
"WITH SIMILARITY > 0.8 UP TO 2 SETS", or "WITH SIMILARITY > 0.9 UP TO 2 SETS". In
all three cases, only (Faculty.SS#, Student.Stud_ID) and (Faculty.Facu_Name, Student.Stud_Name)
would be considered as possible GIDs.
2. The ranked role-sets can be presented to users in an interactive mode rather than in batch mode. In
the example shown in Section 6.3, three queries are executed at the same time and three role-sets
are presented to the user; this is batch mode. In interactive mode, only one query at a
time is executed (ordered by degree of confidence) and its role-set is presented to the user.
Therefore, the role-set using GID (Faculty.SS#, Student.Stud_ID) is presented to the user first.
If this is the correct answer, no more queries need to be executed; otherwise, the query using GID
(Faculty.Facu_Name, Student.Stud_Name) is executed, and so on.
3. Users can provide additional domain knowledge to eliminate some role-sets. The original query results
can be given to a filtering program before presentation to the user. For example, suppose the user knows that
Larry is a student instructor and provides this information to the system. We can then use it as
a constraint to eliminate the role-set using GID (Faculty.Salary, Student.Stipend), since Larry is not
a student instructor under this GID. Users can also provide information such as that a scholarship
cannot be greater than $40,000; this would eliminate the result using GID (Faculty.Facu_Name,
Student.Stud_Name).
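The interactive mode of item 2 can be sketched as a simple loop. The `execute` and `accept` callbacks are assumptions of this sketch, standing in for the query processor and the user's confirmation:

```python
def interactive_role_sets(candidates, execute, accept):
    """Try candidate GIDs one at a time in decreasing order of
    similarity; stop as soon as the user accepts a role-set, so
    lower-confidence queries are never executed."""
    for a, b, sim in sorted(candidates, key=lambda c: -c[2]):
        result = execute(a, b)
        if accept(result, sim):
            return result, sim
    return None, None

calls = []
result, confidence = interactive_role_sets(
    [("Faculty.SS#", "Student.Stud_ID", 0.98),
     ("Faculty.Facu_Name", "Student.Stud_Name", 0.91)],
    execute=lambda a, b: calls.append((a, b)) or (a, b),
    accept=lambda res, sim: True)   # the user accepts the first answer
# only the 0.98-confidence query ran; the name-based query never executed
```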
7.3 Scenarios of Global ID Existence
What if we have more than two databases? For example, suppose we add a database Staff to the
preceding example. Figure 13(A) shows the basic assumption of the role-set approach: a global ID is
available across all databases. To determine GID availability, databases are compared in a
pair-wise fashion. This can be done either manually or using Semint.
Figure 13(B) shows that there is a partial GID available, since Faculty.SS# matches Student.Stud_ID and
Student.Stud_ID matches Staff.SSN. We know these relationships from manual semantic integration
or from Semint (using the Student database to train a neural network and then using the Faculty and Staff
databases as input). In this case, we know we can answer the query on tuples in area2. However, we cannot
answer the query on tuples in area1.
[Figure: three Venn diagrams over the Faculty (SS#), Student (Stud_ID, Tel#), and Staff (SSN, Phone#)
databases: (A) one global ID across all three databases; (B) partial GIDs SS# = Stud_ID and
Stud_ID = SSN, with regions area1 and area2; (C) partial GIDs SS# = Stud_ID and Tel# = Phone#]
Figure 13: Scenarios of GID Existence
Now consider the following situations:
- The user asks a query based on tuples in area2: The system uses the two partial global IDs
(Faculty.SS#, Student.Stud_ID) and (Student.Stud_ID, Staff.SSN) to answer the query, without
identifying a partial global ID between the Faculty and Staff databases.
- The user asks a query based on tuples in area1: The system needs to identify a partial global ID
between Faculty and Staff to combine with the two partial global IDs (Faculty.SS#, Student.Stud_ID) and
(Student.Stud_ID, Staff.SSN) to answer the query. Semint will then use either Faculty or Staff to train a
neural network and then identify corresponding attributes between the Faculty and Staff databases. Under
the assumption that similarity can be propagated, the facts that Faculty.SS# is similar to Student.Stud_ID
and Student.Stud_ID is similar to Staff.SSN suggest that Faculty.SS# is similar to Staff.SSN, so the
scenario in Figure 13(B) can be treated as the scenario in Figure 13(A).
Figure 13(C) shows another scenario in which we know there is a partial GID available, since Faculty.SS#
matches Student.Stud_ID and Student.Tel# matches Staff.Phone# (there is no ID in the Staff database).
In this scenario we should combine the two partial global IDs (Faculty.SS#, Student.Stud_ID) and
(Student.Tel#, Staff.Phone#) as a GID. Whether or not the system needs to identify the corresponding
attributes between Faculty and Staff as a partial GID depends on what the query asks for (as discussed
above).
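Combining partial global IDs under the propagation assumption amounts to taking connected components over the pairwise correspondences; a minimal sketch (the function name is our own):

```python
from collections import defaultdict

def combine_partial_gids(pairs):
    """Treat each pairwise attribute correspondence as an edge and
    return the connected components; under the assumption that
    similarity propagates, each component acts as one combined GID."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for node in graph:
        if node in seen:
            continue
        component, stack = set(), [node]
        while stack:
            n = stack.pop()
            if n not in component:
                component.add(n)
                stack.extend(graph[n] - component)
        seen |= component
        components.append(component)
    return components

gids = combine_partial_gids([("Faculty.SS#", "Student.Stud_ID"),
                             ("Student.Stud_ID", "Staff.SSN")])
# one combined GID covering all three databases, as in Figure 13(B)
```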
8 Conclusion and Future Work
A major problem in heterogeneous databases is determining how to handle information from different
databases that refers to the same real-world entity. Performing this mapping before the combined
information is needed is a difficult task, and maintaining the resulting integrated schema may not be worthwhile,
particularly if queries on multiple databases are infrequent. Multidatabase query languages allow this
mapping to be specified as part of the query, by providing functions to manipulate different data
representations and merge the results from local databases. However, this still requires the user to determine the
needed mappings in advance, even if such mappings do not need to be part of the heterogeneous database
system. In this paper we presented an approach to dynamic data integration in query processing with ranked
role sets in which the existence of a global ID is not assumed.
Our approach is dynamic integration: allowing the mappings that combine information from different
databases describing the same real-world entity to be determined after the query is issued. We have
presented a method that uses ranked role-sets to present query results to the user based on likely attribute
correspondences between the databases. The user is presented with multiple query results, ranked by
degree of confidence, and only needs to check and confirm the results. We also discussed how a domain
knowledge-based filtering program can further assist the user. Our approach can be considered semantic
integration using a query language. However, if attribute correspondences are available or semantic
integration has been done before, our approach can be considered a multidatabase system based on role-sets,
since semantic integration (using Semint) only needs to be done once.
Future work includes developing an automated technique to identify partial global IDs and combine
them into a GID, as discussed in Section 7.3.
References
[AKWS95] Shailesh Agarwal, Arthur M. Keller, Gio Wiederhold, and Krishna Saraswat. Flexible relation:
An approach for integrating data from multiple, possibly inconsistent databases. In Proceedings
of the 11th International Conference on Data Engineering, pages 495-504, Taipei, Taiwan,
March 1995. IEEE.
[BHP94] M. W. Bright, A. R. Hurson, and S. Pakzad. Automated resolution of semantic heterogeneity
in multidatabases. ACM Transactions on Database Systems, 19(2):212-253, June 1994.
[GMS94] Cheng Hian Goh, Stuart E. Madnick, and Michael D. Siegel. Context interchange: Overcoming
the challenges of large-scale interoperable database system. In Proceedings of the 3rd
International Conference on Information and Knowledge Management, pages 337-346. ACM,
November 1994.
[HDG84] H.Y. Hwang, U. Dayal, and M. Gouda. Using semiouterjoins to process queries in multidatabase
systems. In Proceedings of the Third ACM SIGACT-SIGMOD Symposium on Principles of
Database Systems, pages 153-162. ACM, April 1984.
[LAZ+89] W. Litwin, A. Abdellatif, A. Zeroual, B. Nicolas, and P. Vigier. MSQL: A multidatabase
language. Information Sciences, 49:59-101, 1989.
[LC94] Wen-Syan Li and Chris Clifton. Semantic integration in heterogeneous databases using neural
networks. In Proceedings of the 20th International Conference on Very Large Data Bases, pages
1-12, Santiago, Chile, September 12-15 1994. VLDB.
[LC95] Wen-Syan Li and Chris Clifton. Semint: A system prototype for semantic integration in
heterogeneous databases. In Proceedings of the 1995 ACM SIGMOD Conference, San Jose,
California, May 23-25 1995.
[LMR90] W. Litwin, L. Mark, and N. Roussopoulos. Interoperability of multiple autonomous databases.
ACM Computing Surveys, 33:267-293, 1990.
[MH80] Dennis McLeod and Dennis Heimbigner. A federated architecture for database systems. In
Proceedings of the National Computer Conference, pages 283-289, Anaheim, CA, May 1980.
AFIPS.
[ON94] Aris M. Ouksel and C. F. Naiman. Coordinating context building in heterogeneous information
systems. Journal of Intelligent Information Systems, 3:151-183, 1994.
[PB94] William J. Premerlani and Michael R. Blaha. An approach for reverse engineering of relational
databases. Communications of the ACM, 37(5):42-49, May 1994.
[PS95] Christine Parent and Stefano Spaccapietra. Database integration: An overview of issues and
approaches. Submitted to Communications of the ACM, 1995.
[RR84] A. Rosenthal and D. Reiner. Extending the algebraic framework of query processing to handle
outerjoins. In Proceedings of the 10th International Conference on Very Large Data Bases, pages
334-343, August 1984.
[SC94] Peter Scheuermann and Eugene I. Chong. Role-based query processing in multidatabase systems.
In Proceedings of the International Conference on Extending Database Technology, pages
95-108, March 1994.
[SK92] Amit Sheth and Vipul Kashyap. So far (schematically) yet so near (semantically). In Proceedings
of the IFIP TC2/WG2.6 Conference on Semantics of Interoperable Database Systems,
Victoria, Australia, November 1992.
[SL90] Amit Sheth and James Larson. Federated database systems for managing distributed, heterogeneous,
and autonomous databases. ACM Computing Surveys, 22(3):183-236, September
1990.
[ZSC95] J. Leon Zhao, Arie Segev, and Abhirup Chatterjee. A universal relation approach to federated
database management. In Proceedings of the 11th International Conference on Data
Engineering, pages 261-270, Taipei, Taiwan, March 1995. IEEE.