
Dynamic Integration and Query Processing with Ranked Role Sets

Peter Scheuermann

Dept. of EECS, Northwestern University

2145 Sheridan Road

Evanston, Illinois, 60208-3118

Email: [email protected]

Wen-Syan Li†

CIMIC, Rutgers University

180 University Ave

Newark, NJ 07102

Email: [email protected]

Chris Clifton‡

The MITRE Corporation

K308, 202 Burlington Road

Bedford, MA 01730

Email: [email protected]

Abstract

The role-set approach is a new conceptual framework for data integration in multidatabase systems

that maintains the materialization autonomy of local database systems and provides users with more

accurate information. The role-set approach presents the answer to a query as a set of relations

representing the distinct intersections between the relations corresponding to the various roles played by an entity.

In this paper we show how the basic role-based approach can be extended in the absence of information

about the multidatabase keys (global IDs). We propose a strategy based on ranked role-sets that makes

use of a semantic integration procedure based on neural networks to determine candidate global IDs.

The data integration and query processing steps then produce a number of role-sets, ranked by the

similarity of the candidate IDs.

1 Introduction

The capability to make a large number of databases interoperable has become a crucial element in the

development of new information systems. The number of databases that may potentially cooperate in

a given organization can be very large; on the other hand a particular application is most likely to use

only a small subset of these databases. Multidatabase or federated database systems provide for the

interoperability of autonomous database systems without requiring their global integration [MH80, LMR90,

SL90, SC94, PS95].

In order to answer queries in multidatabase systems three distinct processes need to be performed by

the user, database administrator, and/or system as shown in Figure 1. The Schema Integration includes a

possible schema transformation step, followed by correspondence identification, and an object integration

The author's work is supported by NSF grant IRI-9303583 and NASA grant NAG2-846.

† This material is based upon work supported by the National Science Foundation under Grant No. CCR-9210704. The work described in this paper was performed when the author was obtaining his Ph.D. at Northwestern University, Dept. of EECS.

‡ This material is based upon work supported by the National Science Foundation under Grant No. CCR-9210704. The views and opinions in this paper are those of the author and do not reflect MITRE's work position.


[Figure 1: Multidatabase Query Processing. Diagram labels: Users, DBAs, Multidatabase Query, Query Processing, Schema Integration, Data Integration, Global ID, Integrated Schema/Mapping, Sub-Query, Tuples, Component Databases 1 to N.]

and mapping construction step [PS95]. A major subproblem in the correspondence identification step is

semantic integration: determining which attributes are equivalent between the databases [LC94]. In Query

Processing, global queries are reformulated into sub-queries, the sub-queries are executed at the local sites,

and their results are assembled at a final site. Data Integration is complementary to Query Processing,

i.e., it determines how the values from different local databases should be merged or presented at the final

site. In fact, Query Processing is impacted by the approaches chosen for both Schema Integration and

Data Integration. For example, if structural differences are resolved via generalization [HDG84] and data

integration is performed via aggregate operators, local selections in query processing become

very expensive.

In [SC94] we introduced a new conceptual framework for data integration in multidatabase systems

that maintains the materialization autonomy of the local database systems involved and provides users

with more accurate information if they so desire. The role-set approach is based on the observation that

many conflicting data values are not actually inconsistencies (as assumed in [AKWS95]), but values that

correspond to the same real world object appearing in multiple roles. The role-set method presents the

answer to a query as a set of relations representing the distinct intersections between the relations corresponding

to the various roles. A basic assumption of the role-set approach is that a multidatabase key

(global ID) is known that can serve as a global object identifier to relate the object instances corresponding

to a real world entity. In this paper we show how the basic role-based approach can be extended in the

absence of information about the multidatabase keys (global IDs). We propose a strategy based on ranked

role-sets that makes use of a semantic integration procedure based on neural networks (Semint [LC95])

to determine candidate global IDs with different degrees of similarity. The Data Integration and Query

Processing steps then produce multiple role-sets, ranked by the similarity of the candidate global IDs.

1.1 The Problem

We illustrate the problem and our approach using the following example: Assume that our multidatabase

integrates two local schemas, namely FACULTY(SS#, Faculty_Name, Salary) and STUDENT(Stud_ID,


[Figure 2: Possible Global IDs in Faculty and Student. Student: (Stud_ID, Stud_Name, Stipend, Tel#); Faculty: (SS#, Facu_Name, Salary). Two candidate attribute pairs are marked: one with similarity = 0.98 ("Global ID!") and one with similarity = 0.21 ("Global ID?").]

Stud_Name, Stipend, Tel#), as shown in Figure 2. Suppose that we want to find people with salaries

greater than $30,000. The salary of a student instructor may come from two sources, faculty salary and

student stipend. Since a student instructor appears in both FACULTY and STUDENT roles, the role-set

approach presents the answer to the query as three distinct intersections between the two schemas: the set

of those qualifying persons who appear only in the role FACULTY; the set of qualifying persons only in the

role STUDENT; and the set of persons who play both roles, FACULTY ∩ STUDENT. However, to create

the intersection FACULTY ∩ STUDENT, the user has to know the global ID that defines which items in

FACULTY and STUDENT refer to the same person. If a complete integration schema is not available and

the user does not have this knowledge, he may try to do this intersection on the global ID (Faculty.SS#,

Student.Tel#) since Tel# looks similar to SS#. The user then receives the surprising result of an empty

set.

Instead, under the ranked role-set approach the query first invokes an automated semantic

integration tool (Semint) [LC95] that produces a list of attributes in the STUDENT database that are

likely to correspond to FACULTY.SS#: Stud_ID (with a similarity of 0.98) and Tel# (with a similarity

of 0.21). Query Processing will then issue two multidatabase queries that use (Faculty.SS#,

Student.Stud_ID) and (Faculty.SS#, Student.Tel#) as global IDs, respectively. The user will be presented

with two role sets, ranked by their degrees of confidence. Here we rank the degree of confidence of a

result based on the similarity of the corresponding attributes used for the global ID. The first role set, constructed

using (Faculty.SS#, Student.Stud_ID) as the global ID, consists of three intersections: FACULTY only,

STUDENT only, and FACULTY ∩ STUDENT with a high degree of confidence (0.98). The second role

set, constructed using (Faculty.SS#, Student.Tel#) as the global ID, consists of only two intersections: FACULTY

only and STUDENT only, as the FACULTY ∩ STUDENT set is empty. This result has a low degree of

confidence (0.21). The second query using (Faculty.SS#, Student.Tel#) could have been avoided if the

user had specified a threshold for the similarity of attribute pairs.
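The ranking just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ranked_role_sets`, the sample rows, and the dictionary layout are hypothetical, the similarities 0.98 and 0.21 are taken from the example above, and the salary condition of the query is left out so that only the global-ID matching is shown.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Candidate = Tuple[str, str, float]  # (FACULTY attribute, STUDENT attribute, similarity)


def ranked_role_sets(faculty: List[Row], student: List[Row],
                     candidates: List[Candidate],
                     threshold: float = 0.0) -> List[dict]:
    """Build one role set per candidate global ID, most similar candidate first."""
    ranked = []
    for f_attr, s_attr, similarity in sorted(candidates, key=lambda c: c[2], reverse=True):
        if similarity < threshold:
            continue  # a user-supplied threshold prunes unlikely global IDs
        f_keys = {row[f_attr] for row in faculty}
        s_keys = {row[s_attr] for row in student}
        ranked.append({
            "global_id": (f_attr, s_attr),
            "confidence": similarity,
            "faculty_only": sorted(f_keys - s_keys),
            "student_only": sorted(s_keys - f_keys),
            "both": sorted(f_keys & s_keys),
        })
    return ranked


# Hypothetical sample rows; only the attribute names come from the example.
FACULTY = [
    {"SS#": "111-11-1111", "Facu_Name": "A", "Salary": 50000},
    {"SS#": "333-33-3333", "Facu_Name": "C", "Salary": 60000},
]
STUDENT = [
    {"Stud_ID": "222-22-2222", "Stud_Name": "B", "Stipend": 25000, "Tel#": "555-0100"},
    {"Stud_ID": "333-33-3333", "Stud_Name": "C", "Stipend": 30000, "Tel#": "555-0101"},
]
CANDIDATES = [("SS#", "Tel#", 0.21), ("SS#", "Stud_ID", 0.98)]

results = ranked_role_sets(FACULTY, STUDENT, CANDIDATES)
pruned = ranked_role_sets(FACULTY, STUDENT, CANDIDATES, threshold=0.5)
```

With these rows, the first role set (confidence 0.98) has a non-empty FACULTY ∩ STUDENT part, the second (confidence 0.21) has an empty one, and a threshold of 0.5 suppresses the second query altogether.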

The end user is able to distinguish between unreasonable and reasonable answers by examining the role

sets ranked with the appropriate degrees of confidence. As a consequence, he will most likely come to the

correct conclusion that (Faculty.SS#, Student.Stud_ID) is the global ID. In subsequent queries the user

uses only GID(Faculty.SS#, Student.Stud_ID) and no semantic integration needs to be carried out, as the

GID is known. As in [ON94], the user query provides the context for semantic integration. The user effort

involved in the integration is limited to identifying the databases and relations; the rest of the semantic

integration is automated.


1.2 Paper Organization

The rest of this paper is organized as follows. In Section 2 we review related work in multidatabase

and federated database systems. Section 3 outlines our semantic integration tool (Semint). Section 4

describes the basic role set approach to data integration and query processing. In Section 5 we provide

a new framework for dynamic integration and query processing based on ranked role sets. In Section 6

we provide example scenarios to illustrate our approach. In Section 7 we discuss the scalability of our

approach and extensions to dealing with partial global IDs. Finally we give our concluding remarks in

Section 8.

2 Related Work

Early work in multidatabase architectures focused on procedures to merge individual schemas into a single

global conceptual schema. The global schema multidatabase approach requires complete integration; the

global schema must map all local schemas to a single global view. The amount of knowledge required

about local schemas and how to identify and resolve heterogeneity among the local schemas is a major

problem with this approach. The global schema must be developed before any queries can be issued.

Changes to local schemas must be reflected by corresponding changes in the global schema. This causes

major difficulties in maintaining the global schema. Because of the complexity of a global schema, a small

change to a local schema (e.g., adding or deleting an attribute) may require huge changes to the global schema.

[GMS94] argues that existing a-priori or static integration strategies might provide satisfactory support

for small or static systems, but not for large-scale interoperable database systems operating in a dynamic

environment.

Federated databases [MH80, PS95], which require only partial integration, integrate a collection of loosely coupled

local database systems by supporting interoperability rather than through a complete global schema.

However, although the impact of a change to a local schema may be smaller, any change to the local

schema may require some change to the federated schema. Maintaining these mappings is still difficult.

Multidatabase systems [HDG84, LAZ+89, SC94, BHP94] attempt to resolve the problems described

above by dispensing completely with the need for a global or partial schema. This approach puts the

integration responsibility on the user by providing him with functionality beyond standard SQL in order

to specify some integration information as part of the query. MSQL [LAZ+89] was the first multidatabase

language proposed as an extension of SQL. The answer to a query in MSQL is a multi-relation: a set of

relations dynamically created by the query that come from different local databases. Extensions to SQL

were also provided in the federated database context [ZSC95]. The universal relation model was used in

[ZSC95] in order to express the metadata in the federated schema, and thus reduce the difficulty of writing

queries in the federated SQL.

Although multidatabase system languages eliminate the need for a global or partial schema, some

important issues in the schema integration process remain to be solved. The three-step schema integration

process described in [PS95] includes, as mentioned before, a pre-integration procedure for putting the local

schemas in a more homogeneous format, correspondence identification, and the actual integration procedure.


Multidatabase languages [LAZ+89, SC94] eliminate the need to perform the last step, but the crucial

correspondence identification is still required in order to specify global IDs.

The process of identifying corresponding attributes is also referred to as semantic integration and has

been recognized as one of the fundamental problems in database systems interoperability [SK92]. In

semantic integration, attributes (classes of data items) need to be compared pairwise to determine their

equivalence. [LC94] points out that three levels of metadata can be automatically extracted from

the local databases and used subsequently to aid the semantic integration process: attribute names (the

dictionary level), schema information (the field specification level) and data contents and statistics (the

data content level). As GM's efforts have shown, the attribute name metadata was not sufficient for

semantic integration [PB94]; using only this level of metadata, only a few obvious matches were found.

However, similarities in schema information were found to be useful. For example, it was discovered in one

case that attributes of type char(14) were equivalent to those of char(15) (with an appended blank). A

further distinction can be made between manual methods for semantic integration and automated or

semi-automated methods [LC94]. The semantic integration procedure based on neural networks (Semint

[LC95]) is an example of the latter.

The methods used for resolving structural differences during schema integration impact the data integration

process. Data integration is concerned with combining the data values that reflect the same

information for a given entity whose components may come from multiple local databases. Structural differences

have been resolved by one of the following methods: outerjoins [RR84], generalizations [HDG84],

multiple relations [LAZ+89], role-sets [SC94], flexible relations [AKWS95], or universal relations [ZSC95].

The universal relation approach implies that expensive outerjoins are required for data integration. Generalizations

require that aggregate operations are used to resolve inconsistencies. We argue that this is a violation

of materialization autonomy, namely that the views presented by the local database systems are not preserved

in the answers given by the federated or multidatabase system. The approaches based on multiple

relations, role-sets, and flexible relations all extend the concepts of the classical relational model in order

to deal with data inconsistencies. In the role-set approach real-world objects may belong to different intersections

depending upon the number of roles for which they have materializations. In comparison, in

flexible relations all real world objects are represented uniformly as clusters of tuples that may appear in

one (flexible) relation only.

It is important to further distinguish between static versus dynamic methods of integration. Static

schema integration requires that the integration is completed before any queries can be issued against the

multidatabase system. Most of the methods for semantic integration reviewed above are static. Some of

them are highly automated, but the semantic integration must be finished before any queries are written.

We use the term static data integration to refer to those methods that determine in advance the method

to resolve data inconsistencies and hence all inconsistencies are resolved in the same way. Aggregation

applied in the case of generalizations is an example of a static data integration method [HDG84]. On the

other hand, role-sets [SC94] and multiple relations [LAZ+89] qualify as dynamic data integration methods.

The result of a query consists of multiple relations or sets; furthermore, the number of relations in which

a real-world object appears or the particular intersection of the role-set in which it belongs is determined


dynamically by the number of distinct materializations (roles) that it possesses.

In this paper we combine role-sets with Semint [LC95] in order to perform semantic and data integration

dynamically. This reduces the a-priori or static effort required for schema integration in multidatabase

systems. The semantic integration is performed incrementally, only when needed; thus no global

integration is required. This process is performed as a by-product of query processing with the user

query providing the context for the semantic integration. Specifically, the query result presented to the

user is a sequence of ranked role-sets, one for each candidate global ID. The user examines these role-sets

in order to determine which makes most sense and hence determines the global IDs dynamically.

We will now give more details on Semint and the role-set approach and then discuss our dynamic

integration method.

3 Semantic Integration with Neural Networks

Neural networks have emerged as a powerful pattern recognition technique. They are useful in a wide

range of applications. Unlike traditional programs, neural networks are trained, not programmed. Neural

networks act on data by detecting underlying organizations or clusters. For example, the input characters

can be grouped by detecting how they resemble one another. The networks learn the similarities among

patterns directly from instances of them. That means neural networks can infer classifications without

prior knowledge of regularities.

Traditional algorithmic approaches are best for tasks where exact rules are easy to define and perfect

accuracy is critical. Neural networks, on the other hand, have advantages when dealing with imperfect

data, classifying data without obvious rules, and discovering relationships between data. We feel that

neural networks are more suitable than traditional algorithms for determining the semantic equivalence

between a pair of attributes since:

• The availability of metadata and the semantics of terms may vary, and the relationship between two attributes is usually fuzzy;

• It is hard to define and assign probabilities to rules for comparing aspects of two attributes. The knowledge of how to match attributes needs to be discovered directly from data, not pre-programmed; and

• Pre-defined rules and probabilities that work for one pair of databases may not work for other pairs of databases. They need to be adjusted dynamically.

Semint (SEMantic INTegrator) [LC95] is a system for semantic integration based on neural network

techniques. Figure 3 outlines the semantic integration procedure in Semint. In this procedure, DBMS-specific

parsers extract metadata: schema information (such as data types, length, scale, precision, and

existence of constraints such as primary keys, foreign keys, candidate keys, value and range constraints,

disallowing null values, and access restrictions) and data content statistics from a small portion of sample

data (such as maximum, minimum, average (mean), variance, coefficient of variance, existence of null


[Figure 3: Overview of the Semantic Integration Procedure in Semint. Stages shown: DBMS-specific parsers extract database information (data contents and schema); attributes are classified and training data (cluster centers) is generated; networks are trained to recognize patterns; the trained networks determine similarity between attributes; users check and confirm the results (equivalent attributes and similarity).]

[Figure 4: (A) Classifier in Semint (B) Back-Propagation Neural Network Result. Panel (A): an input layer of N nodes (characteristics such as Length, Key, Value constraint, Datatype, Average) and an output layer of M nodes (categories); Employee.id#, Dept.employee, and Payroll.SSN fall into one category, alongside categories for Name, Address, and Telephone#. Panel (B): an input layer of N nodes, a hidden layer of (N+M)/2 nodes, and an output layer of M nodes; for the attribute Health_Plan.Insured#, category 3 receives similarity 0.92 and the other outputs shown are 0.05, 0.08, and 0.03.]

values, and existence of decimals). The metadata is then transformed into a single format (so that it can

be compared). If a database contains 10 attributes and we extract 20 characteristics to describe the

"signatures" of these attributes, the parser output has 10 vectors, where each vector has 20 values in the

range [0..1]. The details of the normalization process are described in [LC94]. Then, a classifier is used

to cluster attributes into categories in a single database. The classifier output is used to train a neural

network to recognize these categories; this trained network can then determine similar attributes between

databases.
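To make the shape of the parser output concrete, the following sketch turns raw per-attribute characteristics into signature vectors with values in [0..1]. It is not Semint's normalization, which is described in [LC94]; plain min-max scaling per characteristic is used here purely as a stand-in, and the function name `to_signatures`, the three characteristics, and the sample values are all hypothetical.

```python
from typing import Dict, List


def to_signatures(raw: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Scale each characteristic (column) into [0, 1] across all attributes.

    Assumption: min-max scaling stands in for the normalization of [LC94].
    """
    names = list(raw)
    n_chars = len(raw[names[0]])
    scaled_columns = []
    for j in range(n_chars):
        column = [raw[name][j] for name in names]
        lo, hi = min(column), max(column)
        span = hi - lo
        # A constant column carries no information; map it to 0.0.
        scaled_columns.append([0.0 if span == 0 else (v - lo) / span for v in column])
    return {name: [scaled_columns[j][i] for j in range(n_chars)]
            for i, name in enumerate(names)}


# Hypothetical characteristics per attribute: (length, is-key flag, average value).
raw_metadata = {
    "SS#": [9, 1, 0.0],
    "Salary": [8, 0, 52000.0],
    "Name": [30, 0, 0.0],
}
signatures = to_signatures(raw_metadata)
```

Three attributes described by three characteristics yield three vectors of three values each, in the same way that 10 attributes and 20 characteristics yield 10 vectors of 20 values.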

The only human input is to specify DBMS types and database connection information and to examine

and confirm the output results. Other processes can be fully automated. Users are shown corresponding

attributes with similarity greater than a threshold set by the users. The users can further specify the

maximum number of similar attributes to be retrieved (e.g., the top 10 pairs with similarity > 0.8). Note that

the training of a neural network needs to be done only once; the actual use of the network to determine

attribute correspondences is very efficient and can be done almost instantly.

Semint uses the Self-Organizing Map algorithm, an unsupervised learning algorithm, as the classifier to

categorize attributes within a single database. We have adapted this algorithm so that users can determine

how fine these categories are by setting the radius of clusters rather than the number of categories; if

desired, users may examine the output and adjust this radius to cluster like attributes together. The

output of the classifier is the vectors of cluster center weights. As shown in Figure 4(A), "Employee.id#",


"Dept.employee", and "Payroll.SSN" are clustered into one category since their input characteristics (and

real world meanings) are close to each other.

The output of the classifier (M vectors) is then used as training data for a back-propagation network,

a supervised learning algorithm. The "supervision" is that target results are provided; however, as these

target results are the output of the classifier, no user supervision is needed. The similarities are a measure

of how close the vector describing an input attribute is to each of the vectors of the training data (cluster

centers). After the back-propagation network is trained, we present it with a new attribute (a vector) such

as "Health_Plan.Insured#" which is not in the training data. This trained neural network determines the

similarity between "Health_Plan.Insured#" and each of the M categories. In Figure 4(B) the network shows

that the input pattern "Insured#" is closest to category 3 ("Employee.id#", "Dept.employee", and

"Payroll.SSN") with similarity = 0.92, while its similarity to the other categories is low.

The "distance function" for closeness is not pre-defined, but is learned directly from the database semantics

during the training process, and will vary depending on the information contained in the database (allowing

Semint to adjust itself to different database domains). Therefore similarity does not correspond to a

percentage or fixed distance function, but is a domain-specific value that can be used to rank the likelihood

that two attributes reflect the same real-world information.

The performance of Semint on databases with a number of attributes less than 40 is satisfactory (on

an IBM RS/6000 Model 220) - less than 0.1 second CPU time to classify attributes, less than 7 seconds

to train neural networks, and less than 0.1 second to determine the similarity. The Recall is excellent

(100%) with Precision ranging from 90% to 100%. For detailed experimental results, please see [LC94]. In

Section 7.1 we further discuss the scalability of using neural networks on large databases with hundreds of

attributes (an environment where manual integration is an almost impossible task).

4 Role-Set Based Query Processing in Multidatabase Systems

One challenge in multidatabase query processing is merging intermediate results from heterogeneous local

databases. Because of local database autonomy, we are not able to change the data structures of local

databases or force local databases to prepare their results according to a uniform format or

structure. Therefore, data integration is necessary and needs to be carried out at the global site before

presenting final results to users. We argue that many of the so-called inconsistencies that appear in data

integration are not real inconsistencies, but reflect the fact that we are dealing with different values that an

attribute can take for distinct roles of the same real-world entity. A new concept of data integration based

on role sets was first proposed in [SC94]. Using the role set approach a user has the option to specify how

the system should resolve inconsistencies, e.g., whether users want to see aggregate values for an attribute

or all the values derived by the individual systems.

4.1 Data Integration with Role Sets

In order to illustrate these concepts we consider the FACULTY and STUDENT example as shown in

Figure 5. The object with Global ID Z appears in both roles, while the other objects appear in only


FACULTY: (X, 50K), (Y, 35K), (Z, 60K)

STUDENT: (W, 25K), (Z, 30K)

Figure 5: FACULTY and STUDENT Roles

(A) GENERALIZATION APPROACH: ..., (Z, 45K), ...

(B) MULTIPLE RELATIONS APPROACH: FACULTY: (X, 50K), (Y, 35K), (Z, 60K); STUDENT: (W, 25K), (Z, 30K)

(C) ROLE-SET APPROACH: FACULTY ONLY: (X, 50K), (Y, 35K); STUDENT ONLY: (W, 25K); FAC ∩ STU: (Z, 60K), (Z, 30K)

Figure 6: Approaches of Resolving Structural Differences

one role. Let us now consider the query \retrieve all persons with a salary greater than $30,000." When

schema/data integration is performed according to the generalization approach described in [HDG84], the

federated database is viewed as consisting of the generalized entity set (e.g. PERSON(ID,Salary)). In

addition, the inconsistency in data values is resolved by defining an aggregate function such as average

over the overlapping data values as shown in Figure 6(A). An alternative approach has been taken in

[LAZ+89], where in response to the above query the user is presented with multiple relations, one for each

role, as shown in Figure 6(B). The problem with this approach is that tuples corresponding to real-world

objects that appear in both FACULTY and STUDENT roles, as happens with the object with GID=Z,

are scattered through both tables and the answer is hard to visualize.

Using the role-set approach, the user has the option to specify that the answer should be presented as

a set of sets (role-set) representing the distinct intersections between the various roles as shown in Figure

6(C). Notice that an intersection such as FACULTY ∩ STUDENT in Figure 6(C) contains, for each real-world

object, all the tuples representing the distinct materializations where that object appears. Hence, this

set does not correspond directly to a relation in the traditional sense, since it contains tuples corresponding

to different schemas that are not union-compatible. We denote these entity-based intersections between

relations $R_1, R_2, \ldots, R_n$, corresponding to distinct roles, as a role-set, Role-set($R_1, R_2, \ldots, R_n$). Figure 7

illustrates conceptually the different elements of Role-set($R_1, R_2, R_3$).
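As an illustration of the definition, a role-set over any number of roles can be computed by grouping every materialization under the exact set of roles its global ID plays. The sketch below is an assumption-laden toy, not the implementation of [SC94]: the function name `role_set` and the dictionary representation are hypothetical, and the data follows the FACULTY/STUDENT example (amounts in thousands).

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Materialization = Tuple[str, str, int]  # (GID, role, value)


def role_set(roles: Dict[str, Dict[str, int]]) -> Dict[FrozenSet[str], List[Materialization]]:
    """Group all materializations by the exact set of roles in which a GID appears."""
    played = defaultdict(set)
    for role, tuples in roles.items():
        for gid in tuples:
            played[gid].add(role)

    elements = defaultdict(list)
    for gid, role_names in played.items():
        # Each element keeps every materialization of the object, one per role.
        for role in sorted(role_names):
            elements[frozenset(role_names)].append((gid, role, roles[role][gid]))
    return dict(elements)


example = role_set({
    "FACULTY": {"X": 50, "Y": 35, "Z": 60},
    "STUDENT": {"W": 25, "Z": 30},
})
faculty_only = example[frozenset({"FACULTY"})]
student_only = example[frozenset({"STUDENT"})]
both_roles = example[frozenset({"FACULTY", "STUDENT"})]
```

For n roles there are at most 2^n − 1 non-empty elements, which gives the seven regions of a three-role diagram; elements with no members are simply absent from the result.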


[Figure 7: Role-set of ($R_1, R_2, R_3$). Diagram of the regions formed by $R_1$, $R_2$, and $R_3$.]

4.2 Role-Set Based Query Formulation

We have defined extensions to MSQL that allow us to specify and manipulate role-sets. A role-set is created

via the MULTISELECT statement as illustrated in the following example:

MULTISELECT * FROM ROLE-SET(X,Y) USING GID

where GID is a global identifier and X and Y are the roles according to which the various entity-based

intersections are created. The result of this statement is conceptually equivalent to executing the following

pseudo-SQL code:

(X-Y) : SELECT * FROM X WHERE GID NOT IN (SELECT GID FROM Y)

(Y-X) : SELECT * FROM Y WHERE GID NOT IN (SELECT GID FROM X)

(X∩Y) : SELECT * FROM X,Y WHERE (X.GID = Y.GID) =

SELECT * FROM X WHERE GID IN (SELECT GID FROM Y) "+"

SELECT * FROM Y WHERE GID IN (SELECT GID FROM X)

Note that "+" stands for a pseudo-union that contains all the materializations of a real-world object. To this

role-set we can now apply modified select, project, and aggregate operations as well as the r-join operation

that performs joins between elements of the role-set. At the end, after all operations are performed, the

result is transformed for presentation to the user in standard relational form, i.e., in the case of (X ∩ Y),

it consists of tuples of the form (GID, X-attributes, Y-attributes). The modified existential and universal

quantifiers allow us to perform selection from the role-set in the following fashion. A WHERE clause with

a ∃ condition will select all materializations of a real-world object in an intersection $R_1 \cap R_2 \cap \ldots \cap R_n$ as long

as one of them satisfies the selection criteria. On the other hand, a clause with the ∀ condition implies

that the select condition must be satisfied by all the materializations of an object.
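The expansion of MULTISELECT into ordinary queries can be tried out with Python's built-in `sqlite3` module. The sketch below is only an approximation under stated assumptions: the tables, columns, and rows are hypothetical, both roles are assumed to live in one local database, and since the pseudo-union "+" has no SQL counterpart for schemas that are not union-compatible, it is imitated with a UNION ALL over a common (role, GID, amount) shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE X (GID INTEGER PRIMARY KEY, Salary INTEGER);
    CREATE TABLE Y (GID INTEGER PRIMARY KEY, Stipend INTEGER);
    INSERT INTO X VALUES (1, 50), (2, 35), (3, 60);
    INSERT INTO Y VALUES (3, 30), (4, 25);
""")

# (X-Y): objects that play role X only.
x_only = conn.execute(
    "SELECT * FROM X WHERE GID NOT IN (SELECT GID FROM Y) ORDER BY GID"
).fetchall()

# (Y-X): objects that play role Y only.
y_only = conn.execute(
    "SELECT * FROM Y WHERE GID NOT IN (SELECT GID FROM X) ORDER BY GID"
).fetchall()

# (X∩Y): every materialization of the objects that play both roles.
# The role tag and the shared 'amount' column are a simplification of "+".
x_and_y = conn.execute("""
    SELECT 'X' AS role, GID, Salary AS amount FROM X
        WHERE GID IN (SELECT GID FROM Y)
    UNION ALL
    SELECT 'Y' AS role, GID, Stipend AS amount FROM Y
        WHERE GID IN (SELECT GID FROM X)
    ORDER BY GID, role
""").fetchall()

# Final presentation of (X∩Y) as (GID, X-attributes, Y-attributes).
presented = conn.execute(
    "SELECT X.GID, X.Salary, Y.Stipend FROM X JOIN Y ON X.GID = Y.GID"
).fetchall()
conn.close()
```

The last query corresponds to the transformation into standard relational form that takes place after all role-set operations have been applied.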

Example. Assume that relations R and S represent two different roles of an object and they contain tuples
with attributes (GID, Salary). The current extensions are
$R = (\langle 1, 50K \rangle, \langle 2, 25K \rangle, \langle 3, 12K \rangle)$ and
$S = (\langle 1, 25K \rangle, \langle 3, 30K \rangle)$. The intersection $R \cap S$ is performed with respect to GIDs:
$(R \cap S)_{ENT} = (\langle 1, 50K \rangle, \langle 1, 25K \rangle, \langle 3, 12K \rangle, \langle 3, 30K \rangle)$.
A select($\exists$) with respect to the qualification Salary > 20K will return the set $(R \cap S)_{ENT}$
since for each of the two objects at least one materialization qualifies. On the other hand, a select($\forall$)
with respect to the same qualification will return the set
$(R \cap S)' = (\langle 1, 50K \rangle, \langle 1, 25K \rangle)$ since one of the materializations of the object
with GID=3 does not qualify. $\Box$
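The example can be replayed with a small Python sketch. The helper names `intersect_by_gid` and `select` are hypothetical; the existential and universal quantifiers are mapped onto Python's built-in `any` and `all`, which is an assumption about how the modified select could be realized, not the authors' implementation.

```python
from collections import defaultdict


def intersect_by_gid(*relations):
    """Entity-based intersection: keep all materializations of GIDs found in every relation."""
    common = set.intersection(*({gid for gid, _ in rel} for rel in relations))
    groups = defaultdict(list)
    for rel in relations:
        for gid, salary in rel:
            if gid in common:
                groups[gid].append(salary)
    return dict(groups)


def select(groups, predicate, quantifier):
    """Keep an object if the predicate holds for any (exists) or all (forall) materializations."""
    return {gid: values for gid, values in groups.items()
            if quantifier(predicate(v) for v in values)}


R = [(1, 50_000), (2, 25_000), (3, 12_000)]
S = [(1, 25_000), (3, 30_000)]

ent = intersect_by_gid(R, S)
exists_result = select(ent, lambda salary: salary > 20_000, any)
forall_result = select(ent, lambda salary: salary > 20_000, all)
```

With the extensions of R and S given above, `exists_result` equals the whole intersection, while `forall_result` keeps only the object with GID 1.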


4.3 Role-Set Based Query Processing

A major problem in query optimization in multidatabase systems is the lack of auxiliary access paths at

the global site where most relations handled are intermediate results. Hence it is important to devise

an efficient query processing strategy for dealing with intermediate data. The intermediate data to be
generated at the global site consists of a private part and overlap parts, containing the identifiers of the
corresponding tuples in the intersections defined by the role-set (plus optional aggregation attributes). In
the generalization approach of Hwang et al. [HDG84], n semi-outerjoin operations need to be performed
at the global site to find the private and overlap parts of n relations. In addition, their strategy suffers
from high communication costs since local selection is possible only for the private parts of the query and,
in the case of aggregate queries, no local selection is possible at all.

We have developed a new strategy for query processing based on our role-based model [SC94] that aims

at minimizing the amount of data to be transmitted between the local sites and the global site as well as

reducing the processing costs required at the global site for dealing with intermediate data. Our strategy

makes effective use of merge-sort/scan to produce in one iteration the private part and the various overlap
parts of the query. The private part of the query consists of the set of GIDs that appear only in one role
and satisfy the query, while the overlap parts correspond to intersections containing GIDs in multiple roles.
Thus for Role-set($R_1, R_2, R_3$) the private part consists of the GIDs in
$(R_1 \cap \bar{R}_2 \cap \bar{R}_3) \cup (\bar{R}_1 \cap R_2 \cap \bar{R}_3) \cup (\bar{R}_1 \cap \bar{R}_2 \cap R_3)$
that satisfy the query, while the overlap parts are the GIDs in
$R_1 \cap R_2 \cap \bar{R}_3, \ldots, R_1 \cap R_2 \cap R_3$ that satisfy the query.

The basic role-based query processing algorithm is outlined below:

Step 1: Each local site sends its set of GIDs and the subset of GIDs selected by the query.

Step 2: The global site (GS) performs a merge-sort for each role $R_i$ and one merge-scan to produce the
private part and overlap parts simultaneously. GS sends the private part and overlap parts to
each local site.

Step 3: Local sites perform projection for queries without aggregation (projection and selection for
aggregation) and send target attributes (and optional r-join attributes) to GS.

Step 4: GS merges data from various sites into corresponding intersections (and merge-sorts it for an
r-join).

Step 5: Optional r-joins are executed.

Note: By global site we mean that site in the federation that receives the query from the user. Any site

in the federation can act as the global site for a particular query.
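Step 2 can be illustrated with a minimal Python sketch that sorts the GIDs of each role and then produces the private and overlap parts in a single merge-scan. The function name `merge_scan` and the sample GIDs are hypothetical, and the sketch makes a simplifying assumption: it works only on the GIDs that each site reports as satisfying the query and ignores the bookkeeping for the full GID sets sent in Step 1.

```python
import heapq
from itertools import groupby


def merge_scan(selected_gids):
    """Map each combination of roles to the sorted GIDs that occur in exactly those roles.

    selected_gids: dict role name -> iterable of GIDs reported by that local site.
    """
    # One sorted stream of (gid, role) pairs per role (the per-role merge-sort).
    streams = [
        [(gid, role) for gid in sorted(set(gids))]
        for role, gids in selected_gids.items()
    ]
    parts = {}
    # A single pass over the merged streams (the merge-scan).
    for gid, group in groupby(heapq.merge(*streams), key=lambda pair: pair[0]):
        roles = frozenset(role for _, role in group)
        parts.setdefault(roles, []).append(gid)
    return parts


parts = merge_scan({"R1": [5, 1, 3, 2], "R2": [4, 2, 3], "R3": [3, 6, 5]})
private = sorted(g for roles, gids in parts.items() if len(roles) == 1 for g in gids)
overlap = {roles: gids for roles, gids in parts.items() if len(roles) > 1}
```

Keys with a single role form the private part; keys with two or more roles are the overlap parts, one per intersection of the role-set.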

5 Dynamic Integration with Ranked Role Sets

Dynamic data integration methods, such as role-sets, still require that a Global ID be known to serve
as local and global object identifiers. We present an approach to dynamic data integration and query
processing with ranked role sets in which we relax the assumption that a global ID is known in advance.

The basic role-set approach can be extended in the absence of information about the multidatabase keys

(global ID). We see three scenarios for dynamic integration, depending on the level of knowledge of global

IDs that a user may have when issuing a query:

1. All databases are well understood and a global ID is known if it exists.

2. A subset of databases is well understood. A partial global ID is known across these databases;
however, the corresponding attributes in other databases need to be determined to come up with a
global ID.

3. No database is well understood; global IDs need to be determined "from scratch".

5.1 Query Language

Our approach works in all three scenarios, so users can submit a multidatabase query whether or not a
Global ID is known. This requires extensions to the role-set MSQL to allow specifying a Global ID, or
requesting that one be determined automatically. The extensions are as follows:

MULTISELECT attribute names
FROM ROLE-SET ( relation_1 relation_2 ... relation_n )
USING GID ( relation_1.(attribute | *) = relation_2.(attribute | *) = ... =
            relation_n.(attribute | *) [ WITH SIMILARITY > threshold ]
            [ UP TO m SETS ] )

The "MULTISELECT" clause specifies which attributes to retrieve. The "FROM ROLE-SET" clause
specifies the roles according to which the various entity-based intersections are created (as described in
Section 4.2). The "USING GID" clause specifies what is known about Global IDs, and what needs to be
determined. If the GID is known in advance (the first scenario), all attributes are specified.

Data integration and query processing when the GID is known are as described in Section 4. In the
second scenario (one attribute is specified as "*") and third scenario (both attributes are specified as
"*"), Semantic Integration needs to be carried out to generate attribute correspondences as candidate
global IDs. The clauses "WITH SIMILARITY > threshold" and "UP TO m SETS" are optional, and
are used to restrict the number of potential GID candidates (e.g., "WITH SIMILARITY > 0.9 UP TO 5
SETS").
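One way to picture how the "USING GID" clause selects a scenario is the following Python sketch. The class `GidSpec` and its fields are hypothetical names for an in-memory form of the clause; the paper does not prescribe any such structure.

```python
from dataclasses import dataclass
from typing import Optional

WILDCARD = "*"


@dataclass
class GidSpec:
    """Hypothetical in-memory form of a USING GID clause."""
    attributes: dict                      # relation name -> attribute name or "*"
    similarity: Optional[float] = None    # WITH SIMILARITY > threshold
    max_sets: Optional[int] = None        # UP TO m SETS

    def scenario(self) -> int:
        unknown = sum(1 for attr in self.attributes.values() if attr == WILDCARD)
        if unknown == 0:
            return 1    # GID known in advance
        if unknown < len(self.attributes):
            return 2    # partial GID known
        return 3        # GID must be determined from scratch

    def needs_semantic_integration(self) -> bool:
        return self.scenario() != 1


known = GidSpec({"Faculty": "SS#", "Student": "Stud_ID"})
partial = GidSpec({"Faculty": "SS#", "Student": "*"}, similarity=0.8, max_sets=3)
unknown = GidSpec({"Faculty": "*", "Student": "*"}, similarity=0.8, max_sets=3)
```

Only the second and third specifications would trigger semantic integration before the query is executed.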

5.2 Dynamic Integration and Query Processing Procedures

The overview of our framework architecture is shown in Figure 8. The procedure is outlined below:

Pre-multidatabase-query process (semantic integration)

Step 1: The users submit a multidatabase query to retrieve semantically similar data items. The "USING
GID" clause specifies the type of global ID assumption.

[Figure 8: Dynamic Integration in Query Processing with Ranked Role Sets. Components shown: Users, Multidatabase Query, Semantic Integration, Attribute Correspondence, Query Processing, Data Integration, Ranked Role Sets, and Component Databases 1, 2, ..., N.]

Step 2: If the global ID is unknown, the Semantic Integration Process (Semint) at the global site extracts
metadata from the local component databases.

Step 3: Semint uses the metadata extracted in Step 2 to generate attribute correspondences as candidate
GIDs according to the user-specified or default similarity threshold.

Multidatabase Query Processing and Data Integration

Step 4: The Multidatabase Query Processor reformulates the original multidatabase query into multiple
multidatabase queries according to the attribute correspondences. One multidatabase query is generated for
each candidate GID from the attribute correspondences whose similarity is greater than the threshold.

Step 5: The multidatabase query processor generates sub-queries for each multidatabase query generated

at Step 4 and then submits sub-queries to local component databases.

Step 6: The local component databases return the result tuples of the sub-queries executed at the local

sites to the originating site.

Step 7: The Data Integration process merges the intermediate results from various local sites by consulting
the attribute correspondences. The results are presented to the users as role sets with degrees of
confidence (ranked role sets). The degree of confidence of a ranked role set is based on the similarity of
the attribute correspondence used as a GID. One set of role sets is generated for each pair of corresponding
attributes.

Note: Steps 2 and 3 can be done in anticipation of possible queries to improve query response time and

only need to be done once.
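The reformulation in Step 4 can be sketched as follows. This is an illustration only: the function `reformulate`, the tuple layout of a correspondence, and the default threshold of 0.8 are assumptions, and the sample correspondences are the ones Semint recommends in the example of Section 6.3.

```python
def reformulate(relations, correspondences, threshold=0.8, max_sets=None):
    """Return (similarity, query text) pairs, most similar candidate GID first.

    correspondences: list of (attributes, similarity), with one attribute per
    relation, in the same order as `relations`.
    """
    kept = sorted((c for c in correspondences if c[1] > threshold),
                  key=lambda c: c[1], reverse=True)
    if max_sets is not None:
        kept = kept[:max_sets]
    queries = []
    for attrs, similarity in kept:
        gid = " = ".join(f"{rel}.{attr}" for rel, attr in zip(relations, attrs))
        text = (f"MULTISELECT * FROM ROLE-SET({','.join(relations)}) "
                f"USING GID ({gid})")
        queries.append((similarity, text))
    return queries


CORRESPONDENCES = [
    (("Salary", "Stipend"), 0.85),
    (("SS#", "Stud_ID"), 0.98),
    (("Facu_Name", "Stud_Name"), 0.91),
]
queries = reformulate(["Faculty", "Student"], CORRESPONDENCES, threshold=0.8, max_sets=3)
```

Each generated query is then decomposed into sub-queries for the component databases (Step 5), and the similarity travels with the query so that it can serve as the degree of confidence in Step 7.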


[ Faculty ]
SS#          Facu_Name   Salary
===========  ==========  ========
493-45-8735  John        $100,000
956-45-0456  Robert      $80,000
849-45-0500  Mary        $60,000
485-95-6784  Larry       $20,000
-------------------------------

[ Student ]
Stud_ID      Stud_Name   Stipend  Tel#
===========  ==========  =======  ========
476-34-5748  Jason       $11,000  674-7456
958-46-3256  Steven      $0       765-0945
485-75-2374  Mary        $0       767-5134
485-95-6784  Larry       $12,000  767-0900
----------------------------------------

Figure 9: Faculty and Student Databases

6 Example Scenario

In this section, we use some sample queries to demonstrate the dynamic integration and query processing

of our approach. Imagine that we are planning a university budget. We want to know

What are the salaries of student instructors?

The salary of a student instructor may come from two sources: Faculty salary from the University and

student stipend from the Graduate School. The faculty salary information is stored in the Faculty database

and student stipend information is stored in the Student database, as shown in Figure 9. Here we only

list relevant attributes for ease of illustration.

As discussed in Section 5, we see three scenarios based on database knowledge: all databases are well
understood so that a global ID is known, only a subset of databases is well understood (only a partial global
ID is known), and no database is well understood (global IDs need to be determined by semantic integration).

6.1 Global ID is Known

In the first scenario, the user knows that Faculty.SS# and Student.Stud_ID form a valid global identifier.
The query can be posed as:

MULTISELECT *

FROM ROLE-SET(Faculty,Student)

USING GID (Faculty.SS# = Student.Stud_ID)

The result of this query is shown in Figure 10. Semantic integration does not need to be carried out because
the GID is known. The data integration and query processing are dynamic because users can specify how to
generate the role set to resolve structural differences.

[ Faculty only role-set ]
SS#          Facu_Name   Salary
===========  ==========  ========
493-45-8735  John        $100,000
956-45-0456  Robert      $80,000
849-45-0500  Mary        $60,000
-------------------------------

[ Student only role-set ]
Stud_ID      Stud_Name   Stipend  Tel#
===========  ==========  =======  ========
476-34-5748  Jason       $11,000  674-7456
958-46-3256  Steven      $0       765-0945
485-75-2374  Mary        $0       767-5134
---------------------------------------

[ Faculty and Student role-set ]
               Facu_Name   Salary
(SS#=Stud_ID)  Stud_Name   Stipend  Tel#
=============  ==========  =======  ========
485-95-6784    Larry       $20,000
               Larry       $12,000  767-0900
-----------------------------------------

Figure 10: Answer of Using (SS#,Stud_ID) as GID with Degree of Confidence = 0.98

6.2 ID is Known for Faculty Database

In this section we discuss how our approach works in the second scenario: only a subset of databases is
understood. A partial global ID is known; however, the global ID, if it exists, needs to be determined by
semantic integration before the query can be executed. We are familiar with Faculty; however, we have
little knowledge about Student. We know the two databases should contain some similar data items such
as salary, social security number, and name. We can submit the following multidatabase query to retrieve the
salaries of student instructors, using Faculty.SS# as part of the global ID. Because the Student
database is not well understood, we specify the corresponding attribute in Student as "*". Semint will
then determine the possible corresponding attributes in Student to use with Faculty.SS# as the GID.
The query follows:

MULTISELECT *

FROM ROLE-SET(Faculty,Student)

USING GID (Faculty.SS# = Student.* WITH SIMILARITY > 0.8 UP TO 3 SETS)

The clause "USING GID (Faculty.SS# = Student.*)" causes the Semantic Integration Process to find
candidate corresponding attributes in the Student database to be combined with Faculty.SS# as a global
ID. The clause "WITH SIMILARITY > 0.8" restricts candidate attributes to those that have a degree of
similarity greater than 0.8. The query is processed in the following steps:

Step 1: Semantic Integration. Semint recommends the attribute correspondence as:

(Faculty.SS#, Student.Stud_ID, similarity = 0.98)

Step 2: Query Re-formulation. The "*" is replaced by the corresponding attribute (Student.Stud_ID)
found in the previous step. However, if multiple corresponding attributes are recommended by
Semint, one multidatabase query is generated for each corresponding attribute.

MULTISELECT *

FROM ROLE-SET(Faculty,Student)

USING GID (Faculty.SS# = Student.Stud_ID)

Step 3: Multidatabase Query Processing and Data Integration with Ranked Role Sets. Because Semint

only recommends one candidate GID, only one set of role sets is presented to the user. The result is

shown in Figure 10.


[ Faculty only role-set ]
SS#          Facu_Name   Salary
===========  ==========  ========
493-45-8735  John        $100,000
956-45-0456  Robert      $80,000
-------------------------------

[ Student only role-set ]
Stud_ID      Stud_Name   Stipend  Tel#
===========  ==========  =======  ========
476-34-5748  Jason       $11,000  674-7456
958-46-3256  Steven      $0       765-0945
---------------------------------------

[ Faculty and Student role-set ]
                       SS#          Salary
(Facu_Name=Stud_Name)  Stud_ID      Stipend  Tel#
=====================  ===========  =======  ========
Larry                  485-95-6784  $20,000
                       485-95-6784  $12,000  767-0900
--------------------------------------------------
Mary                   849-45-0500  $60,000
                       485-75-2374  $0       767-5134
--------------------------------------------------

Figure 11: Result of Using (Facu_Name,Stud_Name) as GID with Degree of Confidence = 0.91

[ Faculty only role-set ]
SS#          Facu_Name   Salary
===========  ==========  ========
493-45-8735  John        $100,000
956-45-0456  Robert      $80,000
849-45-0500  Mary        $60,000
485-95-6784  Larry       $20,000
-------------------------------

[ Student only role-set ]
Stud_ID      Stud_Name   Stipend  Tel#
===========  ==========  =======  ========
476-34-5748  Jason       $11,000  674-7456
958-46-3256  Steven      $0       765-0945
485-75-2374  Mary        $0       767-5134
485-95-6784  Larry       $12,000  767-0900
----------------------------------------

[ Faculty and Student role-set ]
                SS#         Facu_Name
Salary=Stipend  Stud_ID     Stud_Name  Tel#
==============  ==========  =========  ========
(no tuples)
--------------------------------------------

Figure 12: Result of Using (Salary,Stipend) as GID with Degree of Confidence = 0.85


6.3 Nothing is known about the Global ID

The query in the previous section would not give a correct result if there were no attribute in Student
corresponding to Faculty.SS#. Other attributes such as Name may also be eligible as a global ID. When
nothing is known about the global ID, the attributes of both relations are left unspecified. The query
is shown below:

MULTISELECT *

FROM ROLE-SET(Faculty,Student)

USING GID (Faculty.* = Student.* WITH SIMILARITY > 0.8 UP TO 3 SETS)

The clause "USING GID (Faculty.* = Student.* WITH SIMILARITY > 0.8 UP TO 3 SETS)" specifies
that the Semantic Integration Process should use the top 3 sets of corresponding attributes with similarity
greater than 0.8 as candidate global IDs. The query is processed in the following steps:

Step 1: Semantic Integration. Semint recommends the following attribute correspondences:

(Faculty.SS#, Student.Stud_ID, similarity = 0.98)

(Faculty.Facu_Name, Student.Stud_Name, similarity = 0.91)

(Faculty.Salary, Student.Stipend, similarity = 0.85)

Step 2: Query Re-formulation. The "*"s are replaced by the attribute correspondences generated in
Step 1. Semint recommends three pairs of corresponding attributes; therefore, three multidatabase
queries are generated as follows:

MULTISELECT *

FROM ROLE-SET(Faculty,Student)

USING GID (Faculty.SS# = Student.Stud_ID)

MULTISELECT *

FROM ROLE-SET(Faculty,Student)

USING GID (Faculty.Facu_Name = Student.Stud_Name)

MULTISELECT *

FROM ROLE-SET(Faculty,Student)

USING GID (Faculty.Salary = Student.Stipend)

Step 3: Multidatabase Query Processing and Data Integration with Ranked Role Sets. The query results
for the three candidate GIDs, three sets of role sets (shown in Figures 10-12), are then presented to
the user as "ranked role sets". The degree of confidence of a role set is based on the similarity of
the corresponding attributes used as GID.

In this example, the Data Integration Process generates three sets of results (role-sets) that comprise the
possible answers to the above query: the result using (SS#,Stud_ID) as GID with Confidence = 0.98, the
result using (Facu_Name,Stud_Name) as GID with Confidence = 0.91, and the result using (Salary,Stipend)
as GID with Confidence = 0.85. We can see that the result using Salary and Stipend as the global ID is
clearly incorrect, as no tuples are in the Faculty/Student role-set. We can also tell that Mary is probably not
a student instructor because she has two different SSNs and her salary is much higher than we would expect
for a student instructor. Therefore (SS#,Stud_ID) is the more likely global ID (the fact that it yields only
tuples where both the name and the ID agree supports this conjecture).
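The three ranked role sets can be reproduced from the data in Figure 9 with a short Python sketch. The function name `overlap` and the list-of-dictionaries layout are assumptions made for illustration (the Tel# attribute is omitted for brevity); only the overlap part of each role set is computed.

```python
FACULTY = [
    {"SS#": "493-45-8735", "Facu_Name": "John", "Salary": 100_000},
    {"SS#": "956-45-0456", "Facu_Name": "Robert", "Salary": 80_000},
    {"SS#": "849-45-0500", "Facu_Name": "Mary", "Salary": 60_000},
    {"SS#": "485-95-6784", "Facu_Name": "Larry", "Salary": 20_000},
]
STUDENT = [
    {"Stud_ID": "476-34-5748", "Stud_Name": "Jason", "Stipend": 11_000},
    {"Stud_ID": "958-46-3256", "Stud_Name": "Steven", "Stipend": 0},
    {"Stud_ID": "485-75-2374", "Stud_Name": "Mary", "Stipend": 0},
    {"Stud_ID": "485-95-6784", "Stud_Name": "Larry", "Stipend": 12_000},
]
# Attribute correspondences recommended by Semint in Step 1.
CANDIDATES = [
    ("SS#", "Stud_ID", 0.98),
    ("Facu_Name", "Stud_Name", 0.91),
    ("Salary", "Stipend", 0.85),
]


def overlap(fac_attr, stud_attr):
    """GID values that occur in both roles under the given candidate GID."""
    faculty_values = {row[fac_attr] for row in FACULTY}
    student_values = {row[stud_attr] for row in STUDENT}
    return sorted(faculty_values & student_values)


# One entry per candidate GID, ordered by degree of confidence.
ranked = [
    (confidence, (fac_attr, stud_attr), overlap(fac_attr, stud_attr))
    for fac_attr, stud_attr, confidence
    in sorted(CANDIDATES, key=lambda c: c[2], reverse=True)
]
```

The empty overlap for (Salary, Stipend) corresponds to the empty Faculty/Student role-set in Figure 12, and the extra "Mary" entry under (Facu_Name, Stud_Name) is the match discussed above.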

7 Scalability and extensions

7.1 Neural Network Scalability

We have implemented and built neural networks to recognize thousands of attributes in very large databases.
The techniques we use to support the scalability of neural networks are as follows:

• Classification: The classification process is very efficient, requiring less than a second for a thousand
attributes (on a moderately powerful workstation). We use a classifier to categorize similar attributes
into clusters and use cluster centers rather than attributes to train neural networks. This can reduce
the training time by around 80% in our experiments. The reason is that networks cannot be trained
with training data in which there are two different expected outputs for the same input pattern (two
different attributes with the same characteristics). This is also the reason why training takes much
longer when using many very similar attributes rather than a few well-separated clusters as training data.

• Neural networks are highly efficient at recognizing patterns; it is only training them to recognize
patterns that takes time. For example, on a SUN Sparc HS11, it takes 1 second to test a single
item against a 281 output node network (281 separate attributes in training). Much of this is setup
time; testing 50 items takes 1.9 seconds. Thus the only substantial computation requirement is
training. However, training needs to be done only once per database, and finding attribute
correspondences manually in a database of this size is an almost impossible task.

• We have also designed the neural network to improve training times. Some of the principles of our
neural network architecture design are as follows:

1. A three-layer neural network architecture (one input layer, one output layer, and one hidden layer
in the middle), which is capable of solving all non-linear problems yet limits the computation
time, which increases as more layers are added.

2. The hidden layer consists of (N+M)/2 nodes, where N is the number of nodes in the input
layer (number of discriminators) and M is the number of nodes in the output layer (number
of categories in the training data). The number of nodes in the hidden layer can be arbitrary;
however, (N+M)/2 nodes tended to give the shortest training time in our experiments.

We can further reduce the neural network training time by building several smaller neural networks
rather than one large neural network. For example, suppose we want to train neural networks to recognize
1000 distinct attributes with 20 discriminators; we can build 20 small neural networks (each of which
can recognize 50 distinct attributes). The number of connections within neural networks can be
substantially reduced from 20*(20+1000)/2*1000 (10.2 million) to 20*(20+50)/2*50*20 (0.7 million).
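The arithmetic behind these two figures can be checked with a few lines of Python. The function names are hypothetical; the sketch simply evaluates the expressions as written in the text, with the hidden layer sized at (N+M)/2.

```python
def hidden_nodes(n_inputs, n_outputs):
    """Hidden-layer size (N+M)/2 used in the architecture described above."""
    return (n_inputs + n_outputs) // 2


def connection_estimate(n_inputs, n_outputs, n_networks=1):
    """Evaluate N * (N+M)/2 * M per network, as the expression appears in the text."""
    return n_inputs * hidden_nodes(n_inputs, n_outputs) * n_outputs * n_networks


one_large = connection_estimate(20, 1000)                    # 20 * 510 * 1000
twenty_small = connection_estimate(20, 50, n_networks=20)    # 20 * 35 * 50 * 20
reduction = one_large / twenty_small
```

Under this estimate, splitting the task into 20 networks of 50 output nodes each reduces the count by a factor of roughly 14.6.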

7.2 Ranked Role-Set Scalability

The following techniques can be used to support the scalability of our approach in order to reduce the

potentially large number of role-sets that need to be presented to the users when many databases need to

be integrated.

1. The clauses "WITH SIMILARITY > threshold UP TO m SETS" can be used to reduce the number
of ranked role-sets generated. In Section 6.3, we specify the constraint as "WITH SIMILARITY > 0.8
UP TO 3 SETS". However, we can also specify it as "WITH SIMILARITY > 0.9 UP TO 3 SETS",
"WITH SIMILARITY > 0.8 UP TO 2 SETS", or "WITH SIMILARITY > 0.9 UP TO 2 SETS". In
all three cases, only (Faculty.SS#, Student.Stud_ID) and (Faculty.Facu_Name, Student.Stud_Name)
will be considered as possible GIDs.

2. The ranked role-sets can be presented to users in an interactive mode rather than in batch mode. In
the example shown in Section 6.3, three queries are executed at the same time and three role-sets
are presented to the user. This is batch mode. In interactive mode, only one query at a
time is executed (ordered by degree of confidence) and its role-set is presented to the user.
Therefore, the role-set using GID (Faculty.SS#, Student.Stud_ID) will be presented to the users first.
If this is the correct answer, no more queries need to be executed. Otherwise, the query using GID
(Faculty.Facu_Name, Student.Stud_Name) will be executed, and so on.

3. Users provide additional domain knowledge to eliminate some role-sets. The original query results
can be given to a filtering program before presentation to the user. For example, the user knows that
Larry is a student instructor and provides this information to the system. We can then use it as
a constraint to eliminate the role-set using GID (Faculty.Salary, Student.Stipend), since Larry is not
a student instructor under this GID. Users can also provide information such as the fact that a scholarship
cannot be greater than $40,000. This will eliminate the result using GID (Faculty.Facu_Name,
Student.Stud_Name).
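The interactive mode and the domain-knowledge filter from the list above can be sketched as follows. All names (`interactive`, `filter_role_sets`, `has_larry`, `pay_cap`) are hypothetical, the role sets are reduced to the (name, faculty salary) pairs of their overlap parts, and applying the $40,000 cap to the pay of the matched persons is our reading of the constraint, not a rule stated by the system.

```python
def interactive(candidates, run_query, accept):
    """Execute one query at a time, most confident candidate GID first.

    Stops at the first role set the user accepts and reports how many
    queries were executed.
    """
    executed = 0
    for gid, _confidence in sorted(candidates, key=lambda c: c[1], reverse=True):
        role_set = run_query(gid)
        executed += 1
        if accept(role_set):
            return gid, role_set, executed
    return None, None, executed


def filter_role_sets(results, constraints):
    """Batch mode: keep only role sets that satisfy every domain constraint."""
    return {gid: role_set for gid, role_set in results.items()
            if all(check(role_set) for check in constraints)}


def has_larry(role_set):
    return any(name == "Larry" for name, _ in role_set)


def pay_cap(role_set):
    return all(pay <= 40_000 for _, pay in role_set)


CANDIDATES = [(("Salary", "Stipend"), 0.85), (("SS#", "Stud_ID"), 0.98),
              (("Facu_Name", "Stud_Name"), 0.91)]
# Overlap parts of Figures 10-12, standing in for real query execution.
OVERLAP = {
    ("SS#", "Stud_ID"): [("Larry", 20_000)],
    ("Facu_Name", "Stud_Name"): [("Larry", 20_000), ("Mary", 60_000)],
    ("Salary", "Stipend"): [],
}

chosen, role_set, executed = interactive(CANDIDATES, OVERLAP.__getitem__, has_larry)
kept = filter_role_sets(OVERLAP, [has_larry, pay_cap])
```

In this run the first query already satisfies the user, so only one of the three queries is executed; in batch mode the two constraints leave only the role set based on (SS#, Stud_ID).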

7.3 Scenarios of Global ID Existence

What if we have more than two databases? For example, suppose we add a database STAFF to the

preceding example. Figure 13(A) shows the basic assumption of the role-set approach: a global ID is
available across all databases. In order to determine GID availability, databases are compared in a
pair-wise fashion. This can be done either manually or using Semint.

Figure 13(B) shows that there is a partial GID available since Faculty.SS# matches Student.Stud_ID and
Student.Stud_ID matches Staff.SSN. We know these relationships because of manual semantic integration
or Semint (use the Student database to train a neural network and then use the Faculty and Staff databases as
input). In this case, we know we can answer the query on tuples in area2. However, we cannot answer
the query on tuples in area1.

[Figure 13: Scenarios of GID Existence. Panels (A), (B), and (C) each show the STAFF, FACULTY, and STUDENT databases. Attribute labels: (A) SS#, SSN, Stud_ID; (B) SS#, Stud_ID, SSN; (C) SS#, Stud_ID, Tel#, Phone#. Region labels AREA 1 and AREA 2 appear in the figure.]

Now consider the following situations:

• The user asks a query based on tuples in area2: The system uses two partial global IDs
(Faculty.SS#, Student.Stud_ID) and (Student.Stud_ID, Staff.SSN) to answer the query without identifying
a partial global ID between the Faculty and Staff databases.

• The user asks a query based on tuples in area1: The system needs to identify a partial global ID
between Faculty and Staff to combine with the two partial global IDs (Faculty.SS#, Student.Stud_ID) and
(Student.Stud_ID, Staff.SSN) to answer the query. Semint will then use either Faculty or Staff to train
a neural network and then identify corresponding attributes between the Faculty and Staff databases. With
the assumption that similarity can be propagated, the facts that Faculty.SS# is similar to Student.Stud_ID
and Student.Stud_ID is similar to Staff.SSN suggest that Faculty.SS# is similar to Staff.SSN, so the scenario
in Figure 13(B) can be treated as the scenario in Figure 13(A).

Figure 13(C) shows another scenario, in which we know there is a partial GID available since Faculty.SS#
matches Student.Stud_ID and Student.Tel# matches Staff.Phone# (there is no ID in the Staff database).
In this scenario we should combine the two partial global IDs (Faculty.SS#, Student.Stud_ID) and
(Student.Tel#, Staff.Phone#) as a GID. Whether or not the system needs to identify the corresponding
attributes between Faculty and Staff as a partial GID depends on what the query asks for (as discussed
above).
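The difference between Figures 13(B) and 13(C) can be made concrete with a small Python sketch that groups attributes linked by correspondences, under the assumption stated above that similarity can be propagated. The functions `gid_components` and `has_full_gid` are hypothetical, and the union-find grouping is our choice of technique, not something the paper prescribes.

```python
from collections import defaultdict


def gid_components(correspondences):
    """Group attributes connected by correspondences (similarity assumed transitive).

    correspondences: iterable of ((database, attribute), (database, attribute)).
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]   # path halving
            node = parent[node]
        return node

    for left, right in correspondences:
        parent[find(left)] = find(right)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def has_full_gid(correspondences, databases):
    """True if one chain of corresponding attributes spans every database."""
    return any({db for db, _ in group} == set(databases)
               for group in gid_components(correspondences))


DATABASES = ["Faculty", "Student", "Staff"]
FIG_13B = [(("Faculty", "SS#"), ("Student", "Stud_ID")),
           (("Student", "Stud_ID"), ("Staff", "SSN"))]
FIG_13C = [(("Faculty", "SS#"), ("Student", "Stud_ID")),
           (("Student", "Tel#"), ("Staff", "Phone#"))]

full_b = has_full_gid(FIG_13B, DATABASES)
full_c = has_full_gid(FIG_13C, DATABASES)
```

For Figure 13(B) a single chain covers all three databases, so it can be treated like Figure 13(A); for Figure 13(C) no such chain exists, and the two partial global IDs have to be combined.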

8 Conclusion and Future Work

A major problem in heterogeneous databases is determining how to handle information from different
databases that refers to the same real-world entity. Performing this mapping before the combined
information is needed is a difficult task, and maintaining this integrated schema may not be worthwhile,
particularly if queries on multiple databases are infrequent. Multidatabase query languages allow this
mapping to be specified as part of the query, by providing functions to manipulate different data
representations and merge the results from local databases. However, this still requires the user to determine the
needed mappings in advance, even if such mappings do not need to be part of the heterogeneous database
system. In this paper we present an approach to dynamic data integration in query processing with ranked
role sets in which the existence of a global ID is not assumed.


Our approach is dynamic integration: allowing the mappings that combine information from different
databases describing the same real-world entity to be determined after the query is issued. We have
presented a method that uses ranked role-sets to present query results to the user based on likely attribute
correspondences between the databases. The user is presented with multiple query results ranked by
degree of confidence and only needs to check and confirm the results. We also discuss how a domain
knowledge-based filtering program can further assist the user. Our approach can be considered semantic
integration using a query language. However, if attribute correspondences are available or semantic
integration has been done before, our approach can be considered a multidatabase system based on role-sets,
since semantic integration (using Semint) only needs to be done once.

Future work includes developing an automated technique to identify partial global IDs and combine
them as a GID as discussed in Section 7.3.

References

[AKWS95] Shailesh Agarwal, Arthur M. Keller, Gio Wiederhold, and Krishna Saraswat. Flexible relation:
An approach for integrating data from multiple, possibly inconsistent databases. In Proceedings
of the 11th International Conference on Data Engineering, pages 495-504, Taipei, Taiwan,
March 1995. IEEE.

[BHP94] M. W. Bright, A. R. Hurson, and S. Pakzad. Automated resolution of semantic heterogeneity
in multidatabases. ACM Transactions on Database Systems, 19(2):212-253, June 1994.

[GMS94] Cheng Hian Goh, Stuart E. Madnick, and Michael D. Siegel. Context interchange:
Overcoming the challenges of large-scale interoperable database system. In Proceedings of the 3rd
International Conference on Information and Knowledge Management, pages 337-346. ACM,
November 1994.

[HDG84] H.Y. Hwang, U. Dayal, and M. Gouda. Using semiouterjoins to process queries in multidatabase
systems. In Proceedings of the Third ACM SIGACT-SIGMOD Symposium on Principles of
Database Systems, pages 153-162. ACM, April 1984.

[LAZ+89] W. Litwin, A. Abdellatif, A. Zeroual, B. Nicolas, and P. Vigier. MSQL: A multidatabase
language. Information Sciences, 49:59-101, 1989.

[LC94] Wen-Syan Li and Chris Clifton. Semantic integration in heterogeneous databases using neural
networks. In Proceedings of the 20th International Conference on Very Large Data Bases, pages
1-12, Santiago, Chile, September 12-15 1994. VLDB.

[LC95] Wen-Syan Li and Chris Clifton. Semint: A system prototype for semantic integration in
heterogeneous databases. In Proceedings of the 1995 ACM SIGMOD Conference, San Jose,
California, May 23-25 1995.


[LMR90] W. Litwin, L. Mark, and N. Roussopoulos. Interoperability of multiple autonomous databases.
ACM Computing Surveys, 22(3):267-293, 1990.

[MH80] Dennis McLeod and Dennis Heimbigner. A federated architecture for database systems. In
Proceedings of the National Computer Conference, pages 283-289, Anaheim, CA, May 1980.
AFIPS.

[ON94] Aris M. Ouksel and C. F. Naiman. Coordinating context building in heterogeneous information
systems. Journal of Intelligent Information Systems, 3:151-183, 1994.

[PB94] William J. Premerlani and Michael R. Blaha. An approach for reverse engineering of relational
databases. Communications of the ACM, 37(5):42-49, May 1994.

[PS95] Christine Parent and Stefano Spaccapietra. Database integration: An overview of issues and
approaches. Submitted to Communications of the ACM, 1995.

[RR84] A. Rosenthal and D. Reiner. Extending the algebraic framework of query processing to handle
outerjoins. In Proceedings of 10th International Conference on Very Large Data Bases, pages
334-343, August 1984.

[SC94] Peter Scheuermann and Eugene I. Chong. Role-based query processing in multidatabase
systems. In Proceedings of the International Conference on Extending Database Technology, pages
95-108, March 1994.

[SK92] Amit Sheth and Vipul Kashyap. So far (schematically) yet so near (semantically). In
Proceedings of the IFIP TC2/WG2.6 Conference on Semantics of Interoperable Database Systems,
Victoria, Australia, November 1992.

[SL90] Amit Sheth and James Larson. Federated database systems for managing distributed,
heterogeneous, and autonomous databases. ACM Computing Surveys, 22(3):183-236, September
1990.

[ZSC95] J. Leon Zhao, Arie Segev, and Abhirup Chatterjee. A universal relation approach to
federated database management. In Proceedings of the 11th International Conference on Data
Engineering, pages 261-270, Taipei, Taiwan, March 1995. IEEE.
