
The VLDB Journal (2012) 21:191–211. DOI 10.1007/s00778-012-0264-z

SPECIAL ISSUE PAPER

MapMerge: correlating independent schema mappings

Bogdan Alexe · Mauricio Hernández · Lucian Popa · Wang-Chiew Tan

Received: 22 February 2011 / Revised: 17 November 2011 / Accepted: 17 January 2012 / Published online: 18 February 2012
© Springer-Verlag 2012

Abstract One of the main steps toward integration or exchange of data is to design the mappings that describe the (often complex) relationships between the source schemas or formats and the desired target schema. In this paper, we introduce a new operator, called MapMerge, that can be used to correlate multiple, independently designed schema mappings of smaller scope into larger schema mappings. This allows a more modular construction of complex mappings from various types of smaller mappings such as schema correspondences produced by a schema matcher or pre-existing mappings that were designed by either a human user or via mapping tools. In particular, the new operator also enables a new "divide-and-merge" paradigm for mapping creation, where the design is divided (on purpose) into smaller components that are easier to create and understand and where MapMerge is used to automatically generate a meaningful overall mapping. We describe our MapMerge algorithm and demonstrate the feasibility of our implementation on several real and synthetic mapping scenarios. In our experiments, we make use of a novel similarity measure between two database instances with different schemas that quantifies the preservation of data associations. We show experimentally that MapMerge improves the quality of the schema mappings, by significantly increasing the similarity between the input source instance and the generated target instance. Finally, we provide a new algorithm that combines MapMerge with schema mapping composition to correlate flows of schema mappings.

B. Alexe (B) · M. Hernández · L. Popa · W.-C. Tan
IBM Research-Almaden, San Jose, CA, USA
e-mail: [email protected]

M. Hernández
e-mail: [email protected]

L. Popa
e-mail: [email protected]

W.-C. Tan
e-mail: [email protected]

W.-C. Tan
UC Santa Cruz, Santa Cruz, CA, USA

Keywords Schema mappings · Data exchange · Data integration

1 Introduction

Schema mappings are essential building blocks for information integration. One of the main steps in the integration or exchange of data is to design the mappings that describe the desired relationships between the various source schemas or source formats and the target schema. Once the mappings are established, they can be used either to support query answering on the (virtual) target schema, a process that is traditionally called data integration [14], or to physically transform the source data into the target format, a process referred to as data exchange [8]. In this paper, we focus on the data exchange aspect, although our mapping generation methods will be equally applicable to derive mappings for data integration.

Many commercial data transformation systems such as Altova Mapforce¹ and Stylus Studio², to name a few, as well as research prototypes such as Clio [7] or HePToX [4], include mapping design tools that can be used by a human user to derive the data transformation program between a source schema and a target schema. Most of these tools work in two steps.

¹ http://www.altova.com.
² http://www.stylusstudio.com.

Fig. 1 a A transformation flow from S1 to S3. b Schema mappings from S1 to S2. c Output of MapMerge

First, a visual interface is used to solicit all known correspondences of elements between the two schemas from the mapping designer. Such correspondences are usually depicted as arrows between the attributes of the source and target schemas (see, e.g., Fig. 1a). Sometimes, a schema matching module [20] is used to suggest or derive correspondences. Once the correspondences are established, the system interprets them into an executable script, such as an XQuery or SQL query, which can then transform an instance of the source schema into an instance of the target schema. The generated transformation script is usually close to the desired specification. In most cases, however, portions of the script still need to be refined to obtain the desired specification.

We note that for both Clio and HePToX, the correspondences are first compiled into internal schema mapping assertions, or schema mappings in short, which are high-level, declarative, constraint-like statements [13]. These schema mappings are then compiled into the executable script. One advantage of using schema mappings as an intermediate form is that they are more amenable to the formal study of data exchange and data integration [13], as well as to optimization and automatic reasoning. In fact, the main technique that we introduce in this paper represents a form of automatic reasoning on top of schema mappings.

An important drawback of the previously outlined two-step schema mapping design paradigm is that it is hard to retro-fit any pre-existing or user-customized mappings back into the mapping tool, since the mapping tool is based on correspondences. Thus, if the mapping tool is restarted, it will regenerate the same fixed transformation script based on the input correspondences, even though some portions of the transformation task may have already been refined (customized) by the user or may already exist (as part of a previous transformation task, for example).

In this paper, we propose a radically different approach to the design of schema mappings, where the mapping tool can take as input arbitrary mapping assertions and not just correspondences. This allows for the modular construction of complex and larger mappings from various "smaller" mappings that include schema correspondences and also arbitrary pre-existing or customized mappings. An essential ingredient of this approach is a new operator on schema mappings that we call MapMerge and that can be used to automatically correlate the input mappings in a meaningful way.

1.1 Motivating example and overview

To illustrate the ideas, consider first a mapping scenario between the schemas S1 and S2 shown in the left part of Fig. 1a. The goal is data restructuring from two source relations, Group and Works, to three target relations, Emp, Dept, and Proj. In this example, Group (similar to Dept) represents groups of scientists sharing a common area (e.g., a database group, a CS group). The dotted arrows represent foreign key constraints in the schemas.

Independent mappings. Assume the existence of the following (independent) schema mappings from S1 to S2. The first mapping is the constraint t1 in Fig. 1b and corresponds to the arrow t1 in Fig. 1a. This constraint requires every tuple in Group to be mapped to a tuple in Dept such that the group name (gname) becomes department name (dname). The second mapping is more complex and corresponds to the group of arrows t2 in Fig. 1a. This constraint involves a custom filter condition; every pair of joining tuples of Works and Group for which the addr value is "NY" must be mapped into two tuples of Emp and Dept, sharing the same did value, and with corresponding ename, addr and dname values. (Note that did is a target-specific field that must exist and plays the role of key/foreign key.) Intuitively, t2 illustrates a pre-existing mapping that a user may have spent time in the past to create. Finally, the third constraint in Fig. 1b corresponds to the arrow t3 and maps pname from Works to Proj. This is an example of a correspondence that is introduced by a user after loading t1 and the pre-existing mapping t2 into the mapping tool.

The goal of the system is now to (re)generate a "good" overall schema mapping from S1 to S2 based on its input mappings. We note first that the input mappings, when considered in isolation, do not generate an ideal target instance.

Indeed, consider the source instance I in Fig. 2. The target instance that is obtained by minimally enforcing the constraints {t1, t2, t3} is the instance J1, also shown in the figure.

The first Dept tuple is obtained by applying t1 on the Group tuple (123, CS). There, D1 represents some did value that must be associated with CS in this tuple. Similarly, the Proj tuple with some unspecified value B for budget and a did value of D3 is obtained via t3. The Emp tuple together with the second Dept tuple are obtained based on t2. As required by t2, these tuples are linked via the same did value D2. Finally, to obtain a target instance that satisfies all the foreign key constraints, we must also have a third tuple in Dept that includes D3 together with some unspecified department name N.

Since the three mapping constraints are not correlated, the three did values (D1, D2, and D3) are distinct. (There is no requirement that they must be equal.) As a result, the target instance J1 exhibits the typical problems that arise when uncorrelated mappings are used to transform data: (1) duplication of data (e.g., multiple Dept tuples for CS with different did values), and (2) loss of associations, where tuples are not linked correctly to each other (e.g., we have lost the association between project name Web and department name CS that existed in the source).

Correlated mappings via MapMerge. Consider now the schema mappings that are shown in Fig. 1c and that are the result of MapMerge applied on {t1, t2, t3}. The notable difference from the input mappings is that all mappings consistently use the same expression, namely the Skolem term F[g], where g denotes a distinct Group tuple, to give values for the did field. The first mapping is the same as t1 but makes explicit the fact that did is F[g]. This mapping creates a unique Dept tuple for each distinct Group tuple. The second mapping is (almost) like t2 with the additional use of the same Skolem term F[g]. Moreover, it also drops the existence requirement for Dept (since this is now implied by the first mapping). Finally, the third mapping differs from t3 by incorporating a join with Group before it can actually use the Skolem term F[g]. Furthermore, it inherits the filter on the addr field, which applies to all such Works tuples according to t2. As an additional artifact of MapMerge, which we explain later, it also includes a Skolem term H1[w] that assigns values for budget. The target instance that is obtained by applying the result of MapMerge is the instance J2 shown in Fig. 2. The data associations that exist in the source are now correctly preserved in the target.

Fig. 2 An instance of S1 and two instances of S2

For example, Web is linked to the CS tuple (via D), and also John is linked to the CS tuple (via the same D). Furthermore, there is no duplication of Dept tuples.

Flows of mappings. Taking the idea of mapping reuse and modularity one step further, an even more compelling use case for MapMerge in conjunction with mapping composition [9,15,18] is the flow-of-mappings scenario [1]. The key idea here is that to produce a data transformation from the source to the target, one can decompose the process into several simpler stages, where each stage maps from or into some intermediate, possibly simpler schema. Moreover, the simpler mappings and schemas play the role of reusable components that can be applied to build other flows. Such abstraction is directly motivated by the development of real-life, large-scale ETL flows such as those typically developed with IBM Information Server (Datastage), Oracle Warehouse Builder, and others.

To illustrate, suppose the goal is to transform data from the schema S1 of Fig. 1a to a new schema S4, where Staff and Projects information are grouped under CompSci. The mapping or ETL designer may find it easier to first construct the mapping between S1 and S2 (it may also be that this mapping may have been derived in a prior design). Furthermore, the schema S2 is a normalized representation of the data, where Dept, Emp, and Proj correspond directly to the main concepts (or types of data) that are being manipulated. Based on this schema, the designer can then produce a mapping mCS from Dept to a schema S3 containing a more specialized object CSDept, by applying some customized filter condition (e.g., based on the name of the department). The next step is to create the mapping m from CSDept to the target schema S4. Other independent mappings are similarly defined for Emp and Proj (see m1 and m2).

Once these individual mappings are established, the same problem of correlating the mappings arises. In particular, one has to correlate mCS ◦ m, which is the result of applying mapping composition to mCS and m, with the mappings m1 for Emp and m2 for Proj. This correlation will ensure that all employees and projects of computer science departments will be correctly mapped under their correct departments, in the target schema.

In this example, composition itself gives another source of mappings to be correlated by MapMerge. While similar to composition in that it is an operator on schema mappings, MapMerge is fundamentally different in that it correlates mappings that share the same source schema and the same target schema. In contrast, composition takes two sequential mappings where the target of the first mapping is the source of the second mapping. Nevertheless, the two operators are complementary, and together, they can play a fundamental role in building data flows. In Sect. 6, we will give an algorithm that can be used to correlate flows of schema mappings.

1.2 Contributions and outline of the paper

Our main technical contributions are as follows. We give an algorithm for MapMerge, which takes as input arbitrary schema mappings expressed as second-order tgds [9] and generates correlated second-order tgds. As a particularly important case, MapMerge can also take as input a set of raw schema correspondences; thus, it constitutes a replacement of existing mapping generation algorithms that are used in Clio [11,19]. Note that in this paper, as explained in later sections, we consider a slightly different treatment of user-defined filters in MapMerge with respect to the conference version [2] of our work, allowing for more consolidation of schema mappings through further correlations. We introduce a novel similarity measure that is used to quantify the preservation of data associations from a source database to a target database. We use this measure to show experimentally that MapMerge improves the quality of schema mappings. In particular, we show that the target data that is produced based on the outcome of MapMerge has better quality, in terms of preservation of source associations, than the target data that is produced based on Clio-generated mappings.

Furthermore, we provide an algorithm, which did not appear in the conference version of this paper, that combines MapMerge with the mapping composition algorithm in [9] to correlate flows of schema mappings. The resulting algorithm makes uniform use of the language of SO tgds. One of the main features of the language of SO tgds is the presence of Skolem functions, which is the key ingredient enabling mapping correlation via MapMerge. In addition, as shown in [9], SO tgds are closed under composition, while first-order tgds are not. In other words, Skolem functions are necessary to express composition. Hence, the language of SO tgds is ideally suited for both MapMerge and mapping composition.

Outline. In the next section, we provide some preliminaries on schema mappings and their semantics. In Sect. 3, we give the main intuition behind MapMerge, while in Sect. 4, we describe the algorithm. In Sect. 5, we introduce the similarity measure that quantifies the preservation of associations. We make use of this measure to evaluate the performance of MapMerge on real-life and synthetic mapping scenarios. We introduce and illustrate the mapping flow correlation algorithm in Sect. 6. We discuss related work in Sect. 7 and conclude in Sect. 8.

2 Preliminaries

A schema consists of a set of relation symbols, each with an associated set of attributes. Moreover, each schema can have a set of inclusion dependencies modeling foreign key constraints. While we restrict our presentation to the relational case, all our techniques are applicable and implemented in the more general case of the nested relational data model used in [19], where the schemas and mappings can be either relational or XML.

Schema mappings. A schema mapping is a triple (S, T, Σ), where S is a source schema, T is a target schema, and Σ is a set of second-order tuple-generating dependencies (SO tgds) [9]. In this paper, we use the notation

for x in S
satisfying B1(x)
exists y in T
where B2(y) and C(x, y)

for expressing SO tgds. Examples of SO tgds in this notation were already given in Fig. 1b and c. Here, it suffices to say that S represents a vector of source relation symbols (possibly repeated), while x represents the tuple variables that are bound, correspondingly, to these relations. A similar notation applies for the exists clause. The conditions B1(x) and B2(y) are conjunctions of equalities over the source and, respectively, target variables. Note that these conditions may equate variables with constants, allowing the definition of user-defined filters. The condition C(x, y) is a conjunction of equalities that equate target expressions (e.g., y.A) with either source expressions (e.g., x.B) or Skolem terms of the form F[x1, . . . , xi], where F is a function symbol and x1, . . . , xi are a subset of the source variables. Skolem terms are used to relate target expressions across different SO tgds. An SO tgd without a Skolem term may also be called, simply, a tuple-generating dependency or tgd [8].

Note that our SO tgds do not allow equalities between or with Skolem terms in the satisfying clause. While such equalities may be needed for more general purposes [9], they do not play a role for data exchange and can be eliminated, as observed in [25].
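To make the notation concrete, the following Python sketch shows one possible in-memory representation of SO tgds in the for/satisfying/exists/where form; the class and field names are ours and are not part of Clio or of the paper's implementation.

from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str        # tuple variable, e.g., "g"
    relation: str    # relation it ranges over, e.g., "Group"

@dataclass(frozen=True)
class SkolemTerm:
    function: str    # function symbol, e.g., "F"
    args: tuple      # source variables it depends on, e.g., ("g",)

@dataclass
class SOTgd:
    for_clause: list      # Var bindings over source relations
    satisfying: list      # equalities over source expressions (may involve constants)
    exists_clause: list   # Var bindings over target relations
    where: list           # pairs: target expression = source expression or Skolem term

# The first mapping of Fig. 1c in this representation:
# for g in Group exists d in Dept where d.did = F[g] and d.dname = g.gname
m1 = SOTgd(
    for_clause=[Var("g", "Group")],
    satisfying=[],
    exists_clause=[Var("d", "Dept")],
    where=[("d.did", SkolemTerm("F", ("g",))),
           ("d.dname", "g.gname")],
)
print(m1)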

Chase-based semantics. The semantics that we adopt for a schema mapping (S, T, Σ) is the standard data exchange semantics [8] where, given a source instance I, the result of "executing" the mapping is the target instance J that is obtained by chasing I with the dependencies in Σ. Since the dependencies in Σ are SO tgds, we actually use an extension of the chase as defined in [9].

Intuitively, the chase provides a way of populating the target instance J in a minimal way, by adding the tuples that are required by Σ. For every instantiation of the for clause of a dependency in Σ such that the satisfying clause is satisfied but the exists and where clauses are not, the chase adds corresponding tuples to the target relations. Fresh new values (also called labeled nulls) are used to give values for the target attributes for which the dependency does not provide a source expression. Additionally, Skolem terms are instantiated by nulls in a consistent way: a term F[x1, . . . , xi] is replaced by the same null every time x1, . . . , xi are instantiated with the same source tuples. Finally, to obtain a valid target instance, we must chase (if needed) with the target constraints.
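The consistent instantiation of Skolem terms can be pictured as a memoized lookup table per function symbol. The following toy Python sketch (ours, not Clio's chase implementation) returns the same labeled null whenever a Skolem term is instantiated with the same source tuples, and a fresh null otherwise.

import itertools

_fresh = itertools.count(1)

class SkolemTable:
    """One lookup table per Skolem function symbol: same arguments, same null."""
    def __init__(self, name):
        self.name = name
        self.table = {}

    def __call__(self, *source_tuples):
        key = tuple(source_tuples)                  # instantiation of x1, ..., xi
        if key not in self.table:
            self.table[key] = "N%d" % next(_fresh)  # fresh labeled null
        return self.table[key]

F = SkolemTable("F")
group_tuple = ("123", "CS")
print(F(group_tuple))      # a fresh null, e.g., N1
print(F(group_tuple))      # the same null again
print(F(("456", "EE")))    # a different null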

For our earlier example, the target instance J1 is the result of chasing the source instance I with the tgds in Fig. 1b and, additionally, with the foreign key constraints. There, the values D1, D2, D3 are nulls that are generated to fill did values for which the tgds do not provide a source expression. The target instance J2 is the result of chasing I with the SO tgds in Fig. 1c. There, D is a null that corresponds to the Skolem term F[g], where g is instantiated with the sole tuple of Group.

In practice, mapping tools such as Clio do not necessarily implement the chase with Σ, but generate queries to achieve a similar result [11,19].

3 Correlating mappings: key ideas

How do we achieve the systematic and, moreover, correct construction of correlated mappings? After all, we do not want arbitrary correlations between mappings, but rather only to the extent that the natural data associations in the source are preserved and no extra associations are introduced.

There are two key ideas behind MapMerge. The first idea is to exploit the structure and the constraints in the schemas in order to define what natural associations are (for the purpose of the algorithm). Two data elements are considered associated if they are in the same tuple or in two different tuples that are linked via constraints. This idea has been used before in Clio [19] and provides the first (conceptual) step toward MapMerge. For our example, the input mapping t3 in Fig. 1b is equivalent, in the presence of the source and target constraints, to the following enriched mapping:

t′3: for w in Works, g in Group
     satisfying w.gno = g.gno
     exists p in Proj, d in Dept
     where p.pname = w.pname and p.did = d.did

Intuitively, if we have a w tuple in Works, we also have a joining tuple g in Group, since gno is a foreign key from Works to Group. Similarly, a tuple p in Proj implies the existence of a joining tuple in Dept, since did is a foreign key from Proj to Dept.

Formally, the above rewriting from t3 to t′3 is captured by the well-known chase procedure [3,16]. The chase is a convenient tool to group together, syntactically, elements of the schema that are associated. The chase by itself, however, does not change the semantics of the mapping. In particular, the above t′3 does not include any additional mapping behavior from Group to Dept.

The second key idea behind MapMerge is that of reusing or borrowing mapping behavior from a more general mapping to a more specific one. This is a heuristic that changes the semantics of the entire schema mapping and produces an arguably better one, with consolidated semantics.

To illustrate, consider the first mapping constraint in Fig. 1c. This constraint (obtained by skolemizing the input t1) specifies a general mapping behavior from Group to Dept. In particular, it specifies how to create dname and did from the input record. On the other hand, the above t′3 can be seen as a more specific mapping from a subset of Group (i.e., those groups that have associated Works tuples) to a subset of Dept (i.e., those departments that have associated Proj tuples). At the same time, t′3 does not specify any concrete mapping for the dname and did fields of Dept. We can then borrow the mapping behavior that is already specified by the more general mapping. Thus, t′3 can be enriched to:

t′′3: for w in Works, g in Group
     satisfying w.gno = g.gno
     exists p in Proj, d in Dept
     where p.pname = w.pname and p.did = d.did
     and d.dname = g.gname
     and d.did = F[g] and p.did = F[g]

where two of the last three equalities represent the "borrowed" behavior, while the last equality is obtained automatically by transitivity. The other borrowed behavior that we will add to t′′3 is the user-defined filter on addr. This filter already applies, according to t2, to all tuples in Works that join with Group tuples and are mapped to Emp and Dept tuples. The resulting constraint t′′′3 has the following form:

t′′′3: for w in Works, g in Group
      satisfying w.gno = g.gno and w.addr = "NY"
      exists p in Proj, d in Dept
      where p.pname = w.pname and p.did = d.did
      and d.dname = g.gname
      and d.did = F[g] and p.did = F[g]

Finally, we can drop the existence of d in Dept with the two conditions for dname and did, since this is repeated behavior that is already captured by the more general mapping from Group to Dept. The resulting constraint is identical³ to the third constraint in Fig. 1c, now correlated with the first one via F[g]. A similar explanation applies for the second constraint in Fig. 1c.

The actual MapMerge algorithm is more complex than intuitively suggested above and is described in detail in the next section.

³ Modulo the absence of H1[w], which will be explained separately.

4 The MapMerge algorithm

MapMerge takes as input a set {(S, T, Σ1), . . ., (S, T, Σn)} of schema mappings over the same source and target schemas, which is equivalent to taking a single schema mapping (S, T, Σ1 ∪ . . . ∪ Σn) as input. The algorithm is divided into four phases, and an outline of it is presented in Fig. 3. The first phase decomposes each input mapping assertion into basic components that are, intuitively, easier to merge. In Phase 2, we apply the chase algorithm to compute associations (which we call tableaux), from the source and target schemas, as well as from the source and target assertions of the input mappings. The latter type of tableaux is necessary to support user-defined joins that may not follow foreign key constraints. By pairing source and target tableaux, we obtain all the possible skeletons of mappings. The actual work of constructing correlated mappings takes place in Phase 3, where for each skeleton, we take the union of all the basic components generated in Phase 1 that "match" the skeleton. Phase 4 is a simplification phase that also flags conflicts that may arise and that need to be addressed by the user. These conflicts occur when multiple mappings that map to the same portion of the target schema contribute with different, irreconcilable behaviors.
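The following Python sketch outlines this four-phase structure. The five helper functions are injected as parameters and merely stand in for the subroutines of Figs. 3–8; it is a structural outline under that assumption, not the paper's code, and it omits details such as the pruning of subsumed and implied mappings.

def map_merge(input_so_tgds, source_schema, target_schema, *,
              decompose, compute_tableaux, match, construct_so_tgd,
              eliminate_residuals):
    # Phase 1: split every input SO tgd into basic SO tgds
    basic = [b for sigma in input_so_tgds for b in decompose(sigma)]

    # Phase 2: chase schemas and mapping assertions into tableaux,
    # then pair every source tableau with every target tableau
    src_tabs, tgt_tabs = compute_tableaux(source_schema, target_schema,
                                          input_so_tgds)
    skeletons = [(s, t) for s in src_tabs for t in tgt_tabs]

    # Phase 3: conjoin, per skeleton, all basic SO tgds that match it
    merged, residuals = [], []
    for skel in skeletons:
        matching = [b for b in basic if match(b, skel)]
        if matching:
            tgd, res = construct_so_tgd(skel, matching)
            merged.append(tgd)
            residuals.extend(res)

    # Phase 4: eliminate residual equality constraints where possible,
    # and flag irreconcilable conflicts for the user
    return eliminate_residuals(merged, residuals)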

Properties of MapMerge. In the absence of conflicts, which may appear in Phase 4 of the algorithm and have to be solved by the user, the MapMerge operator is commutative, idempotent, and associative. Intuitively, this is because in Phase 3, the MapMerge algorithm considers the conjunction of all the basic components that match on each skeleton, and conjunction is associative, idempotent, and commutative. However, if conflict resolution is needed, MapMerge is not necessarily associative.

The number of constraints in the output of MapMerge is at most the number of skeletons. This worst case cannot be improved since, in general, every skeleton may lead to an output constraint. Moreover, the size of each constraint (number of equalities in the where clause) is linear in the size of the input mappings.

4.1 Phase 1: decompose into basic SO tgds

The first step of the algorithm decomposes each input SO tgd into a set of simpler SO tgds, called basic SO tgds, that have the same for and satisfying clause as the input SO tgd but have exactly one relation in the exists clause. Intuitively, we break the input mappings into atomic components that each specify mapping behavior for a single target relation. This step will subsequently allow us to merge mapping behaviors even when they come from different input SO tgds. The decomposition algorithm is given in Fig. 4.


Fig. 3 The MapMerge algorithm. The algorithm makes calls to several subroutines, which are listed separately, in the respective subsections

In addition to having a single relation in the target, each basic SO tgd gives a complete specification of all the attributes of the target relation. More precisely, each basic SO tgd has the form

Fig. 4 The Decompose algorithm used by MapMerge

for x in S satisfying B1(x)
exists y in T where ∧_{A ∈ Atts(y)} y.A = eA(x)

where the conjunction in the where clause contains one equality constraint for each attribute of the record y asserted in the target relation T. The expression eA(x) is either a Skolem term or a source expression (e.g., x.B). Part of the role of the decomposition phase is to assign a Skolem term to every target expression y.A for which the initial mapping does not equate it to a source expression.
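As a rough, runnable approximation of this phase (not the Decompose algorithm of Fig. 4 itself), the Python sketch below splits an input tgd into one basic SO tgd per target variable and assigns a Skolem term over the source variables to every unspecified target attribute. For brevity it ignores equalities between target expressions, which in the full algorithm make correlated attributes such as e.did and d.did share the same Skolem term.

TARGET_ATTRS = {"Dept": ["did", "dname"],
                "Emp": ["ename", "addr", "did"],
                "Proj": ["pname", "budget", "did"]}

def decompose(tgd, fresh_names):
    """One basic SO tgd per target variable; unspecified attributes get Skolem terms."""
    basics = []
    src_vars = tuple(v for v, _ in tgd["for"])
    for tvar, trel in tgd["exists"]:
        where = dict(tgd["where"].get(tvar, {}))            # equalities already given
        for attr in TARGET_ATTRS[trel]:
            if attr not in where:                           # complete the specification
                where[attr] = (next(fresh_names), src_vars) # a Skolem term
        basics.append({"for": tgd["for"], "satisfying": tgd["satisfying"],
                       "exists": [(tvar, trel)], "where": {tvar: where}})
    return basics

# t1 of Fig. 1b: for g in Group exists d in Dept where d.dname = g.gname
t1 = {"for": [("g", "Group")], "satisfying": [],
      "exists": [("d", "Dept")], "where": {"d": {"dname": "g.gname"}}}

for b in decompose(t1, fresh_names=iter(["F"])):
    print(b)    # d.did is now explicitly assigned the Skolem term ('F', ('g',))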

For our example, the decomposition algorithm obtains the following basic SO tgds from the input mappings t1, t2, and t3 of Fig. 1b:

(b1): for g in Group
      exists d in Dept
      where d.did = F[g] and d.dname = g.gname

(b2): for w in Works, g in Group
      satisfying w.gno = g.gno and w.addr = "NY"
      exists e in Emp
      where e.ename = w.ename and e.addr = w.addr
      and e.did = G[w, g]

(b′2): for w in Works, g in Group
      satisfying w.gno = g.gno and w.addr = "NY"
      exists d in Dept
      where d.did = G[w, g] and d.dname = g.gname

(b3): for w in Works
      exists p in Proj
      where p.pname = w.pname and p.budget = H1[w]
      and p.did = H2[w]

The basic SO tgd b1 is obtained from t1; the main difference is that d.did, whose value was unspecified by t1, is now explicitly assigned the Skolem term F[g]. The only argument to F is g because g is the only record variable that occurs in the for clause of t1. Similarly, the basic SO tgd b3 is obtained from t3, with the difference being that p.budget and p.did are now explicitly assigned the Skolem terms H1[w] and, respectively, H2[w].

In the case of t2, we note that we have two existentially quantified variables, one for Emp and one for Dept. Hence, the decomposition algorithm generates two basic SO tgds: the first one maps into Emp and the second one maps into Dept. Observe that b2 and b′2 are correlated and share a common Skolem term G[w, g] that is assigned to both e.did and d.did. Thus, the association between e.did and d.did in the original schema mapping t2 is maintained in the basic SO tgds b2 and b′2.

In general, the decomposition process ensures that associations between target facts that are asserted by the original schema mapping are not lost. The process is similar to the Skolemization procedure that transforms first-order tgds with existentially quantified variables into second-order tgds with Skolem functions (see [9]). After such Skolemization, all the target relations can be separated since they are correlated via Skolem functions. Therefore, the set of basic SO tgds that results after decomposition is equivalent to the input set of mappings.

4.2 Phase 2: compute skeletons of schema mappings

Next we apply the chase algorithm to compute syntactic associations (which we call tableaux), from each of the schemas and from the input mappings. Essentially, a schema tableau is constructed by taking each relation symbol in the schema and chasing it with all the referential constraints that apply. The result of such a chase is a tableau that incorporates a set of relations that is closed under referential constraints, together with the join conditions that relate those relations. For each relation symbol in the schema, there is one schema tableau. As in [11,19], in order to guarantee termination, we stop the chase whenever we encounter cycles in the referential constraints. In our example, there are two source schema tableaux and three target schema tableaux, as follows:

T1 = { g ∈ Group }
T2 = { w ∈ Works, g ∈ Group; w.gno = g.gno }
T3 = { d ∈ Dept }
T4 = { e ∈ Emp, d ∈ Dept; e.did = d.did }
T5 = { p ∈ Proj, d ∈ Dept; p.did = d.did }

Intuitively, schema tableaux represent the categories of data that can exist according to the schema. A Group record can exist independently of records in other relations (hence, the tableau T1). However, the existence of a Works record implies that there must exist a corresponding Group record with identical gno (hence, the tableau T2).

Since the MapMerge algorithm takes as input arbitrary mapping assertions that may join relations following paths that do not follow key/foreign key constraints, we also need to generate user-defined mapping tableaux. These tableaux are obtained by chasing the source and target assertions of the input mappings with the referential constraints that are applicable from the schemas. The notion of user-defined tableaux is similar to the notion of user associations in [24]. However, user filters are not included in tableaux. They are handled separately in Phase 3 of the MapMerge algorithm. In our example, there are no user-defined tableaux, since all the joins are based on foreign key constraints.

Following the generation of tableaux, the algorithm continues by pairing every source tableau with every target tableau to form a skeleton. Each skeleton represents the empty shell of a candidate mapping. For our running example, the set of all skeletons at the end of Phase 2 is: {(T1, T3), (T1, T4), (T1, T5), (T2, T3), (T2, T4), (T2, T5)}.
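A compact Python sketch of this phase for the running example is shown below; it is our simplification, it only chases foreign key constraints (no user-defined mapping tableaux), and the relation and attribute names simply mirror the example.

# foreign keys as: (relation, attribute) -> (referenced relation, attribute)
SOURCE_FKS = {("Works", "gno"): ("Group", "gno")}
TARGET_FKS = {("Emp", "did"): ("Dept", "did"), ("Proj", "did"): ("Dept", "did")}

def tableau(relation, fks):
    """Close `relation` under the referential constraints in `fks`."""
    members, joins, frontier = [relation], [], [relation]
    while frontier:
        rel = frontier.pop()
        for (r, a), (r2, a2) in fks.items():
            if r == rel and r2 not in members:      # a real chase would also
                members.append(r2)                  # detect and stop on cycles
                joins.append(f"{r}.{a} = {r2}.{a2}")
                frontier.append(r2)
    return (tuple(members), tuple(joins))

source_tabs = [tableau(r, SOURCE_FKS) for r in ["Group", "Works"]]
target_tabs = [tableau(r, TARGET_FKS) for r in ["Dept", "Emp", "Proj"]]
skeletons = [(s, t) for s in source_tabs for t in target_tabs]

for tab in source_tabs + target_tabs:
    print(tab)            # e.g. (('Works', 'Group'), ('Works.gno = Group.gno',))
print(len(skeletons))     # 2 x 3 = 6 skeletons, as in the example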

4.3 Phase 3: match and apply basic SO tgds on skeletons

In this phase, for each skeleton, we first find the set of basic SO tgds that "match" the skeleton. Then, for each skeleton, we apply the basic SO tgds that were found matching and construct a merged SO tgd. The resulting SO tgd is, intuitively, the "conjunction" of all the basic SO tgds that were found matching.

Matching. We say that a basic SO tgd σ matches a skeleton (T, T′) if there is a pair (h, g) of homomorphisms that "embed" σ into (T, T′). This translates into two conditions. First, the for and satisfying clause of σ are embedded into T via the homomorphism h. This means that h maps the variables in the for clause of σ to variables of T such that relation symbols are respected and, moreover, the satisfying clause of σ (after applying h) is implied by the conditions of T. Additionally, the exists clause of σ must be embedded into T′ via the homomorphism g. Since σ is a basic SO tgd and there is only one relation in its exists clause, the latter condition essentially states that the target relation in σ must occur in T′. We remark here that user-defined filters are not considered for determining whether a basic SO tgd matches on a skeleton. This is justified because user filters do not appear in tableaux, and thus, tableaux cannot imply the conditions specified as user filters. However, we still want to be able to match basic SO tgds with user filters on skeletons. The user filters will still be used in the merging part of this MapMerge phase. We present the pseudocode of the matching algorithm in Fig. 5.
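The Python sketch below gives a deliberately simplified matching test (ours, not the Match algorithm of Fig. 5). It assumes each relation occurs at most once per tableau, so a homomorphism is determined by relation names, it checks implication of the satisfying clause by plain syntactic membership, and it ignores user-defined filters, which are kept for the merging step only.

def is_filter(eq):
    """Crude test: an equality against a quoted constant is a user filter."""
    return '"' in eq

def matches(basic, skeleton):
    (src_rels, src_joins), (tgt_rels, _) = skeleton
    # (1) every relation in the for clause must occur in the source tableau
    if not all(rel in src_rels for _, rel in basic["for"]):
        return False
    # (2) every non-filter condition must already be implied by the tableau
    joins = {j.replace(" ", "") for j in src_joins}
    for eq in basic["satisfying"]:
        if not is_filter(eq) and eq.replace(" ", "") not in joins:
            return False
    # (3) the single target relation must occur in the target tableau
    return basic["exists"][0][1] in tgt_rels

# b2 of Sect. 4.1 against the skeleton (T2, T4) of Sect. 4.2
b2 = {"for": [("w", "Works"), ("g", "Group")],
      "satisfying": ["Works.gno = Group.gno", 'Works.addr = "NY"'],
      "exists": [("e", "Emp")]}
T2 = (("Works", "Group"), ("Works.gno = Group.gno",))
T4 = (("Emp", "Dept"), ("Emp.did = Dept.did",))
print(matches(b2, (T2, T4)))   # True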

Fig. 5 The Match algorithm used by MapMerge


For our running example, it is easy to see that the basic SO tgd b1 matches the skeleton (T1, T3). In fact, b1 matches every skeleton from Phase 2. On the other hand, the basic SO tgd b2 matches only the skeleton (T2, T4) under the homomorphisms (h1, h2), where h1 = {w ↦ w, g ↦ g} and h2 = {e ↦ e}. Altogether, we obtain the following matching of basic SO tgds on skeletons:

(T1, T3, b1)                (T1, T4, b1)
(T1, T5, b1)                (T2, T3, b1 ∧ b′2)
(T2, T4, b1 ∧ b2 ∧ b′2)     (T2, T5, b1 ∧ b′2 ∧ b3)

Note that the basic SO tgds that match a given skeleton may actually come from different input mappings. For example, each of the basic SO tgds that match (T2, T5) comes from a separate input mapping (from t1, t2, and t3, respectively). In a sense, we aggregate behaviors from multiple input mappings in a given skeleton.

Computing merged SO tgds. For each skeleton along with the matching basic SO tgds, we now construct a "merged" SO tgd. The pseudocode of this consolidation phase is presented in Fig. 6. For our example, the following SO tgd s5 is constructed from the fifth triple (T2, T4, b1 ∧ b2 ∧ b′2) shown earlier.

Fig. 6 The ConstructSOtgd algorithm used by MapMerge

(s5) for w in Works, g in Group
     satisfying w.gno = g.gno and w.addr = "NY"
     exists e in Emp, d in Dept
     where e.did = d.did
     and d.did = F[g] and d.dname = g.gname
     and e.ename = w.ename and e.addr = w.addr
     and e.did = G[w, g] and d.did = G[w, g]

The variable bindings in the source and target tableaux are taken literally and added to the for and, respectively, exists clause of the new SO tgd. The equalities in T2 and T4 are also taken literally and added to the satisfying and, respectively, where clause of the SO tgd. More interestingly, for every basic SO tgd σ that matches the skeleton (T2, T4), we take the where clause of σ (after applying the respective homomorphisms) and add it to the where clause of the new SO tgd. In addition, we take the user-defined filters in the satisfying clause of σ (the equalities between source expressions and constants) and add them to the satisfying clause of the new SO tgd. The last three lines in the above SO tgd incorporate conditions taken from each of the basic SO tgds that match (T2, T4) (i.e., from b1, b2, and b′2, respectively).

The constructed SO tgd consolidates the semantics of b1, b2, and b′2 under one merged mapping. Intuitively, since all three basic SO tgds are applicable whenever the source pattern is given by T2 and the target pattern is given by T4,

the resulting SO tgd takes the conjunction of the "behaviors" of the individual basic SO tgds.
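The following Python sketch gives a rough, string-level outline of this construction (ours, not the ConstructSOtgd algorithm of Fig. 6): tableau bindings and equalities are copied verbatim, and the where clauses and user filters of the matching basic SO tgds are conjoined; variable renaming via homomorphisms is approximated by a simple rename function.

def construct_so_tgd(skeleton, matching_basics, rename):
    (src_rels, src_joins), (tgt_rels, tgt_joins) = skeleton
    tgd = {
        "for": [(rename(r), r) for r in src_rels],       # bindings from source tableau
        "satisfying": list(src_joins),                   # equalities of the source tableau
        "exists": [(rename(r), r) for r in tgt_rels],    # bindings from target tableau
        "where": list(tgt_joins),                        # equalities of the target tableau
    }
    for b in matching_basics:
        tgd["where"].extend(b["where"])                  # conjoin each basic tgd's behavior
        tgd["satisfying"].extend(e for e in b["satisfying"] if '"' in e)  # user filters only
    # duplicates are pruned; conflicting or residual equalities are handled later
    tgd["satisfying"] = sorted(set(tgd["satisfying"]))
    tgd["where"] = sorted(set(tgd["where"]))
    return tgd

# skeleton (T2, T4) with the (string-level) clauses of b1, b2, b'2
T2 = (("Works", "Group"), ("Works.gno = Group.gno",))
T4 = (("Emp", "Dept"), ("Emp.did = Dept.did",))
b1  = {"satisfying": [], "where": ["Dept.did = F[g]", "Dept.dname = Group.gname"]}
b2  = {"satisfying": ["Works.gno = Group.gno", 'Works.addr = "NY"'],
       "where": ["Emp.ename = Works.ename", "Emp.addr = Works.addr", "Emp.did = G[w,g]"]}
b2p = {"satisfying": ["Works.gno = Group.gno", 'Works.addr = "NY"'],
       "where": ["Dept.did = G[w,g]", "Dept.dname = Group.gname"]}
print(construct_so_tgd((T2, T4), [b1, b2, b2p], rename=lambda r: r[0].lower()))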

Correlations. A crucial point about the above construction is that a target expression may now be assigned multiple expressions. For example, in the above SO tgd, the target expression d.did is equated with two expressions: F[g] via b1, and G[w, g] via b′2. In other words, the semantics of the new constraint requires the values of the two Skolem terms to coincide. This is actually what it means to correlate b1 and b′2. We can represent such correlation, explicitly, as the following conditional equality (implied by the above SO tgd):

for w in Works, g in Group
satisfying w.gno = g.gno and w.addr = "NY"
⇒ F[g] = G[w, g]

We use the term residual equality constraint for such an equality constraint where one member in the implied equality is a Skolem term, while the other is either a source expression or another Skolem term. Such constraints have to be enforced at runtime when we perform data exchange with the result of MapMerge. In general, Skolem functions are implemented as (independent) lookup tables, where for every different combination of the arguments, the lookup table gives a fresh new null. However, residual constraints will require correlation between the lookup tables. For example, the above constraint requires that the two lookup tables (for F and G) must give the same value whenever w and g are tuples of Works and Group with the same gno value.
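The following toy Python sketch (our illustration, not part of the system) shows why such a residual constraint couples the two lookup tables: whenever the precondition holds, the table for G defers to the table for F, so both return the same null.

import itertools

_fresh = itertools.count(1)
F_table, G_table = {}, {}

def F(g):
    # independent lookup table for F
    return F_table.setdefault(g, "N%d" % next(_fresh))

def G(w, g, residual_applies):
    if residual_applies:             # precondition of F[g] = G[w, g] holds
        return G_table.setdefault((w, g), F(g))
    return G_table.setdefault((w, g), "N%d" % next(_fresh))

g = ("123", "CS")
w = ("John", "NY", "Web", "123")
print(F(g), G(w, g, residual_applies=True))   # the same null twice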

To conclude the presentation of Phase 3, we list the other two merged SO tgds below that result after this phase for our example.

(s1) from (T1, T3, b1):
     for g in Group
     exists d in Dept
     where d.did = F[g] and d.dname = g.gname

(s6) from (T2, T5, b1 ∧ b′2 ∧ b3):
     for w in Works, g in Group
     satisfying w.gno = g.gno and w.addr = "NY"
     exists p in Proj, d in Dept
     where p.did = d.did
     and d.did = F[g] and d.dname = g.gname
     and p.pname = w.pname and p.budget = H1[w]
     and p.did = H2[w]
     and d.did = G[w, g]

One aspect to note is that not all skeletons generate merged SO tgds. Although we had six earlier skeletons, only three generate mappings that are neither subsumed nor implied. We use here the technique for pruning subsumed or implied mappings described in [11]. For an example of a subsumed mapping, consider the triple (T1, T4, b1). We do not generate a mapping for this, because its behavior is subsumed by s1, which includes the same basic component b1 but maps into a more "general" tableau, namely T3. Intuitively, we do not want to construct a mapping into T4, which is a larger (more specific) tableau, without actually using the extra part of T4. Implied mappings are those that are logically implied by other mappings. For example, the mapping that would correspond to (T2, T3, b1 ∧ b′2) is logically implied by s6: they both have the same premise (T2), but s6 asserts facts about a larger tableau (T5, which includes T3) and already covers b1. For completeness, we list here the remaining mappings, which are either subsumed or logically implied by the merged SO tgds {s1, s5, s6} and are thus not considered after Phase 3. Specifically, s2 is subsumed by s1 and s3 is subsumed by s1. Moreover, s4 is logically implied by both s5 and s6.

(s2) from (T1, T4, b1):
     for g in Group
     exists e in Emp, d in Dept
     where d.did = F[g] and d.dname = g.gname

(s3) from (T1, T5, b1):
     for g in Group
     exists p in Proj, d in Dept
     where d.did = F[g] and d.dname = g.gname

(s4) from (T2, T3, b1 ∧ b′2):
     for w in Works, g in Group
     satisfying w.gno = g.gno and w.addr = "NY"
     exists d in Dept
     where d.did = F[g] and d.did = G[w, g]
     and d.dname = g.gname

Finally, for our example, we also obtain three more residual equality constraints, arising from s6, and stating the pairwise equalities of F[g], H2[w], and G[w, g] (since they are all equal to p.did and d.did, which are also equal to each other).

Since residual equalities cause extra overhead at runtime, it is worthwhile exploring when such constraints can be eliminated without changing the overall semantics. We describe such a method next.

4.4 Phase 4: eliminate residual equality constraints

The fourth and final phase of the MapMerge algorithm attempts to eliminate as many Skolem terms as possible from the generated SO tgds. The key idea is that, for each residual equality constraint, we attempt to substitute, globally, one member of the equality with the other member. If the substitution succeeds, then there is one less residual equality constraint to enforce during runtime. Moreover, the resulting SO tgds are syntactically simpler.

Consider our earlier residual constraint stating the equality F[g] = G[w, g] (under the conditions of the for and satisfying clauses). The two Skolem terms F[g] and G[w, g] occur globally in multiple SO tgds. To avoid the explicit maintenance and correlation of two lookup tables (for both F and G), we attempt the substitution of either F[g] with G[w, g] or G[w, g] with F[g]. Care must be taken since such substitution cannot be arbitrarily applied. First, the substitution can only be applied in SO tgds that satisfy the preconditions of the residual equality constraint. For our example, we cannot apply either substitution to the earlier SO tgd s1, since the precondition requires the existence of a Works tuple that joins with Group. In general, we need to check for the existence of a homomorphism that embeds the preconditions of the residual equality constraint into the for and where clauses of the SO tgd. The second issue is that the direction of the substitution matters. For example, let us substitute F[g] by G[w, g] in every SO tgd that satisfies the preconditions. There are two such SO tgds: s5 and s6. After the substitution, in each of these SO tgds, the equality d.did = F[g] becomes d.did = G[w, g] and can be dropped, since it is already in the where clause. Note, however, that the replacement of F[g] by G[w, g] did not succeed globally. The SO tgd s1 still refers to F[g]. Hence, we still need to maintain the explicit correlation of the lookup tables for F and G. On the other hand, let us substitute G[w, g] by F[g] in every SO tgd that satisfies the preconditions. Again, there are two such SO tgds: s5 and s6. The outcome is different now: G[w, g] disappears from both s5 and s6 (in favor of F[g]); moreover, it did not appear in s1 to start with. We say that the substitution of G[w, g] by F[g] has globally succeeded.

Similarly, based on the other residual equality constraint we had earlier, we can apply the substitution of H2[w] by F[g]. This affects only s6, and the outcome is that H2[w] has been successfully replaced globally. The resulting SO tgds, for our example, are:

(s1) for g in Group
     exists d in Dept
     where d.did = F[g] and d.dname = g.gname

(s′5) for w in Works, g in Group
      satisfying w.gno = g.gno and w.addr = "NY"
      exists e in Emp, d in Dept
      where e.did = d.did
      and d.did = F[g] and d.dname = g.gname
      and e.ename = w.ename and e.addr = w.addr
      and e.did = F[g]

(s′6) for w in Works, g in Group
      satisfying w.gno = g.gno and w.addr = "NY"
      exists p in Proj, d in Dept
      where p.did = d.did
      and d.did = F[g] and d.dname = g.gname
      and p.pname = w.pname and p.budget = H1[w]
      and p.did = F[g]

Fig. 7 The FindSubstitution algorithm used by MapMerge. The residual equality constraints are created as needed (in the form of actual substitutions) from the input SO tgds

As explained in Sect. 3, both s′5 and s′6 can be simplified, by removing the assertions about Dept, since they are implied by s1. The result is then identical to the SO tgds shown in Fig. 1c.

Our example covered only residual equalities between Skolem terms. The case of equalities between a Skolem term and a source expression is similar, with the difference that we form only one substitution (to replace the Skolem term by the source expression).

Fig. 8 The Substitute algorithm used by MapMerge

The exact algorithm for eliminating residual constraints, given in Figs. 7 and 8, is an exhaustive algorithm that forms each possible substitution (Fig. 7) and attempts to apply it on the existing SO tgds (Fig. 8). If the replacement is globally successful, the residual equality constraint that generated the substitution can be eliminated. Then, the algorithm goes on to eliminate other residual constraints on the rewritten SO tgds. If the replacement is not globally successful, the algorithm tries the reverse substitution (if applicable). In general, it may be the case that neither substitution succeeds globally. In such a case, the corresponding residual constraint is kept as part of the output of MapMerge. Thus, the outcome of MapMerge is, in general, a set of SO tgds together with a set of residual equality constraints. (For our example, the latter set is empty.)
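The Python sketch below illustrates the substitution attempt at the string level (our simplification of the FindSubstitution and Substitute routines of Figs. 7 and 8): a Skolem term is replaced in every SO tgd that satisfies the precondition, and the attempt is declared globally successful only if the replaced term no longer occurs anywhere.

def try_substitution(so_tgds, old_term, new_term, precondition_holds):
    rewritten, globally_succeeded = [], True
    for tgd in so_tgds:
        if precondition_holds(tgd):
            tgd = tgd.replace(old_term, new_term)
        if old_term in tgd:            # still mentioned somewhere: not eliminated
            globally_succeeded = False
        rewritten.append(tgd)
    return rewritten, globally_succeeded

s1 = "for g in Group exists d in Dept where d.did = F[g] and d.dname = g.gname"
s5 = ('for w in Works, g in Group satisfying w.gno = g.gno and w.addr = "NY" '
      "exists e in Emp, d in Dept where e.did = d.did and d.did = F[g] "
      "and e.did = G[w, g] and d.did = G[w, g]")

def needs_works_join(tgd):
    # precondition of F[g] = G[w, g]: a Works tuple joining with Group
    return "w in Works" in tgd and "w.gno = g.gno" in tgd

# direction 1: F[g] -> G[w, g] fails globally (s1 still uses F[g])
print(try_substitution([s1, s5], "F[g]", "G[w, g]", needs_works_join)[1])  # False
# direction 2: G[w, g] -> F[g] succeeds globally
print(try_substitution([s1, s5], "G[w, g]", "F[g]", needs_works_join)[1])  # True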

We add next a simple example showing that residual equality constraints cannot always be eliminated. Consider the following two input constraints over a source schema {R(A), S(A, B)} and a target schema {T(A, C)}:

for r in R
exists t in T
where t.A = r.A and t.C = F[r]

for r in R, s in S
satisfying r.A = s.A
exists t in T
where t.A = r.A and t.C = s.B

In Phase 3 of the MapMerge algorithm, the following residual equality constraint is generated:

for r in R, s in S satisfying r.A = s.A ⇒ F[r] = s.B

The replacement of F[r] with s.B is the only possible substitution. However, it does not succeed, simply because s.B is not in the scope of the first constraint. Syntactically, the result of applying the substitution would not be a valid constraint.

Finally, the last issue that arises is the case of conflicts in mapping behavior. Conflicts can also be described via constraints, similar to residual equality constraints but with the main difference that both members of the equality are source expressions (and not Skolem terms). To illustrate, it might be possible that a merged SO tgd asserts that the target expression d.dname is equal both to g.gname (from some input mapping) and to g.code (from some other input mapping, assuming that code is some other source attribute). Then, we obtain conflicting semantics, with two competing source expressions for the same target expression. Our algorithm flags such conflicts whenever they arise and returns the mapping to the user to be resolved.

5 Evaluation

To evaluate the quality of the data generated based on MapMerge, we introduce a measure that captures the similarity between a source and target instance by measuring the amount of data associations preserved by the transformation from the source to the target instance. We use this similarity measure in our experiments to show that the mappings derived by MapMerge are better than the input mappings. The experimental results are presented in Sect. 5.2.

5.1 Similarity measure

The main idea behind our measure is to capture the extent to which the "associations" in a source instance are preserved when transformed into a target instance of a different schema. For each instance, we will compute a single relation that incorporates all the natural associations between data elements that exist in the instance. There are two types of associations we consider. The first type is based on the chase with referential constraints and is naturally captured by tableaux. As seen in Sect. 4.2, a tableau is a syntactic object that takes the "closure" of each relation under referential constraints. We can then materialize the join query that is encoded in each tableau and select all the attributes that appear in the input relations (without duplicating the foreign key/key attributes). Thus, for each tableau, we obtain a single relation, called tableau relation, that conveniently materializes together data associations that span multiple relations. For example, the tableau relations for the source instance I in Fig. 2 (for tableaux T1 and T2 in Sect. 4.2) are shown on top of Fig. 9b. We denote the tableau relations of an instance I of schema S as τS(I), or simply τ(I). The tableau relations τ(J1) and τ(J2) for our running example are also shown in Fig. 9.

The second type of association that we consider is based on the notion of full disjunction [12,21]. Intuitively, the full disjunction of relations R1, . . ., Rk, denoted as FD(R1, . . ., Rk), captures in a single relation all the associations (via natural join) that exist among tuples of the input relations.

Fig. 9 Tableau relations of J1, I, and J2 of Fig. 2 and their full disjunctions

The reason for using full disjunction is that tableau relations by themselves do not capture all the associations. For example, consider the association that exists between John and Web in the earlier target instance J2. There, John is an employee in CS, and Web is a project in CS. However, since there is no directed path via foreign keys from John to Web, the two data elements appear in different tableau relations of τ(J2) (namely, DeptEmp and DeptProj). On the other hand, if we take the natural join between DeptEmp and DeptProj, the association between John and Web will appear in the result. Thus, to capture all such associations, we apply an additional step, which computes the full disjunction FD(τ(I)) of the tableau relations. This generates a single relation that conveniently captures all the associations in an instance I of schema S. Intuitively, each tuple in this relation corresponds to one association that exists in the data.

Operationally, full disjunction must perform the outer "union" of all the tuples in every input relation, together with all the tuples that arise via all possible natural joins among the input relations. To avoid redundancy, minimal union is used instead of union. This means that in the final relation, tuples that are subsumed by other tuples are pruned. A tuple t is subsumed by a tuple t′ if for all attributes A such that t.A ≠ null, it is the case that t′.A = t.A. We omit here the details of implementing full disjunction, but we note that such implementation is part of our experimental evaluation.
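As an illustration only, the following naive Python sketch computes a full disjunction over tableau relations represented as lists of dictionaries, using None for the SQL null; it considers pairwise natural joins, which suffice for this small example, and applies minimal union by pruning subsumed tuples.

from itertools import combinations

def natural_join(R, S):
    """All pairs of tuples from R and S that agree on their shared attributes."""
    joined = []
    for r in R:
        for s in S:
            shared = set(r) & set(s)
            if all(r[a] == s[a] for a in shared):
                joined.append({**r, **s})
    return joined

def subsumed(t, u):
    """t is subsumed by u if u agrees with t on every non-null attribute of t."""
    return t != u and all(u.get(a) == v for a, v in t.items() if v is not None)

def full_disjunction(relations, all_attrs):
    pad = lambda t: {**dict.fromkeys(all_attrs), **t}    # missing attributes become null
    tuples = [pad(t) for R in relations for t in R]
    for R, S in combinations(relations, 2):              # pairwise joins (enough here)
        tuples += [pad(t) for t in natural_join(R, S)]
    return [t for t in tuples if not any(subsumed(t, u) for u in tuples)]  # minimal union

# the tableau relations of J2 in Fig. 9
DeptEmp  = [{"dname": "CS", "did": "D", "ename": "John", "addr": "NY"}]
DeptProj = [{"dname": "CS", "did": "D", "pname": "Web", "budget": "B"}]
attrs = ["dname", "did", "ename", "addr", "pname", "budget"]
for t in full_disjunction([DeptEmp, DeptProj], attrs):
    print(t)    # a single tuple linking John, Web, and CS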

For our example, we show FD(τ(J1)), FD(τ(I)), and FD(τ(J2)) at the bottom of Fig. 9. There, we use the '−' symbol to represent the SQL null value. We note that FD(τ(J2)) connects now all three of John, Web, and CS in one tuple.

Now that we have all the associations in a single relation, one on each side (source or target), we can compare them. More precisely, given a source instance I and a target instance J, we define the similarity between I and J by defining the similarity between FD(τ(I)) and FD(τ(J)). However, when we compare tuples between FD(τ(I)) and FD(τ(J)), we should not compare arbitrary pairs of attributes. Intuitively, to avoid capturing "accidental" preservations, we want to compare tuples based only on their compatible attributes that arise from the mapping. In the following, we assume that all the mappings that we need to evaluate implement the same set V of correspondences between attributes of the source schema S and attributes of the target schema T. This assumption is true for mapping generation algorithms, which start from a set of correspondences and generate a faithful implementation of the correspondences (without introducing new attribute-to-attribute mappings). It is also true for MapMerge and its input, since MapMerge does not introduce any new attribute-to-attribute mappings that are not already specified by the input mappings. Given a set V of correspondences between S and T, we say that an attribute A of S is compatible with an attribute B of T, and we write A ∼ B, if either (1) there is a direct correspondence between A and B in V, or (2) A is related to an attribute A′ via a foreign key constraint of S, B is related to an attribute B′ via a foreign key constraint of T, and A′ is compatible with B′. For our example, the pairs of compatible attributes (from source to target) are: (gname, dname), (ename, ename), (addr, addr), (pname, pname).

Definition 1 (Tuple similarity) Let t1 and t2 be two tuples in FD(τ(I)) and, respectively, FD(τ(J)). The similarity of t1 and t2, denoted as Sim(t1, t2), is defined as:

Sim(t1, t2) = |{A ∈ Atts(t1) | ∃B ∈ Atts(t2), A ∼ B, t1.A = t2.B ≠ ⊥}| / |{A ∈ Atts(t1) | ∃B ∈ Atts(t2), A ∼ B}|

Intuitively, Sim(t1, t2) captures the ratio of the number of values that are actually exported from t1 to t2 versus the number of values that could be exported from t1 according to V. For instance, let t1 be the only tuple in FD(τ(I)) from Fig. 9 and t2 the only tuple in FD(τ(J2)). Then, Sim(t1, t2) is 1.0, since t1.A = t2.B for every pair of compatible attributes A and B. Now, let t2 be the first tuple in FD(τ(J1)). Since only t1.gname = t2.dname out of four possible pairs of compatible attributes, we have that Sim(t1, t2) is 0.25.

Definition 2 (Instance similarity) The similarity between FD(τ(I)) and FD(τ(J)) is

Sim(FD(τ(I)), FD(τ(J))) = Σ_{t1 ∈ FD(τ(I))} max_{t2 ∈ FD(τ(J))} Sim(t1, t2).

Figure 9 depicts the similarities Sim(FD(τ(I)), FD(τ(J1))) and Sim(FD(τ(I)), FD(τ(J2))). The former similarity score is obtained by comparing the only tuple in FD(τ(I)) with the best matching tuple (i.e., the second tuple) in FD(τ(J1)).
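The two definitions translate almost directly into code. The following Python sketch (ours) computes the tuple and instance similarities on the running example, with None playing the role of null and the compatible dictionary encoding the correspondences A ∼ B.

compatible = {"gname": "dname", "ename": "ename",
              "addr": "addr", "pname": "pname"}   # source attribute -> target attribute

def tuple_sim(t1, t2):
    # denominator: attributes of t1 that have a compatible attribute in t2
    candidates = [a for a in t1 if a in compatible and compatible[a] in t2]
    if not candidates:
        return 0.0
    # numerator: compatible attributes whose (non-null) values actually agree
    hits = [a for a in candidates
            if t1[a] is not None and t1[a] == t2[compatible[a]]]
    return len(hits) / len(candidates)

def instance_sim(fd_source, fd_target):
    return sum(max(tuple_sim(t1, t2) for t2 in fd_target) for t1 in fd_source)

# FD(tau(I)) and FD(tau(J2)) from Fig. 9 (single tuples each)
fd_I  = [{"gname": "CS", "gno": "123", "ename": "John",
          "addr": "NY", "pname": "Web"}]
fd_J2 = [{"dname": "CS", "did": "D", "ename": "John",
          "addr": "NY", "pname": "Web", "budget": "B"}]
print(instance_sim(fd_I, fd_J2))   # 1.0, as in the running example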

5.2 Experimental results

We conducted a series of experiments on a set of synthetic and real-life mapping scenarios to evaluate MapMerge. We first report on the synthetic mapping scenarios and, using the similarity measure presented in Sect. 5, demonstrate a clear improvement in the preservation of data associations when using MapMerge. We then present results for two interesting real-life scenarios, whose characteristics match those of our synthetic scenarios. We have also implemented some of our synthetic scenarios on two commercial mapping systems. The comparison between the mappings generated by these systems and by MapMerge produced results similar to the previous experiments.

We implemented MapMerge in Java as a module of Clio [11]. For all our experiments, we started by creating the mappings with Clio. These mappings were then used as input to the MapMerge operator. To perform the data exchange, we used the query generation component in Clio to obtain SQL queries that implement the mappings in the input and output of MapMerge. These queries were subsequently run on DB2 Express-C 9.7. All results were obtained on a dual Intel Xeon 3.4 GHz machine with 4 GB of RAM.

Synthetic mapping scenarios. Our synthetic mapping scenarios follow the pattern of transforming data from a denormalized source schema to a target schema containing a number of relational hierarchies, with each hierarchy having at its top an "authority" relation, while other relations refer to the authority relation through foreign key constraints. In this set of experiments, we focus on hierarchies with two levels. As an example, Fig. 10 shows such a scenario, where the source schema consists of a single relation S, and the target schema consists of 2 hierarchies of relations, rooted at relations T1 and T2. Each relation in a hierarchy refers to the root via a foreign key constraint from its F attribute to the K attribute of the root. This type of target schema is typical in ontologies, where a hierarchy of concepts often occurs.

Fig. 10 Synthetic experimental scenario

The synthetic scenarios are parameterized by the number of hierarchies in the target schema, as well as the number of relations referring to the root in each hierarchy. For our experimental settings, we choose these two parameters to be equal, and their common value n defines what we call the complexity of the mapping scenario. In Fig. 10, we show an example where n = 2. The table in Fig. 11 shows the sizes of the experimental scenarios in terms of the number of target relations and the execution times for generating Clio mappings and running the MapMerge operator. We notice that the time needed to execute MapMerge is small (less than 2 minutes in our largest scenario) but dominates the overall execution time as the number of target relations grows.

The graphs in Fig. 11 show the results of our experiments on the synthetic mapping scenarios. For each scenario, the source instance contained 100 tuples populated with randomly generated string values. The first graph shows that the target instances generated using MapMerge mappings are consistently (and considerably) smaller than the instances generated using Clio mappings. Here, we took the size of the target instance to be its total number of atomic data values (i.e., the product of the number of tuples and the relation arity, summed across target relations).
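The size metric itself is simple to state; as a sketch (assuming a target instance represented as a dictionary from relation names to lists of equal-arity tuples):

def instance_size(instance):
    # Total number of atomic data values: number of tuples times relation arity,
    # summed across all target relations.
    return sum(len(tuples) * len(tuples[0]) for tuples in instance.values() if tuples)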

The second graph in Fig. 11 shows that the source instances have a higher degree of similarity to these smaller target MapMerge instances. The degree of similarity of a source instance I to a target instance J is computed as the ratio of Sim(FD(τ(I)), FD(τ(J))) to Sim(FD(τ(I)), FD(τ(I))), where the latter represents the ideal case where every tuple in FD(τ(I)) is preserved by the target instance; this quantity simplifies to the expression |FD(τ(I))|. We notice that the degree of similarity decreases as the complexity of the mapping scenario (n) increases. This is because as n increases, more uncorrelated hierarchies are used in the target schema. In turn, this means that the source relation is broken into more uncorrelated target hierarchies, and hence, they are less similar to the source. The graph shows that Clio mappings, when compared to MapMerge mappings, produce target instances that are significantly less similar to the source instance in all cases. Furthermore, the relative improvement when using MapMerge on top of the Clio mappings (shown as the numbers on top of the bars) increases substantially as n becomes larger. Intuitively, this is because most of the Clio mappings will map the source data into each root and one of its child relations. On the other hand, MapMerge factors out the common mappings into the root relation and properly correlates the generated tuples for the child relation with the tuples in the root relation. The effect is that all child relations in the hierarchy are correlated by MapMerge, while Clio mappings can only correlate root-child pairs.
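In terms of the instance_sim sketch given earlier (again only an illustration), the degree of similarity reported in the graphs can be written as:

def degree_of_similarity(fd_i, fd_j, V):
    # Ratio of the achieved similarity to the ideal value |FD(tau(I))|,
    # reported as a percentage.
    return 100.0 * instance_sim(fd_i, fd_j, V) / len(fd_i)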

Real-life scenarios. We consider two related scenarios from the biological domain in this section.


Fig. 11 Experiments on synthetic scenarios

Table 1 Characteristics of real-life mapping scenarios

Mapping scenario    Source relations    Target relations    Attribute correspondences    Clio mappings
GeneOntology        3                   4                   5                            4
UniProt             13                  10                  23                           14

In the first scenario, we mapped Gene Ontology into BioWarehouse.4 In the second, we mapped UniProt to BioWarehouse. The BioWarehouse documentation specifies the semantics of the data transformations needed to load data from various biological databases, including Gene Ontology and UniProt. In the GeneOntology scenario, we extracted 1000 tuples for each relation in the source schema of the mapping, while in the UniProt scenario we extracted the first 100 entries for the human genome and converted them from XML to a relational format to use as a source instance in our experiments. Table 1 shows the number of source and target tables mapped, the number of correspondences used for each mapping scenario, and the number of mapping expressions generated by Clio for each scenario.

4 http://biowarehouse.ai.sri.com/.

In Table 2, we show the results of applying MapMerge to the mappings generated by Clio in each scenario. The generation time columns show the time needed to generate the Clio mappings and the time needed by MapMerge to process those mappings (i.e., the total execution time is the sum of the two times). The size of target instance columns show the total number of atomic data values in the generated target BioWarehouse instance for each scenario. In both cases, the mappings produced with MapMerge reduced the target instance sizes.

The degree of similarity columns present the similarity measure for each scenario. This similarity is normalized as a percentage to ease comparison across scenarios, and the percentage is with respect to the ideal similarity that a mapping can produce for the scenario. As discussed in the context of the synthetic scenarios, this ideal similarity is the number of tuples in the tableau full disjunction of the source, i.e., |FD(τ(I))|.

In the two real-life settings, MapMerge is able to further correlate the mappings produced by Clio by reusing behavior in mappings that span different target tableaux, thus improving the degree of similarity. This improvement is very significant in the UniProt scenario, where the target schema has a central relation and twelve satellite relations that point back to the central relation (via a key/foreign key).


Table 2 Results for real-life mapping scenarios

Mapping scenario    Mapping generation time (s)    Size of target instance    Degree of similarity (%)
                    Clio       MapMerge            Clio       MapMerge        Clio       MapMerge
GeneOntology        1.71       0.34                11557      7801            29.7       35.3
UniProt             2.36       2.13                12923      11446           20.8       75.8

Here, each Clio mapping maps source data to the central relation and one satellite relation. MapMerge factors out this common part from all Clio mappings and properly correlates all generated tuples to the central relation.

Commercial systems. We implemented some of our synthetic scenarios in two commercial mapping systems. Provided with only the correspondences from source attributes to target attributes, these systems produced mappings that scored lower than both the Clio and MapMerge mappings with respect to the preservation of data associations. For instance, in the synthetic scenario of complexity 2, the MapMerge mappings scored 50% and the Clio mappings 33%, while both commercial systems scored only 16%. The main reason behind this result is that these systems do not automatically take advantage of the constraints present on the schemas to better correlate the generated data and increase the preservation of data associations. The mappings generated by these commercial systems would need to be manually refined to fix this lack of correlation.

6 Correlating flows of schema mappings with MapMerge and composition

As discussed in the introduction, we can bring the modular design of mappings beyond sets of parallel mappings between the same pair of schemas, toward assembling general flows of mappings. To generate meaningful end-to-end transformation specifications for such flows, we have to bring the sequential mapping composition operator [9] into the picture. This operator can be used to obtain end-to-end mappings from chains of successive mappings. In contrast, MapMerge assembles sets of "parallel" mappings. These two operators can be leveraged in conjunction to correlate flows of mappings.

Recall the example of the flow of mappings in Fig. 1a. The individual mappings can be assembled into an end-to-end mapping from the schema S1 to the schema S4 through repeated applications of the MapMerge and composition operators. To exemplify, the specialized mapping for Dept records between S2 and S4 is a result of composing the mCS and m mappings. Furthermore, the right correlations among the Dept, Emp, and Proj records that are migrated into S4 can be achieved by applying MapMerge on m1, m2, and the result mCS ◦ m of the previous composition.

Fig. 12 The mapping flow correlation algorithm

We will provide the complete details of how this mapping flow is correlated after we take a closer look at our algorithm.

Flow correlation algorithm. A flow of mappings can be modeled as a multigraph whose nodes are the schemas and whose edges are the mappings between the schemas. Recall that a mapping consists of a pair of source and target schemas as well as a set of constraints specified by SO tgds. In this algorithm, a mapping between a source and a target schema is either part of the input or a consequence of applying MapMerge or mapping composition. Our algorithm assumes that the graph of mappings is acyclic. The algorithm, which is shown in Fig. 12, proceeds through alternating phases of applying the MapMerge and mapping composition operators and terminates when no further progress can be made. In a MapMerge phase, the multigraph modeling the flow is essentially transformed into a regular graph. For any pair of schemas Si, Sj, the set of mappings Mij going from Si to Sj is replaced by the result of applying MapMerge on Mij. In a mapping composition phase, for any distinct schemas Si, Sj, Sk in the flow such that M1 is a mapping from Si to Sj and M2 is a mapping from Sj to Sk, the result M = M1 ◦ M2 of composing M1 and M2 is added to the flow.



Fig. 13 Illustration of the flow correlation algorithm

We use here the mapping composition algorithm in [9], since it applies to schema mappings specified by SO tgds.

Our correlation algorithm keeps track, via the set C, of the mappings being added to the flow in the composition phase. As a result, a mapping is not re-added to the flow if the result of composing the same mappings was computed and added to the flow previously in the execution of the algorithm. The algorithm terminates when no new mappings can be added to the flow in the composition phase, and returns the correlated flow of mappings M. After executing this algorithm, the flow of mappings will contain at most one mapping between each pair of schemas (with the understanding that each mapping is typically a set of correlated formulas).
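The overall loop of Fig. 12 can be summarized by the following sketch (our own rendering, not the paper's code; MapMerge and composition are treated as opaque operators, and mappings are assumed to be hashable objects that carry their source and target schema names):

from dataclasses import dataclass
from typing import Callable, FrozenSet, Set, Tuple

@dataclass(frozen=True)
class Mapping:
    src: str                      # source schema name
    tgt: str                      # target schema name
    constraints: Tuple[str, ...]  # SO tgds, treated as opaque strings here

def correlate_flow(mappings: Set[Mapping],
                   map_merge: Callable[[FrozenSet[Mapping]], Mapping],
                   compose: Callable[[Mapping, Mapping], Mapping]) -> Set[Mapping]:
    M: Set[Mapping] = set(mappings)          # current flow (assumed acyclic)
    C: Set[Tuple[Mapping, Mapping]] = set()  # pairs of mappings already composed
    while True:
        # MapMerge phase: collapse parallel mappings between the same pair of
        # schemas, turning the multigraph into a regular graph.
        groups = {}
        for m in M:
            groups.setdefault((m.src, m.tgt), set()).add(m)
        M = {map_merge(frozenset(g)) if len(g) > 1 else next(iter(g))
             for g in groups.values()}
        # Composition phase: for every chain Si -> Sj -> Sk over distinct schemas,
        # add the composition unless the same pair was composed before.
        new = set()
        for m1 in M:
            for m2 in M:
                if m1.tgt == m2.src and m1.src != m2.tgt and (m1, m2) not in C:
                    C.add((m1, m2))
                    new.add(compose(m1, m2))
        if not new:       # no progress in the composition phase: we are done
            return M
        M |= new

When the loop exits, the returned flow contains at most one mapping between each pair of schemas, as described above.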

Illustration of the flow correlation algorithm. In the following, we will demonstrate in some detail the execution of our flow correlation algorithm on the mapping scenario in Fig. 1a. This scenario is depicted as a multigraph in Fig. 13a. The nodes of the graph are the schemas S1, S2, S3, and S4. The edges connecting these nodes are the initial schema mappings, as defined in Fig. 1a. For simplicity, when a mapping consists of a single constraint, we will often drop the set notation, e.g., we will use m as a shorthand to denote the mapping specified by the set of constraints {m}. The flow algorithm will iterate and successively apply the MapMerge and composition operators. We will examine each step of the algorithm.

First, the multigraph in Fig. 13a is transformed into a regular graph by using MapMerge to correlate the multiple mappings connecting the schemas S1, S2, and S2, S4, respectively. The result is depicted in Fig. 13b. The new edge between S1 and S2 is the mapping M1, which is the result of applying MapMerge on the input constraints t1, t2, t3, which are specified in Fig. 1b. The result M1 is precisely the mapping in Fig. 1c. The derivation of this mapping via MapMerge was explained in Sects. 3 and 4.

Note that schemas S2 and S4 are also connected by multiple mappings: m1 and m2, which are shown below:

m1: for e in Emp
    exists c in CompSci, s in c.Staff
    where s.ename = e.ename and
          s.address = e.addr

m2: for p1 in Proj
    exists c in CompSci, p2 in c.Projects
    where p2.pname = p1.pname and
          p2.budget = p1.budget


In the same first step of the flow algorithm, MapMerge attempts to correlate these mappings. However, after splitting them into basic SO tgds, there is no skeleton on which basic SO tgds from different input mappings are matched. Hence, no further correlation is possible here, and the resulting mapping M2, although syntactically different, is logically equivalent to the set {m1, m2}. We show the mapping M2 below. The new Skolem functions are introduced since the input constraints do not specify any mapping behavior for the CompSci set at the root of the schema S4. Moreover, the for clauses reflect the matching on the skeletons of schema S2, which extend from the Emp and Proj relations via foreign key constraints.

M2: for e in Emp, d in Dept
    satisfying e.did = d.did
    exists c in CompSci, s in c.Staff
    where c.did = F1[e] and c.dname = F2[e] and
          c.Staff = F3[e] and c.Projects = F4[e] and
          s.ename = e.ename and
          s.address = e.address

    for p1 in Proj, d in Dept
    satisfying p1.did = d.did
    exists c in CompSci, p2 in c.Projects
    where c.did = G1[p1] and c.dname = G2[p1] and
          c.Staff = G3[p1] and c.Projects = G4[p1] and
          p2.pname = p1.pname and
          p2.budget = p1.budget

In the second step, the flow of mappings evolves from the state in Fig. 13b to the state in Fig. 13c. This corresponds to the second phase of the first iteration through the loop of the flow algorithm: the mapping composition phase. The algorithm identifies pairs of mappings such that the target schema of the first mapping is the source schema of the second mapping and applies the mapping composition operator on them. In Fig. 13b, there are three such pairs: (M1, M2), (M1, mCS), and (mCS, m). Let us examine more closely the first pair of mappings in this list that will be subject to composition. Intuitively, the mapping composition operator tries to match the atoms in the for clauses of the M2 assertions with the exists clauses of the constraints in M1. When a match occurs, the atom in the for clause of M2 is "unfolded" based on the for clause of the M1 constraint that matched. For instance, the e in Emp atom in the first constraint of M2 matches the exists clause of the second constraint in the mapping M1 (see Fig. 1c). Hence, the e in Emp atom will be expanded using the for clause of that constraint. We omit here the formal details of the mapping composition algorithm, which includes some additional steps, and refer the interested reader to [9]. We show below the schema mapping M3 that results after composing M1 and M2.

M3: for g in Group, w in Works
    satisfying g.gno = w.gno and w.addr = "NY"
    exists c in CompSci, s in c.Staff
    where c.did = F1[w.ename, w.addr, F[g]] and
          c.dname = F2[w.ename, w.addr, F[g]] and
          c.Staff = F3[w.ename, w.addr, F[g]] and
          c.Projects = F4[w.ename, w.addr, F[g]] and
          s.ename = w.ename and
          s.address = w.address

    for g in Group, w in Works
    satisfying g.gno = w.gno and w.addr = "NY"
    exists c in CompSci, p in c.Projects
    where c.did = G1[w.pname, H1[w], F[g]] and
          c.dname = G2[w.pname, H1[w], F[g]] and
          c.Staff = F3[w.pname, H1[w], F[g]] and
          c.Projects = F4[w.pname, H1[w], F[g]] and
          p.pname = w.pname and
          p.budget = H1[w]

Note that, as a result of the composition operator, Staff and Projects records in the target schema S4 are now asserted directly in terms of the Group and Works records in the schema S1. Similar applications of mapping composition yield in a straightforward manner the definitions of M4 = M1 ◦ mCS and M5 = mCS ◦ m. They are presented below, along with the initial specifications of mCS and m. (The new Skolem functions K1 and K2 are introduced in m since this mapping does not assert any identifiers for the Staff and Projects sets.)

mCS: for d in Dept
     satisfying d.dname = "CS"
     exists c in CSDept
     where c.did = d.did and
           c.dname = d.dname

m:   for c1 in CSDept
     exists c2 in CompSci
     where c2.did = c1.did and
           c2.dname = c1.dname and
           c2.Staff = K1[c1] and
           c2.Projects = K2[c1]


M4: for g in Group
    satisfying g.gname = "CS"
    exists c in CSDept
    where c.did = F[g] and
          c.dname = g.gname

M5: for d in Dept
    satisfying d.dname = "CS"
    exists c in CompSci
    where c.did = d.did and
          c.dname = d.dname and
          c.Staff = K1[d] and
          c.Projects = K2[d]

Note that in the mapping M4, through the composition operator, the filter on the value of dname from mCS has been "pushed back" to the Group record from S1. This follows intuition, since the dname attribute in Dept and the gname attribute in Group are put into correspondence via M1. Furthermore, the mapping M5 asserts CompSci records in S4 in terms of Dept records in S2.

The following step of the algorithm takes the flow in Fig. 13c and applies MapMerge to assemble the parallel mappings M2 and M5 between the schemas S2 and S4 into the mapping M6, as in Fig. 13d. Before explaining the result, we list here the constraints forming the specification of M6:

M6: for d in Dept
    satisfying d.dname = "CS"
    exists c in CompSci
    where c.did = d.did and c.dname = d.dname and
          c.Staff = K1[d] and c.Projects = K2[d]

    for e in Emp, d in Dept
    satisfying e.did = d.did and d.dname = "CS"
    exists c in CompSci, s in c.Staff
    where c.did = d.did and c.dname = d.dname and
          c.Staff = K1[d] and c.Projects = K2[d] and
          s.ename = e.ename and s.address = e.addr

    for p1 in Proj, d in Dept
    satisfying p1.did = d.did and d.dname = "CS"
    exists c in CompSci, p2 in c.Projects
    where c.did = d.did and c.dname = d.dname and
          c.Staff = K1[d] and c.Projects = K2[d] and
          p2.pname = p1.pname and p2.budget = p1.budget

There are two important changes in M6 with respect to the input mappings M2 and M5. These changes are due to the fact that MapMerge allows the more specific constraints that map Emp and Proj records to inherit mapping behavior from the more general constraint that maps Dept records. First, Emp and Proj records are now only mapped if they are associated with the "CS" Dept record. This is intuitive, since the M5 top-level mapping only maps Dept records with dname = "CS". Second, in the input constraints of M2, target Staff and Projects records are asserted under uncorrelated root CompSci records. This is no longer the case in the resulting mapping M6 after MapMerge, where Emp and Proj records are now mapped precisely under the CompSci record created for the Dept record they are associated with in schema S2. This correlation is made possible by MapMerge and is implemented via the reuse of the K1 and K2 Skolem functions across the components of M6.

In the next step, the flow correlation algorithm applies mapping composition twice, to add the mappings M7 = M1 ◦ M6 and M8 = M4 ◦ m to the flow. The results are basically rewritings of the assertions in M6 and m in terms of S1. Moreover, all the user-defined filters are now consolidated in the end-to-end mapping, as they accumulated on the composition path from S1 to S4. For completeness, we state the constraints specifying M7 and M8 below.

M7: for g in Group
    satisfying g.gname = "CS"
    exists c in CompSci
    where c.did = F[g] and c.dname = g.gname and
          c.Staff = K1[F[g], g.gname] and
          c.Projects = K2[F[g], g.gname]

    for w in Works, g in Group
    satisfying w.did = g.gno and
               w.addr = "NY" and g.gname = "CS"
    exists c in CompSci, s in c.Staff
    where c.did = F[g] and c.dname = g.gname and
          c.Staff = K1[F[g], g.gname] and
          c.Projects = K2[F[g], g.gname] and
          s.ename = w.ename and s.address = w.addr

    for w in Works, g in Group
    satisfying w.gno = g.gno and
               w.addr = "NY" and g.gname = "CS"
    exists c in CompSci, p in c.Projects
    where c.did = F[g] and c.dname = g.gname and
          c.Staff = K1[F[g], g.gname] and
          c.Projects = K2[F[g], g.gname] and
          p.pname = w.pname and p.budget = H1[w]

M8: for g in Group
    satisfying g.gname = "CS"
    exists c in CompSci
    where c.did = F[g] and c.dname = g.gname and
          c.Staff = K1[F[g], g.gname] and
          c.Projects = K2[F[g], g.gname]

The final step in the execution of the flow algorithm on our example is applying MapMerge to correlate the mappings M3, M7, and M8 into M9. The change in the state of the mapping flow can be observed in Fig. 13e, f. The resulting mapping M9 is specified by the constraints below, where we compacted the rules asserting Staff and Projects into one constraint, while maintaining logical equivalence.

M9: for g in Group
    satisfying g.gname = "CS"
    exists c in CompSci
    where c.did = F[g] and c.dname = g.gname and
          c.Staff = K1[F[g], g.gname] and
          c.Projects = K2[F[g], g.gname]

    for w in Works, g in Group
    satisfying w.did = g.gno and
               w.addr = "NY" and g.gname = "CS"
    exists c in CompSci, s in c.Staff, p in c.Projects
    where c.did = F[g] and c.dname = g.gname and
          c.Staff = K1[F[g], g.gname] and
          c.Projects = K2[F[g], g.gname] and
          s.ename = w.ename and s.address = w.addr and
          p.pname = w.pname and p.budget = H1[w]

After this application of MapMerge, the composition operator does not introduce any new mappings,5 and therefore, the execution of the flow algorithm terminates. The mapping M9 supports the end-to-end transformation from S1 to S4 through the flow of intermediate schemas and mappings initially specified. This mapping has intuitively the "right" semantics: source Group records that are labeled "CS" are migrated to the target as CompSci records, while project and employee information from associated Works records that pass the filter on the addr attribute are migrated into Staff and Projects target records correctly nested under the corresponding CompSci root target record.

Note that, since our final mapping is equivalent to M7, it could theoretically be obtained in a more direct way, as M1 ◦ M6. Written as an expression in terms of the input mappings and the two operators, the final mapping is

MapMerge(t1, t2, t3) ◦ MapMerge(m1, m2, mCS ◦ m)

5 i.e., that are not already in the set C.

The total number of steps in the flow correlation algorithm is determined by how many new composition steps can be applied. The total number of composition steps is bounded by the total number of paths that can exist in the mapping graph. Since the graph is acyclic, this number is a polynomial in the size of the graph. Note that the individual algorithms for MapMerge and mapping composition are exponential in the size of the input constraints.

As the preceding example showed, it may be the case that not all the steps are needed and the final result may be obtained in a more direct way. Further optimization of the algorithm to minimize the amount of work needed to achieve the final mappings is an open research question.

We also note that mapping composition preserves logical equivalence, and hence, the similarity measure of Sect. 5.1 is always preserved in the composition phase of the algorithm. For MapMerge, which is the component of the flow correlation algorithm that does not preserve logical equivalence, we have already given an experimental justification of its benefits based on the similarity measure in Sect. 5.2.

7 Related work

Model management [17] has considered various operators on schema mappings, among which Confluence is closest in spirit to MapMerge. Confluence also operates on mappings with the same source and target schema, and it amounts to taking the conjunction of the constraints in the input mappings. Thus, Confluence does not attempt any correlation of the input mappings. Our work can be seen as a step toward the high-level design and optimization of ETL flows [22,23]. This can be envisioned by incorporating mappings [5] into such flows and employing operators such as MapMerge and composition to support modularity and reuse.

The instance similarity measure we used to evaluate MapMerge is inspired by the very general notion of Hausdorff distance between subsets of metric spaces and by the sum of minimum distances measure. We refer to [6] for a discussion of these measures. Moreover, our notion of tuple similarity is loosely based on the well-known Jaccard coefficient. However, the previous measures are symmetric and agnostic to the transformation that produces one database instance from the other. In contrast, our notion is tailored to measure the preservation of data associations from a source database to a target database under a schema mapping.

8 Conclusions

We have presented our MapMerge algorithm and an evaluation of our implementation of MapMerge. Through a similarity measure that computes the amount of data associations that are preserved from one instance to another, our evaluation shows that a given source instance has higher similarity to the target instance obtained through MapMerge when compared to target instances obtained through other mapping tools. As part of our future work, we intend to explore the use of the notion of information loss [10] to give a theoretical justification of the benefits of using MapMerge.

In this paper, we have also presented an algorithm that can be used to correlate flows of mappings by sequencing applications of the MapMerge and mapping composition operators. This algorithm makes uniform use of the language of SO tgds for both MapMerge and mapping composition. While our current algorithm is exhaustive and explores all possibilities of applying MapMerge and composition, we envision that subsequent optimization techniques will further reduce the amount of work needed to produce the final mapping. We plan to investigate such techniques as part of our future work. In addition, we would like to further explore the applicability of our flow correlation algorithm to various practical data integration scenarios.

Acknowledgments Mauricio Hernández and Lucian Popa were partially funded by the U.S. Air Force Office for Scientific Research under contracts FA9550-07-1-0223 and FA9550-06-1-0226. Part of the work was done while Bogdan Alexe was a Ph.D. candidate at UC Santa Cruz, supported by NSF grant IIS-0430994 and NSF grant IIS-0905276. Wang-Chiew Tan is on leave from UC Santa Cruz, and this work was supported by an NSF CAREER award IIS-0347065.

References

1. Alexe, B., Gubanov, M., Hernández, M.A., Ho, H., Huang, J.W., Katsis, Y., Popa, L., Saha, B., Stanoi, I.: Simplifying information integration: object-based flow-of-mappings framework for integration. In: BIRTE, pp. 108–121. Springer, Berlin (2009)

2. Alexe, B., Hernández, M.A., Popa, L., Tan, W.C.: MapMerge: correlating independent schema mappings. PVLDB 3(1), 81–92 (2010)

3. Beeri, C., Vardi, M.Y.: A proof procedure for data dependencies. JACM 31(4), 718–741 (1984)

4. Bonifati, A., Chang, E.Q., Ho, T., Lakshmanan, L.V.S., Pottinger, R.: HePToX: marrying XML and heterogeneity in your P2P databases. In: VLDB, pp. 1267–1270 (2005)

5. Dessloch, S., Hernández, M.A., Wisnesky, R., Radwan, A., Zhou, J.: Orchid: integrating schema mapping and ETL. In: ICDE, pp. 1307–1316 (2008)

6. Eiter, T., Mannila, H.: Distance measures for point sets and their computation. Acta Inform. 34(2), 109–133 (1997)

7. Fagin, R., Haas, L.M., Hernández, M.A., Miller, R.J., Popa, L., Velegrakis, Y.: Clio: schema mapping creation and data exchange. In: Borgida, A., Chaudhri, V.K., Giorgini, P., Yu, E.S.K. (eds.) Conceptual Modeling: Foundations and Applications, pp. 198–236. Springer, Berlin (2009)

8. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. TCS 336(1), 89–124 (2005)

9. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.: Composing schema mappings: second-order dependencies to the rescue. TODS 30(4), 994–1055 (2005)

10. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Reverse data exchange: coping with nulls. In: PODS, pp. 23–32 (2009)

11. Fuxman, A., Hernández, M.A., Ho, C.T.H., Miller, R.J., Papotti, P., Popa, L.: Nested mappings: schema mapping reloaded. In: VLDB, pp. 67–78 (2006)

12. Galindo-Legaria, C.A.: Outerjoins as disjunctions. In: SIGMOD Conference, pp. 348–358 (1994)

13. Kolaitis, P.G.: Schema mappings, data exchange, and metadata management. In: PODS, pp. 61–75 (2005)

14. Lenzerini, M.: Data integration: a theoretical perspective. In: PODS, pp. 233–246 (2002)

15. Madhavan, J., Halevy, A.Y.: Composing mappings among data sources. In: VLDB, pp. 572–583 (2003)

16. Maier, D., Mendelzon, A.O., Sagiv, Y.: Testing implications of data dependencies. TODS 4(4), 455–469 (1979)

17. Melnik, S., Bernstein, P.A., Halevy, A.Y., Rahm, E.: Supporting executable mappings in model management. In: SIGMOD, pp. 167–178 (2005)

18. Nash, A., Bernstein, P.A., Melnik, S.: Composition of mappings given by embedded dependencies. ACM Trans. Database Syst. 32(1), 4 (2007)

19. Popa, L., Velegrakis, Y., Miller, R.J., Hernández, M.A., Fagin, R.: Translating web data. In: VLDB, pp. 598–609 (2002)

20. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)

21. Rajaraman, A., Ullman, J.D.: Integrating information by outerjoins and full disjunctions. In: PODS, pp. 238–248 (1996)

22. Simitsis, A., Vassiliadis, P., Sellis, T.K.: Optimizing ETL processes in data warehouses. In: ICDE, pp. 564–575 (2005)

23. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Modeling ETL activities as graphs. In: DMDW, pp. 52–61 (2002)

24. Velegrakis, Y., Miller, R.J., Popa, L.: Mapping adaptation under evolving schemas. In: VLDB, pp. 584–595 (2003)

25. Yu, C., Popa, L.: Semantic adaptation of schema mappings when schemas evolve. In: VLDB, pp. 1006–1017 (2005)
