ChaseFUN: a Data Exchange Engine for Functional Dependencies at Scale

Angela Bonifati (CNRS LIRIS, Univ Lyon 1), Ioana Ileana (INSERM IDS, Univ Paris Descartes), Michele Linardi (Univ Paris Descartes)

1. ChaseFUN: motivation and goals

ChaseFUN is the first Data Exchange (DE) engine targeting

• piecemeal processing and parallelization of DE constraints during the chase by playing with constraint ordering and interaction; and

• interplay and interaction among constraints in order to reduce the size of intermediate results of the chase while providing granular insight into it.

As such, ChaseFUN addresses the coverage and efficient support of target functional dependencies lacking in available DE engines.

2. Classical DE Setting via an example

Data Exchange is the process of taking data structured under a source schema and transforming it into data structured under a target schema, by exploiting the dependencies of the schemas.

Active_Actors
name      surname    age
Leonardo  Di Caprio  40
John      Redmayne   33

Awarded_Actor
name     surname     oscarName   year
John     Redmayne    Best Actor  2014
Wallace  Beery       Best Actor  1932
Fredric  March       Best Actor  1932
Marlon   Brando Jr.  Best Actor  1954
Marlon   Brando Jr.  Best Actor  1972

Actor_Collaboration
name1     surname1   name2    surname2
Leonardo  Di Caprio  Matthew  David
Fredric   March      Miriam   Hopkins

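S-t tgds:

$m_1 : \text{Active\_Actor}(n, s, a) \rightarrow \text{Actor}(n, s, Y_1, Y_2)$

$m_2 : \text{Awarded\_Actor}(n_1, s_1, p_1, w_1) \rightarrow \text{Actor}(n_1, s_1, T, T_1) \wedge \text{Oscar\_Prize}(p_1, w_1, T)$

$m_3 : \text{Actor\_Coll}(n_1, s_1, n_2, s_2) \rightarrow \text{Actor}(n_1, s_1, E_1, E_2) \wedge \text{Actor}(n_2, s_2, E_3, E_2)$

Target fds:

$e_1 : \text{Actor}(n, s, p, w) \wedge \text{Actor}(n, s, p', w') \rightarrow (p = p') \wedge (w = w')$

$e_2 : \text{Oscar\_Prize}(p, w, z) \wedge \text{Oscar\_Prize}(p, w, z') \rightarrow (z = z')$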

Classical DE Chase phase 1: Apply s-t tgds

$m_1 : \text{Active\_Actor}(n, s, a) \rightarrow \text{Actor}(n, s, Y_1, Y_2)$

Pre-solution - Actor table
name      surname    idRewarding  idClub
Leonardo  Di Caprio  N1           N2
John      Redmayne   N3           N4
...       ...        ...          ...

Classical DE Chase phase 2: Apply target fds

Pre-solution - Actor table
name      surname    idRewarding  idClub
Leonardo  Di Caprio  N1           N2
...       ...        ...          ...

Target fd $e_1 : \text{Actor}(n, s, p, w) \wedge \text{Actor}(n, s, p', w') \rightarrow (p = p') \wedge (w = w')$

Solution - Actor table
name  surname  idRewarding  idClub
...   ...      ...          ...

Actor
name      surname     idReward  idClub
John      Redmayne    N5        N6
Wallace   Beery       N7        N8
Marlon    Brando Jr.  N13       N14
Leonardo  Di Caprio   N15       N16
Matthew   David       N17       N16
Fredric   March       N18       N19
Miriam    Hopkins     N20       N19

Oscar_Prize
oscarName   year  idActor
Best Actor  2014  N5
Best Actor  1932  N7
Best Actor  1954  N13
Best Actor  1972  N13
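The two classical phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ChaseFUN implementation: the `Null` class and the functions `apply_st_tgds` and `apply_fds` are hypothetical names, the three tgds are hard-coded, and the labels of the generated nulls will not match the N-labels shown in the tables.

```python
from itertools import count

class Null:
    """A labeled null; compared by identity, the label is only for display."""
    _ids = count(1)
    def __init__(self):
        self.label = f"N{next(Null._ids)}"
    def __repr__(self):
        return self.label

def apply_st_tgds(active, awarded, collab):
    """Phase 1: fire m1, m2, m3 once per source tuple (pre-solution)."""
    actor, oscar = [], []
    for n, s, _age in active:                  # m1
        actor.append([n, s, Null(), Null()])
    for n, s, prize, year in awarded:          # m2: T is shared by both atoms
        t = Null()
        actor.append([n, s, t, Null()])
        oscar.append([prize, year, t])
    for n1, s1, n2, s2 in collab:              # m3: E2 is shared by both atoms
        club = Null()
        actor.append([n1, s1, Null(), club])
        actor.append([n2, s2, Null(), club])
    return actor, oscar

def dedup(rows):
    """Drop duplicate rows while keeping their first-seen order."""
    return list(dict.fromkeys(tuple(r) for r in rows))

def apply_fds(actor, oscar):
    """Phase 2: enforce e1 and e2 until no more values can be equated."""
    tables = [actor, oscar]
    fds = [(actor, (0, 1), (2, 3)),            # e1: name, surname -> ids
           (oscar, (0, 1), (2,))]              # e2: oscarName, year -> idActor
    changed = True
    while changed:
        changed = False
        for rows, key, dependent in fds:
            first_seen = {}
            for row in rows:
                witness = first_seen.setdefault(tuple(row[i] for i in key), row)
                for i in dependent:
                    a, b = witness[i], row[i]
                    if a == b:
                        continue
                    if not isinstance(a, Null) and not isinstance(b, Null):
                        raise ValueError("chase failure: two distinct constants")
                    old, new = (b, a) if isinstance(b, Null) else (a, b)
                    for table in tables:       # replace the null everywhere
                        for r in table:
                            r[:] = [new if v is old else v for v in r]
                    changed = True
    return dedup(actor), dedup(oscar)

active = [("Leonardo", "Di Caprio", 40), ("John", "Redmayne", 33)]
awarded = [("John", "Redmayne", "Best Actor", 2014),
           ("Wallace", "Beery", "Best Actor", 1932),
           ("Fredric", "March", "Best Actor", 1932),
           ("Marlon", "Brando Jr.", "Best Actor", 1954),
           ("Marlon", "Brando Jr.", "Best Actor", 1972)]
collab = [("Leonardo", "Di Caprio", "Matthew", "David"),
          ("Fredric", "March", "Miriam", "Hopkins")]

actor, oscar = apply_fds(*apply_st_tgds(active, awarded, collab))
```

Note that phase 2 scans the whole pre-solution on every pass, which is the fd application scope problem discussed in the next section.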

3. The Interleaved Chase for efficient and scalable DE

Big Classical DE Chase Issue: fd-induced overhead, due to the often very large fd application scope (pre-solution...)

The Interleaved Chase, at the heart of ChaseFUN, mitigates fd-induced overhead by cleverly taming fd application scope.

The Interleaved Chase plays on s-t tgd assignments, i.e. mappings from s-t tgd variables to constants or labeled nulls. Assignments are constructed in an initial form for every DE scenario, and then chased:

• Chase steps with tgds add assignments to a target assignment set.

• Steps with egds (fds) change assignments in the target set.

• The (intermediate) target solution can be obtained at any point by materializing the target assignment set, i.e. producing atoms in the s-t tgd heads with the values in the assignments' images.
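The assignment-based view can be illustrated with a small Python sketch. It assumes that assignments are plain dictionaries, that labeled nulls are written as strings such as "N1", and that tgd heads are stored in a lookup table; `HEADS`, `materialize` and `equate` are hypothetical names chosen for this illustration, not ChaseFUN's API.

```python
# Head atoms of the three example s-t tgds: (relation, variables).
HEADS = {
    "m1": [("Actor", ("n", "s", "Y1", "Y2"))],
    "m2": [("Actor", ("n1", "s1", "T", "T1")),
           ("Oscar_Prize", ("p1", "w1", "T"))],
    "m3": [("Actor", ("n1", "s1", "E1", "E2")),
           ("Actor", ("n2", "s2", "E3", "E2"))],
}

def materialize(target_set):
    """Produce the head atoms of every (tgd, assignment) in the target set."""
    facts = set()
    for tgd, assignment in target_set:
        for relation, variables in HEADS[tgd]:
            facts.add((relation, tuple(assignment[v] for v in variables)))
    return facts

def equate(target_set, old, new):
    """An egd (fd) step: replace the value `old` by `new` in all assignments."""
    for _tgd, assignment in target_set:
        for var, value in assignment.items():
            if value == old:
                assignment[var] = new

# Two assignments from the running example (age taken from Active_Actors).
a_m1 = {"n": "Leonardo", "s": "Di Caprio", "a": 40, "Y1": "N1", "Y2": "N2"}
a_m3 = {"n1": "Leonardo", "s1": "Di Caprio", "n2": "Matthew", "s2": "David",
        "E1": "N15", "E2": "N16", "E3": "N17"}
target_set = [("m1", a_m1), ("m3", a_m3)]

before = materialize(target_set)   # three Actor atoms
equate(target_set, "N1", "N15")    # e1 equates the two idRewarding nulls
equate(target_set, "N2", "N16")    # e1 equates the two idClub nulls
after = materialize(target_set)    # two Actor atoms remain
```

The egd steps only touch the assignments; target atoms are produced on demand, which is what allows an intermediate solution to be inspected at any point.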

4. Saturation Sets and Overlaps

Fds apply on the target assignment set. How can one keep the target set small across the chase, so as to tame fd scope?

IDEA: split the chase into small, independent units!

Saturation Sets are subsets S of the set of tgd assignments A for a scenario, such that an assignment in S is guaranteed never to interact via fd application with an assignment in $A \setminus S$. Saturation Sets are independent chase units!

The Interleaved Chase (and ChaseFUN)'s mission: building small Saturation Sets, by grouping together assignments that interact now or are suspected to interact later:

• Overlap of two assignments = pairs of equal (or prone to be equal) variables and the corresponding involved fd.

• Two overlapping assignments are placed in the same Saturation Set.

Example: the assignments

$a^{1}_{m_1}$ = {n: Leonardo, s: Di Caprio, a: 40, Y1: N1, Y2: N2}

$a^{1}_{m_3}$ = {n1: Leonardo, s1: Di Caprio, n2: Matthew, s2: David, E1: N15, E2: N16, E3: N17}

overlap on $\langle n, n_1 \rangle$, $\langle s, s_1 \rangle$ and $e_1$.
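Overlap detection and grouping can be sketched as follows. This is a simplified illustration under explicit assumptions: it only detects assignments whose fd left-hand-side values are equal right now (it does not handle variables that are merely prone to become equal), it compares all pairs of assignments, and it groups them with a union-find. The resulting groups are therefore only an approximation of Saturation Sets. `HEADS`, `FDS`, `key_images`, `overlaps` and `group_by_overlap` are hypothetical names.

```python
from itertools import combinations

HEADS = {
    "m1": [("Actor", ("n", "s", "Y1", "Y2"))],
    "m3": [("Actor", ("n1", "s1", "E1", "E2")),
           ("Actor", ("n2", "s2", "E3", "E2"))],
}
# Each fd: (relation, positions of its left-hand side).
FDS = {"e1": ("Actor", (0, 1)), "e2": ("Oscar_Prize", (0, 1))}

def key_images(tgd, assignment):
    """Yield (fd, lhs variables, lhs values) for each head atom an fd covers."""
    for relation, variables in HEADS[tgd]:
        for fd, (fd_relation, positions) in FDS.items():
            if relation == fd_relation:
                lhs = tuple(variables[i] for i in positions)
                yield fd, lhs, tuple(assignment[v] for v in lhs)

def overlaps(a, b):
    """Return the overlaps of two (tgd, assignment) items as (pairs, fd)."""
    found = []
    for fd_a, vars_a, vals_a in key_images(*a):
        for fd_b, vars_b, vals_b in key_images(*b):
            if fd_a == fd_b and vals_a == vals_b:
                found.append((tuple(zip(vars_a, vars_b)), fd_a))
    return found

def group_by_overlap(items):
    """Union-find grouping: overlapping assignments end up in one group."""
    parent = list(range(len(items)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i, j in combinations(range(len(items)), 2):
        if overlaps(items[i], items[j]):
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

items = [
    ("m1", {"n": "Leonardo", "s": "Di Caprio", "a": 40, "Y1": "N1", "Y2": "N2"}),
    ("m3", {"n1": "Leonardo", "s1": "Di Caprio", "n2": "Matthew", "s2": "David",
            "E1": "N15", "E2": "N16", "E3": "N17"}),
    ("m1", {"n": "John", "s": "Redmayne", "a": 33, "Y1": "N3", "Y2": "N4"}),
]
found = overlaps(items[0], items[1])
groups = group_by_overlap(items)
```

On this input, the first two assignments overlap on the pairs (n, n1), (s, s1) with e1 and land in the same group, while the John Redmayne assignment stays alone.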

5. The Conflict Graph and parallelization

The Conflict Graph is a structure accounting for constraint interaction and a helper for finding overlapping assignments. Conflict Graph nodes correspond to s-t tgds, whereas an edge means that the two tgds may have overlapping assignments, further characterized by conflict areas adorning nodes.

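[Figure: Conflict Graph with nodes v1, v2, v3]

$Areas(v_1) = \{ca^{1}_{1} = \langle (n, s), e_1 \rangle\}$.

$Areas(v_2) = \{ca^{1}_{2} = \langle (n_1, s_1), e_1 \rangle\}$.

$Areas(v_3) = \{ca^{1}_{3} = \langle (n_2, s_2), e_1 \rangle, ca^{2}_{3} = \langle (n_3, s_3), e_1 \rangle\}$.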

Good news: a Saturation Set cannot span two distinct connected components of the graph. Besides speeding up Saturation Set construction, the Conflict Graph thus provides ChaseFUN with a parallelization opportunity, enhancing its scalability and overall speed!
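The parallelization opportunity can be sketched by computing the connected components of a Conflict Graph and handing each one to a separate worker. The sketch rests on several assumptions: an edge is drawn between two tgds whenever they carry conflict areas on the same fd (a simplification of the edge definition), node `v4` and fd `e9` are hypothetical additions used only to obtain a second component, and `chase_component` is a placeholder for the actual per-component chase.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def build_adjacency(areas):
    """Assumption: two tgds conflict when they have areas on the same fd."""
    by_fd = defaultdict(set)
    for node, node_areas in areas.items():
        for _variables, fd in node_areas:
            by_fd[fd].add(node)
    adjacency = {node: set() for node in areas}
    for nodes in by_fd.values():
        for node in nodes:
            adjacency[node] |= nodes - {node}
    return adjacency

def connected_components(adjacency):
    """Depth-first search over the graph, one component per unseen node."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

def chase_component(tgds):
    """Placeholder: a real engine would chase this component's assignments."""
    return "chased " + ", ".join(tgds)

areas = {
    "v1": [(("n", "s"), "e1")],
    "v2": [(("n1", "s1"), "e1")],
    "v3": [(("n2", "s2"), "e1"), (("n3", "s3"), "e1")],
    "v4": [(("x",), "e9")],            # hypothetical node, not in the poster
}
components = connected_components(build_adjacency(areas))

# Components are independent, so they can be processed concurrently.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(chase_component, components))
```

Threads are used here only to keep the sketch short; for CPU-bound work in Python, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface.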

6. Comparative performance assessment

We stress-tested ChaseFUN using several large, iBench-generated scenarios: OF scenarios (iBench Object Fusion), OF+ scenarios (OF + iBench Vertical Partitioning), and OF++ scenarios (augmented OF+). On those, we also compared ChaseFUN to one of the fastest state-of-the-art DE engines: the Llunatic engine.

Scenarios
SCENARIO  s-t tgds  OF       OF+      OF++     # source tuples
A         15        5 egds   10 egds  15 egds  500K
C         45        15 egds  30 egds  45 egds  1.5M
F         90        30 egds  60 egds  90 egds  3M

References

[1] A. Bonifati, I. Ileana, M. Linardi. Functional Dependencies Unleashed for Scalable Data Exchange. In Proc. of SSDBM, 2016. https://arxiv.org/abs/1602.00563
