
ChaseFUN: a Data Exchange Engine for Functional Dependencies at Scale

Angela Bonifati (CNRS LIRIS, Univ Lyon 1), Ioana Ileana (INSERM IDS, Univ Paris Descartes), Michele Linardi (Univ Paris Descartes)

1. ChaseFUN: motivation and goals
ChaseFUN is the first Data Exchange (DE) engine targeting

• piecemeal processing and parallelization of DE constraints during the chase, by playing with constraint ordering and interaction; and

• the interplay among constraints, exploited to reduce the size of the intermediate results of the chase while providing granular insight into it.

As such, ChaseFUN addresses the coverage and efficient support of target functional dependencies, which are lacking in available DE engines.

2. Classical DE Setting via an example
Data Exchange is the process of taking data structured under a source schema and transforming it into data structured under a target schema, by exploiting the dependencies of the schemas.

Active_Actors
name      surname    age
Leonardo  Di Caprio  40
John      Redmayne   33

Awarded_Actor
name     surname     oscarName   year
John     Redmayne    Best Actor  2014
Wallace  Beery       Best Actor  1932
Fredric  March       Best Actor  1932
Marlon   Brando Jr.  Best Actor  1954
Marlon   Brando Jr.  Best Actor  1972

Actor_Collaboration
name1     surname1   name2    surname2
Leonardo  Di Caprio  Matthew  David
Fredric   March      Miriam   Hopkins

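S-t tgds:

$m_1 : \mathit{Active\_Actor}(n, s, a) \rightarrow \mathit{Actor}(n, s, Y_1, Y_2)$

$m_2 : \mathit{Awarded\_Actor}(n_1, s_1, p_1, w_1) \rightarrow \mathit{Actor}(n_1, s_1, T, T_1) \wedge \mathit{Oscar\_Prize}(p_1, w_1, T)$

$m_3 : \mathit{Actor\_Coll}(n_1, s_1, n_2, s_2) \rightarrow \mathit{Actor}(n_1, s_1, E_1, E_2) \wedge \mathit{Actor}(n_2, s_2, E_3, E_2)$

Target fds:

$e_1 : \mathit{Actor}(n, s, p, w) \wedge \mathit{Actor}(n, s, p', w') \rightarrow (p = p') \wedge (w = w')$

$e_2 : \mathit{Oscar\_Prize}(p, w, z) \wedge \mathit{Oscar\_Prize}(p, w, z') \rightarrow (z = z')$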

Classical DE Chase phase 1: Apply s-t tgds

$m_1 : \mathit{Active\_Actor}(n, s, a) \rightarrow \mathit{Actor}(n, s, Y_1, Y_2)$

Pre-solution - Actor table
name      surname    idRewarding  idClub
Leonardo  Di Caprio  N1           N2
John      Redmayne   N3           N4
...       ...        ...          ...
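Phase 1 can be illustrated with a minimal Python sketch for the single tgd m1. It is not ChaseFUN's implementation: the helper names `fresh_null` and `apply_m1` are hypothetical, and labeled nulls are simply represented as strings N1, N2, ...

```python
from itertools import count

# Source instance from the running example (Active_Actors table).
active_actors = [("Leonardo", "Di Caprio", 40), ("John", "Redmayne", 33)]

_fresh = count(1)


def fresh_null():
    """Return a new labeled null N1, N2, ... (hypothetical naming helper)."""
    return f"N{next(_fresh)}"


def apply_m1(source_rows):
    """m1: Active_Actor(n, s, a) -> Actor(n, s, Y1, Y2).

    Y1 and Y2 do not occur in the body, so each source tuple
    receives two fresh labeled nulls in the target tuple.
    """
    return [(n, s, fresh_null(), fresh_null()) for (n, s, _age) in source_rows]


actor_pre_solution = apply_m1(active_actors)
for row in actor_pre_solution:
    print(row)
```

Under these assumptions the two printed tuples coincide with the first two rows of the pre-solution Actor table.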

Classical DE Chase phase 2: Apply target fds

Pre-solution - Actor table
name      surname    idRewarding  idClub
Leonardo  Di Caprio  N1           N2
Leonardo  Di Caprio  N15          N16
...       ...        ...          ...

Target fd $e_1 : \mathit{Actor}(n, s, p, w) \wedge \mathit{Actor}(n, s, p', w') \rightarrow (p = p') \wedge (w = w')$

Solution - Actor table
name      surname    idRewarding  idClub
Leonardo  Di Caprio  N15          N16
...       ...        ...          ...

Actor
name      surname     idReward  idClub
John      Redmayne    N5        N6
Wallace   Beery       N7        N8
Marlon    Brando Jr.  N13       N14
Leonardo  Di Caprio   N15       N16
Matthew   David       N17       N16
Fredric   March       N18       N19
Miriam    Hopkins     N20       N19

Oscar_Prize
oscarName   year  idActor
Best Actor  2014  N5
Best Actor  1932  N7
Best Actor  1954  N13
Best Actor  1972  N13
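The effect of e1 on the pre-solution can be sketched as follows. This is only an illustration under simplifying assumptions: the last two columns are assumed to hold labeled nulls only (equating two distinct constants would make the chase fail, which is not handled here), and the union-find helper and the choice of which null survives are ours, not the engine's.

```python
def chase_fd_e1(actor_rows):
    """Enforce e1: Actor rows agreeing on (name, surname) agree on the rest."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # arbitrary choice: the later null survives

    first_seen = {}
    for name, surname, p, w in actor_rows:
        key = (name, surname)
        if key in first_seen:
            p0, w0 = first_seen[key]
            union(p0, p)
            union(w0, w)
        else:
            first_seen[key] = (p, w)

    # Rewrite every row with the representative nulls and drop duplicates.
    return sorted({(n, s, find(p), find(w)) for n, s, p, w in actor_rows})


pre_solution = [
    ("Leonardo", "Di Caprio", "N1", "N2"),
    ("Leonardo", "Di Caprio", "N15", "N16"),
    ("John", "Redmayne", "N3", "N4"),
]
solution = chase_fd_e1(pre_solution)
print(solution)
```

Note that this naive version scans the whole pre-solution for a single fd; this is the kind of large fd application scope that the remainder of the poster sets out to reduce.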

3. The Interleaved Chase for efficient and scalable DE
The big issue of the classical DE chase: fd-induced overhead, due to the often very large fd application scope (pre-solution...)

The Interleaved Chase, at the heart of ChaseFUN, mitigates fd-induced overhead by restricting the fd application scope.

The Interleaved Chase operates on s-t tgd assignments, i.e. mappings from s-t tgd variables to constants or labeled nulls. Assignments are constructed in an initial form for every DE scenario, and then chased:

• Chase steps with tgds add assignments to a target assignment set.

• Chase steps with egds (fds) change assignments in the target set.

• The (intermediate) target solution can be obtained at any point by materializing the target assignment set, i.e. producing the atoms in the s-t tgd heads with the values in the assignments' images.
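The three points above can be sketched in Python. The representation is an assumption made for illustration: an assignment is a dict from tgd variables to values, tgd heads are lists of (relation, variables) pairs, and the function names `materialize` and `apply_egd_step` are hypothetical.

```python
# Heads of two s-t tgds from the running example (variable names as on the poster).
HEADS = {
    "m1": [("Actor", ("n", "s", "Y1", "Y2"))],
    "m3": [("Actor", ("n1", "s1", "E1", "E2")),
           ("Actor", ("n2", "s2", "E3", "E2"))],
}


def materialize(target_assignments):
    """Produce the target atoms of a list of (tgd name, assignment) pairs."""
    atoms = set()
    for tgd, a in target_assignments:
        for rel, variables in HEADS[tgd]:
            atoms.add((rel,) + tuple(a[v] for v in variables))
    return atoms


def apply_egd_step(target_assignments, old, new):
    """An egd step changes assignments: replace null `old` by `new` everywhere."""
    return [(tgd, {k: (new if v == old else v) for k, v in a.items()})
            for tgd, a in target_assignments]


# Age taken from the Active_Actors source table.
a_m1 = {"n": "Leonardo", "s": "Di Caprio", "a": 40, "Y1": "N1", "Y2": "N2"}
a_m3 = {"n1": "Leonardo", "s1": "Di Caprio", "n2": "Matthew", "s2": "David",
        "E1": "N15", "E2": "N16", "E3": "N17"}

target = [("m1", a_m1), ("m3", a_m3)]   # tgd steps added both assignments
before = materialize(target)            # three Actor atoms
target = apply_egd_step(target, "N1", "N15")
target = apply_egd_step(target, "N2", "N16")
after = materialize(target)             # the two Leonardo atoms have merged
print(sorted(after))
```

Keeping assignments rather than atoms as the chased objects is what allows the intermediate solution to be produced on demand at any point.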

4. Saturation Sets and Overlaps
Fds apply on the target assignment set. How can one keep the target set small across the chase, so as to limit the fd scope?

IDEA: split the chase into small, independent units!

Saturation Sets are subsets $S$ of the set $A$ of tgd assignments for a scenario, such that an assignment in $S$ is guaranteed never to interact via fd application with an assignment in $A \setminus S$. Saturation Sets are independent chase units!

The mission of the Interleaved Chase (and of ChaseFUN): building small Saturation Sets, by grouping together assignments that interact now or are suspected to interact later:

• Overlap of two assignments = pairs of equal (or prone to be equal) variables, together with the fd involved.

• Two overlapping assignments are placed in the same Saturation Set.

Example: the assignments

$a_1^{m_1}$ = {n:Leonardo, s:Di Caprio, a:40, Y1:N1, Y2:N2}

$a_1^{m_3}$ = {n1:Leonardo, s1:Di Caprio, n2:Matthew, s2:David, E1:N15, E2:N16, E3:N17}

overlap on $\langle n, n_1 \rangle$, $\langle s, s_1 \rangle$ and $e_1$.
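The overlap in this example can be checked with a short sketch. It covers only the "equal now" case for the fd e1 (key positions name and surname of Actor); variables that are merely prone to become equal are not handled, and the function name `overlaps` and the head encoding are assumptions made for illustration.

```python
ACTOR_KEY = (0, 1)  # positions of name and surname, the left-hand side of e1

HEADS = {
    "m1": [("Actor", ("n", "s", "Y1", "Y2"))],
    "m3": [("Actor", ("n1", "s1", "E1", "E2")),
           ("Actor", ("n2", "s2", "E3", "E2"))],
}


def overlaps(tgd_a, a, tgd_b, b):
    """Return the overlaps of two assignments with respect to e1.

    Each overlap is (list of variable pairs, fd name): the pairs are the
    key variables of two Actor head atoms that currently carry equal values.
    """
    found = []
    for rel_a, vars_a in HEADS[tgd_a]:
        for rel_b, vars_b in HEADS[tgd_b]:
            if rel_a == rel_b == "Actor":
                pairs = [(vars_a[i], vars_b[i]) for i in ACTOR_KEY]
                if all(a[x] == b[y] for x, y in pairs):
                    found.append((pairs, "e1"))
    return found


# Age taken from the Active_Actors source table.
a_m1 = {"n": "Leonardo", "s": "Di Caprio", "a": 40, "Y1": "N1", "Y2": "N2"}
a_m3 = {"n1": "Leonardo", "s1": "Di Caprio", "n2": "Matthew", "s2": "David",
        "E1": "N15", "E2": "N16", "E3": "N17"}

result = overlaps("m1", a_m1, "m3", a_m3)
print(result)
```

A non-empty result means the two assignments belong to the same Saturation Set; the second Actor atom of m3 (Matthew David) yields no overlap with the m1 assignment.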

5. The Conflict Graph and parallelization
The Conflict Graph is a structure that accounts for constraint interaction and helps find overlapping assignments. Conflict Graph nodes correspond to s-t tgds, whereas an edge means that the two tgds may have overlapping assignments, further characterized by the conflict areas adorning the nodes.

[Figure: Conflict Graph with nodes v1, v2, v3]

$\mathit{Areas}(v_1) = \{ca_1^1 = \langle (n, s), e_1 \rangle\}$.

$\mathit{Areas}(v_2) = \{ca_2^1 = \langle (n_1, s_1), e_1 \rangle\}$.

$\mathit{Areas}(v_3) = \{ca_3^1 = \langle (n_2, s_2), e_1 \rangle, ca_3^2 = \langle (n_3, s_3), e_1 \rangle\}$.

Good news: a Saturation Set cannot span two distinct connected components of the graph. Besides speeding up Saturation Set construction, the Conflict Graph thus provides ChaseFUN with a parallelization opportunity, enhancing its scalability and overall speed!
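The parallelization opportunity can be sketched as follows: compute the connected components of the Conflict Graph, then process each component independently. Everything concrete here is an assumption for illustration: the edge list (the poster's figure is not available in this text), the extra nodes v4 and v5, and the placeholder `chase_component`, which stands in for the actual chase of one component.

```python
from concurrent.futures import ThreadPoolExecutor


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph, as sorted lists."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in sorted(adj[u]):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(comp))
    return components


def chase_component(component):
    """Hypothetical placeholder: chase the tgds of one component on its own."""
    return f"chased {component}"


# Hypothetical graph: v1-v3 as on the poster, v4 and v5 added for illustration.
nodes = ["v1", "v2", "v3", "v4", "v5"]
edges = [("v1", "v2"), ("v2", "v3"), ("v4", "v5")]

components = connected_components(nodes, edges)
with ThreadPoolExecutor() as pool:
    results = list(pool.map(chase_component, components))
print(results)
```

Since no Saturation Set spans two components, the per-component work shares no assignments and needs no synchronization; for CPU-bound work in Python a process pool would be the more natural choice than threads.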

6. Comparative performance assessment
We stress-tested ChaseFUN using several large, iBench-generated scenarios: OF scenarios (iBench Object Fusion), OF+ scenarios (OF + iBench Vertical Partitioning), and OF++ scenarios (augmented OF+). On those, we also compared ChaseFUN to one of the fastest state-of-the-art DE engines: the Llunatic engine.

Scenarios
SCENARIO  s-t tgds  OF       OF+      OF++     # source tuples
A         15        5 egds   10 egds  15 egds  500K
C         45        15 egds  30 egds  45 egds  1.5M
F         90        30 egds  60 egds  90 egds  3M

References
[1] A. Bonifati, I. Ileana, M. Linardi. Functional Dependencies Unleashed for Scalable Data Exchange. In Proc. of SSDBM, 2016. https://arxiv.org/abs/1602.00563