Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D...


Distributed Query Processing

Agenda

• Recap of query optimization

• Transformation rules for P&D systems

• Memoization

• Queries in heterogeneous systems

• Query evaluation strategies

• Eddies

Introduction

• Alternative ways of evaluating a given query

– Equivalent expressions

– Different algorithms for each operation (Chapter 13)

• Cost difference between a good and a bad way of evaluating a query can be enormous

– Example: performing $r \times s$ followed by a selection r.A = s.B is much slower than performing a join on the same condition

• Need to estimate the cost of operations

– Depends critically on statistical information about relations, which the database must maintain

– Need to estimate statistics for intermediate results to compute the cost of complex expressions
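The cost gap in the example above can be made concrete with a small sketch. It assumes relations are plain Python lists of dicts; the function names and the hash join (one possible join algorithm, not one the slides prescribe) are illustrative only.

```python
from itertools import product

def cross_then_select(r, s):
    """sigma_{r.A = s.B}(r x s): builds all |r|*|s| pairs, then filters them."""
    return [(x, y) for x, y in product(r, s) if x["A"] == y["B"]]

def hash_join(r, s):
    """Join on r.A = s.B: index s on B once, then probe the index per r tuple."""
    index = {}
    for y in s:
        index.setdefault(y["B"], []).append(y)
    return [(x, y) for x in r for y in index.get(x["A"], [])]

# Hypothetical data: 200 tuples each, so the cross product has 40,000 pairs.
r = [{"A": i} for i in range(200)]
s = [{"B": i % 50} for i in range(200)]
same = cross_then_select(r, s) == hash_join(r, s)
```

Both functions return the same tuples; the first touches every pair of the cross product, the second only the pairs that actually match.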

Introduction (Cont.)

Relations generated by two equivalent expressions have the same set of attributes and contain the same set of tuples, although their attributes may be ordered differently.

Introduction (Cont.)

• Generation of query-evaluation plans for an expression involves several steps:

1. Generating logically equivalent expressions

• Use equivalence rules to transform an expression into an equivalent one.

2. Annotating resultant expressions to get alternative query plans

3. Choosing the cheapest plan based on estimated cost

• The overall process is called cost based optimization.

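Equivalence Rules

1. Conjunctive selection operations can be deconstructed into a sequence of individual selections.

$\sigma_{\theta_1 \wedge \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E))$

2. Selection operations are commutative.

$\sigma_{\theta_1}(\sigma_{\theta_2}(E)) = \sigma_{\theta_2}(\sigma_{\theta_1}(E))$

3. Only the last in a sequence of projection operations is needed, the others can be omitted.

$\Pi_{t_1}(\Pi_{t_2}(\ldots(\Pi_{t_n}(E))\ldots)) = \Pi_{t_1}(E)$

4. Selections can be combined with Cartesian products and theta joins.

a. $\sigma_{\theta}(E_1 \times E_2) = E_1 \bowtie_{\theta} E_2$

b. $\sigma_{\theta_1}(E_1 \bowtie_{\theta_2} E_2) = E_1 \bowtie_{\theta_1 \wedge \theta_2} E_2$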

Equivalence Rules (Cont.)

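5. Theta-join operations (and natural joins) are commutative.

$E_1 \bowtie_{\theta} E_2 = E_2 \bowtie_{\theta} E_1$

6. (a) Natural join operations are associative:

$(E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3)$

(b) Theta joins are associative in the following manner:

$(E_1 \bowtie_{\theta_1} E_2) \bowtie_{\theta_2 \wedge \theta_3} E_3 = E_1 \bowtie_{\theta_1 \wedge \theta_3} (E_2 \bowtie_{\theta_2} E_3)$

where $\theta_2$ involves attributes from only E2 and E3.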

Pictorial Depiction of Equivalence Rules

Equivalence Rules (Cont.)

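7. The selection operation distributes over the theta join operation under the following two conditions:

(a) When all the attributes in $\theta_0$ involve only the attributes of one of the expressions (E1) being joined.

$\sigma_{\theta_0}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_0}(E_1)) \bowtie_{\theta} E_2$

(b) When $\theta_1$ involves only the attributes of E1 and $\theta_2$ involves only the attributes of E2.

$\sigma_{\theta_1 \wedge \theta_2}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_1}(E_1)) \bowtie_{\theta} (\sigma_{\theta_2}(E_2))$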
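Rule 7(a) can be checked on a toy instance. The sketch below assumes relations are lists of dicts; `theta_join`, `select` and the sample data are hypothetical helpers for illustration, not part of the slides.

```python
def theta_join(e1, e2, theta):
    """Nested-loop theta join: keep every pair for which theta holds."""
    return [{**a, **b} for a in e1 for b in e2 if theta(a, b)]

def select(rows, pred):
    """Relational selection over a list of dict tuples."""
    return [row for row in rows if pred(row)]

e1 = [{"a": 1, "x": 5}, {"a": 2, "x": 50}]
e2 = [{"b": 1}, {"b": 2}, {"b": 3}]
theta = lambda s, t: s["a"] == t["b"]     # join condition
theta0 = lambda row: row["x"] > 10        # mentions only attributes of E1

late = select(theta_join(e1, e2, theta), theta0)    # selection after the join
early = theta_join(select(e1, theta0), e2, theta)   # selection pushed below the join
```

Both plans yield the same single tuple, but `early` feeds the join one E1 tuple instead of two, which is why optimizers push selections down.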

Equivalence Rules (Cont.)

8. The projection operation distributes over the theta join operation as follows:

(a) if $\theta$ involves only attributes from $L_1 \cup L_2$:

$\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = (\Pi_{L_1}(E_1)) \bowtie_{\theta} (\Pi_{L_2}(E_2))$

(b) Consider a join $E_1 \bowtie_{\theta} E_2$.

– Let L1 and L2 be sets of attributes from E1 and E2, respectively.

– Let L3 be attributes of E1 that are involved in join condition $\theta$, but are not in $L_1 \cup L_2$, and

– let L4 be attributes of E2 that are involved in join condition $\theta$, but are not in $L_1 \cup L_2$.

$\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = \Pi_{L_1 \cup L_2}((\Pi_{L_1 \cup L_3}(E_1)) \bowtie_{\theta} (\Pi_{L_2 \cup L_4}(E_2)))$

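Equivalence Rules (Cont.)

9. The set operations union and intersection are commutative

$E_1 \cup E_2 = E_2 \cup E_1$

$E_1 \cap E_2 = E_2 \cap E_1$

(set difference is not commutative).

10. Set union and intersection are associative.

$(E_1 \cup E_2) \cup E_3 = E_1 \cup (E_2 \cup E_3)$

$(E_1 \cap E_2) \cap E_3 = E_1 \cap (E_2 \cap E_3)$

11. The selection operation distributes over $\cup$, $\cap$ and $-$.

$\sigma_{\theta}(E_1 - E_2) = \sigma_{\theta}(E_1) - \sigma_{\theta}(E_2)$

and similarly for $\cup$ and $\cap$ in place of $-$

Also: $\sigma_{\theta}(E_1 - E_2) = \sigma_{\theta}(E_1) - E_2$

and similarly for $\cap$ in place of $-$, but not for $\cup$

12. The projection operation distributes over union

$\Pi_{L}(E_1 \cup E_2) = (\Pi_{L}(E_1)) \cup (\Pi_{L}(E_2))$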

Multiple Transformations (Cont.)

Optimizer strategies

• Heuristic

– Apply the transformation rules in a specific order such that the cost converges to a minimum

• Cost based

– Simulated annealing

– Randomized generation of candidate QEP

– Problem: how to guarantee randomness

Memoization Techniques

• How to generate alternative Query Evaluation Plans?

– Early generation systems centred around a tree representation of the plan

– Hardwired tree rewriting rules are deployed to enumerate part of the space of possible QEP

– For each alternative the total cost is determined

– The best (alternatives) are retained for execution

– Problems: very large space to explore, duplicate plans, local maxima, expensive query cost evaluation.

– SQL Server optimizer contains about 300 rules to be deployed.

Memoization Techniques

• How to generate alternative Query Evaluation Plans?

– Keep a memo of partial QEPs and their cost.

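– Use the heuristic rules to generate alternatives to build more complex QEPs

– $r_1 \bowtie r_2 \bowtie r_3 \bowtie r_4$

[Figure: memo levels for this join. Level 1 plans: r1 r2, r2 r3, r3 r4, r1 r4 (one marked with an x). Level 2 plans extend these with a further relation, and so on up to the Level n plans.]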
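The memo idea can be sketched as bottom-up enumeration that stores the cheapest plan for every set of relations. Everything concrete here is an assumption made for illustration: the cost model (sum of intermediate result sizes), the single uniform `selectivity`, the restriction to left-deep plans, and the cardinalities in the usage line.

```python
from itertools import combinations

def best_plan(cards, selectivity=0.01):
    """Return (cost, rows, plan) for joining all relations in cards."""
    # Level 1 of the memo: single relations cost nothing to "produce".
    memo = {frozenset([r]): (0.0, float(n), r) for r, n in cards.items()}
    names = sorted(cards)
    for level in range(2, len(names) + 1):
        for subset in map(frozenset, combinations(names, level)):
            for r in sorted(subset):
                # Reuse the memoized best plan for the subset without r.
                cost, rows, plan = memo[subset - {r}]
                out_rows = rows * cards[r] * selectivity
                cand = (cost + out_rows, out_rows, f"({plan} JOIN {r})")
                if subset not in memo or cand[0] < memo[subset][0]:
                    memo[subset] = cand
    return memo[frozenset(names)]

# Hypothetical cardinalities for r1..r4.
cost, rows, plan = best_plan({"r1": 1000, "r2": 10, "r3": 100, "r4": 10000})
```

Each partial plan is costed once and looked up afterwards, which avoids the duplicate plans and repeated cost evaluation listed as problems of tree rewriting.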

Distributed Query Processing

• For centralized systems, the primary criterion for measuring the cost of a particular strategy is the number of disk accesses.

• In a distributed system, other issues must be taken into account:

– The cost of a data transmission over the network.

– The potential gain in performance from having several sites process parts of the query in parallel.

Transformation rules for distributed systems

• Primary horizontally fragmented table:

– Rule 9: The union is commutative: $E_1 \cup E_2 = E_2 \cup E_1$

– Rule 10: Set union is associative: $(E_1 \cup E_2) \cup E_3 = E_1 \cup (E_2 \cup E_3)$

– Rule 12: The projection operation distributes over union:

$\Pi_{L}(E_1 \cup E_2) = (\Pi_{L}(E_1)) \cup (\Pi_{L}(E_2))$

• Derived horizontally fragmented table:

– The join through foreign-key dependency is already reflected in the fragmentation criteria
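For a primary horizontally fragmented table, Rule 12 means each site can project locally before anything is shipped. A minimal sketch, assuming two hypothetical fragments held as lists of dicts and set semantics for projection:

```python
def project(rows, attrs):
    """Projection with set semantics: duplicates are eliminated."""
    return {tuple(row[a] for a in attrs) for row in rows}

# Hypothetical horizontal fragments of one table, one per site.
frag_site1 = [{"id": 1, "city": "Delft", "balance": 10},
              {"id": 2, "city": "Leiden", "balance": 20}]
frag_site2 = [{"id": 3, "city": "Delft", "balance": 30}]

# Rule 12: project at each site, ship the small results, union at the query site.
local = project(frag_site1, ["city"]) | project(frag_site2, ["city"])

# Reference plan: ship whole fragments, rebuild the table, then project.
central = project(frag_site1 + frag_site2, ["city"])
```

The two plans agree, but the first ships one-column results instead of full fragments.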

Transformation rules for distributed systems

Vertically fragmented tables:

– Rules: hint, look at the projection rules

Optimization in Par & Distr

• Cost model is changed!!!

– Network transport is a dominant cost factor

• The facilities for query processing are not homogeneously distributed

– Light-resource systems form a bottleneck

– Need for dynamic load scheduling

Simple Distributed Join Processing

• Consider the following relational algebra expression in which the three relations are neither replicated nor fragmented

$account \bowtie depositor \bowtie branch$

• account is stored at site S1

• depositor at S2

• branch at S3

• For a query issued at site SI, the system needs to produce the result at site SI

Possible Query Processing Strategies

• Ship copies of all three relations to site SI and choose a strategy for processing the entire query locally at site SI.

• Ship a copy of the account relation to site S2 and compute $temp_1 = account \bowtie depositor$ at S2. Ship temp1 from S2 to S3, and compute $temp_2 = temp_1 \bowtie branch$ at S3. Ship the result temp2 to SI.

• Devise similar strategies, exchanging the roles of S1, S2, S3

• Must consider the following factors:

– amount of data being shipped

– cost of transmitting a data block between sites

– relative processing speed at each site

Semijoin Strategy

• Let r1 be a relation with schema R1 stored at site S1

Let r2 be a relation with schema R2 stored at site S2

• Evaluate the expression $r_1 \bowtie r_2$ and obtain the result at S1.

1. Compute $temp_1 \leftarrow \Pi_{R_1 \cap R_2}(r_1)$ at S1.

2. Ship temp1 from S1 to S2.

3. Compute $temp_2 \leftarrow r_2 \bowtie temp_1$ at S2

4. Ship temp2 from S2 to S1.

5. Compute $r_1 \bowtie temp_2$ at S1. This is the same as $r_1 \bowtie r_2$.
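The five steps can be traced in a short sketch. It assumes relations are lists of dicts and simulates shipping by counting tuples; the function name, the `common` parameter (standing for the attributes in R1 ∩ R2) and the sample data are hypothetical.

```python
def semijoin_strategy(r1, r2, common):
    """Run the five semijoin steps; r1 lives at S1, r2 at S2."""
    def key(t):
        return tuple(t[a] for a in common)

    temp1 = {key(t) for t in r1}                  # step 1, at S1: project r1 on R1 ∩ R2
    shipped_to_s2 = len(temp1)                    # step 2: ship temp1 to S2
    temp2 = [t for t in r2 if key(t) in temp1]    # step 3, at S2: r2 joined with temp1
    shipped_to_s1 = len(temp2)                    # step 4: ship temp2 to S1
    result = [{**a, **b} for a in r1 for b in temp2
              if key(a) == key(b)]                # step 5, at S1: r1 joined with temp2
    return result, shipped_to_s2, shipped_to_s1

r1 = [{"k": 1, "x": "a"}, {"k": 2, "x": "b"}]
r2 = [{"k": 2, "y": "p"}, {"k": 3, "y": "q"}, {"k": 4, "y": "r"}]
result, to_s2, to_s1 = semijoin_strategy(r1, r2, ["k"])
```

Here only one of the three r2 tuples travels back to S1; the strategy pays off when few tuples of r2 contribute to the join.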

Formal Definition

• The semijoin of r1 with r2 is denoted by:

$r_1 \ltimes r_2$

• It is defined by:

$\Pi_{R_1}(r_1 \bowtie r_2)$

• Thus, $r_1 \ltimes r_2$ selects those tuples of r1 that contributed to $r_1 \bowtie r_2$.

• In step 3 above, $temp_2 = r_2 \ltimes r_1$.

• For joins of several relations, the above strategy can be extended to a series of semijoin steps.

Join Strategies that Exploit Parallelism

• Consider $r_1 \bowtie r_2 \bowtie r_3 \bowtie r_4$ where relation ri is stored at site Si. The result must be presented at site S1.

• r1 is shipped to S2 and $r_1 \bowtie r_2$ is computed at S2; simultaneously r3 is shipped to S4 and $r_3 \bowtie r_4$ is computed at S4.

• S2 ships tuples of $(r_1 \bowtie r_2)$ to S1 as they are produced; S4 ships tuples of $(r_3 \bowtie r_4)$ to S1.

• Once tuples of $(r_1 \bowtie r_2)$ and $(r_3 \bowtie r_4)$ arrive at S1, $(r_1 \bowtie r_2) \bowtie (r_3 \bowtie r_4)$ is computed in parallel with the computation of $(r_1 \bowtie r_2)$ at S2 and the computation of $(r_3 \bowtie r_4)$ at S4.
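A rough sketch of the independent-parallel part of this strategy follows. It assumes two threads standing in for sites S2 and S4, lists of dicts as relations and a hypothetical `natural_join` helper. Unlike the slide, it waits for the complete partial results instead of streaming tuples to S1 as they are produced.

```python
from concurrent.futures import ThreadPoolExecutor

def natural_join(left, right):
    """Nested-loop natural join on the attributes the two inputs share."""
    out = []
    for a in left:
        for b in right:
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                out.append({**a, **b})
    return out

r1 = [{"a": 1, "b": 1}, {"a": 2, "b": 2}]
r2 = [{"b": 1, "c": 1}, {"b": 2, "c": 2}]
r3 = [{"c": 1, "d": 1}, {"c": 2, "d": 2}]
r4 = [{"d": 1, "e": 1}]

with ThreadPoolExecutor(max_workers=2) as pool:
    at_s2 = pool.submit(natural_join, r1, r2)   # r1 joined with r2 "at S2"
    at_s4 = pool.submit(natural_join, r3, r4)   # r3 joined with r4 "at S4", concurrently
    result = natural_join(at_s2.result(), at_s4.result())   # final join "at S1"
```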

Query plan generation

• Apers-Aho-Hopcroft

– Hill-climber: repeatedly split the multi-join query into fragments and optimize its subqueries independently

• Apply centralized algorithms and rely on cost-model to avoid expensive query execution plans.

Query evaluators

Query evaluation strategy

• Pipe-line query evaluation strategy

– Evaluation:

• Oriented towards OLTP applications

– Granule size of data interchange

• Items produced one at a time

• No temporary files

– Choice of intermediate buffer size allocations

• Query executed as one process

• Generic interface, sufficient to add the iterator primitives for the new containers.

• CPU intensive

• Amenable to parallelization

Query evaluation strategy

• Pipe-line query evaluation strategy

– Called Volcano query processing model

– Standard in commercial systems and MySQL

• Basic algorithm:

– Demand-driven evaluation of query tree.

– Operators exchange data in units such as records

– Each operator supports the following interfaces: open, next, close

• open() at top of tree results in cascade of opens down the tree.

• An operator getting a next() call may recursively make next() calls from within to produce its next answer.

• close() at top of tree results in cascade of closes down the tree

Volcano Refresher

Query

SELECT name, salary*.19 AS tax

FROM employee

WHERE age > 25

Try to maximize performance

Volcano Refresher

Operators

Iterator interface

– open()

– next(): tuple

– close()
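As a minimal sketch of the iterator interface, the three operators below evaluate a query of the form SELECT name, salary*.19 AS tax FROM employee WHERE age > 25. The class names, the use of `None` as the end-of-stream marker and the sample table are assumptions for illustration.

```python
class Scan:
    """Leaf operator: hands out stored tuples one at a time."""
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0
    def next(self):
        if self.pos == len(self.rows):
            return None                      # end of stream
        self.pos += 1
        return self.rows[self.pos - 1]
    def close(self):
        pass

class Select:
    """Pulls from its child until a tuple satisfies the predicate."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def open(self):
        self.child.open()                    # open() cascades down the tree
    def next(self):
        while (row := self.child.next()) is not None:
            if self.pred(row):
                return row
        return None
    def close(self):
        self.child.close()                   # close() cascades down the tree

class Project:
    """Applies a per-tuple function to whatever its child delivers."""
    def __init__(self, child, fn):
        self.child, self.fn = child, fn
    def open(self):
        self.child.open()
    def next(self):
        row = self.child.next()
        return None if row is None else self.fn(row)
    def close(self):
        self.child.close()

employee = [{"name": "ann", "salary": 1000, "age": 30},
            {"name": "bob", "salary": 2000, "age": 20}]
plan = Project(Select(Scan(employee), lambda r: r["age"] > 25),
               lambda r: {"name": r["name"], "tax": r["salary"] * 0.19})
plan.open()
out = []
while (t := plan.next()) is not None:        # demand-driven: the root pulls tuples
    out.append(t)
plan.close()
```

Every result tuple costs one call per operator in the tree, which is the per-tuple interpretation overhead the later slides refer to.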

Volcano paradigm

• The Volcano model is based on a simple pull-based iterator model for programming relational operators.

• The Volcano model minimizes the amount of intermediate storage

• The Volcano model is CPU intensive and inefficient

MonetDB paradigm

• The MonetDB kernel is a programmable relational algebra machine

• Relational operators operate on ‘array’-like structures

Try to use a simple software pattern

Query evaluation strategy

• Materialized evaluation strategy

– Used in MonetDB

– Basic algorithm:

• for each relational operator produce the complete intermediate result using materialized operands

– Evaluation:

• Oriented towards decision support queries

• Limited internal administration and dependencies

• Basis for multi-query optimization strategy

• Memory intensive

• Amenable to distributed/parallel processing
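The materialized strategy can be contrasted with pipelining in a small column-at-a-time sketch. It assumes a table stored as one Python list per attribute and evaluates a filter on age followed by salary * 0.19; this is an illustration of the idea, not MonetDB's actual operators.

```python
# Hypothetical column layout: one list per attribute of employee.
name = ["ann", "bob", "cy"]
salary = [1000, 2000, 3000]
age = [30, 20, 40]

# Each operator consumes whole columns and materializes a complete result
# before the next operator starts.
qualifying = [i for i, a in enumerate(age) if a > 25]   # selection -> positions
names_out = [name[i] for i in qualifying]               # fetch the name column
tax_out = [salary[i] * 0.19 for i in qualifying]        # fetch salary, then multiply
result = list(zip(names_out, tax_out))
```

Each step is a tight loop over one array, at the price of holding every intermediate result in memory.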

[Figure: MonetDB software stack, with boxes labelled SQL, XQuery, MAL, MonetDB Server and MonetDB Kernel. The slide shows the SQL query below together with its MAL plan.]

select count(*) from photoobjall;

function user.s3_1():void;
    X1:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",0);
    X6:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",1);
    X9:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",2);
    X13:bat[:oid,:oid] := sql.bind_dbat("sys","photoobjall",1);
    X8 := algebra.kunion(X1,X6);
    X11 := algebra.kdifference(X8,X9);
    X12 := algebra.kunion(X11,X9);
    X14 := bat.reverse(X13);
    X15 := algebra.kdifference(X12,X14);
    X16 := calc.oid(0@0);
    X18 := algebra.markT(X15,X16);
    X19 := bat.reverse(X18);
    X20 := aggr.count(X19);
    sql.exportValue(1,"sys.","count_","int",32,0,6,X20,"");
end s3_1;

Operator implementation

• All algebraic operators materialize their result

• Local optimization decisions

• Heavy use of code expansion to reduce cost

– 55 selection routines

– 149 unary operations

– 335 join/group operations

– 134 multi-join operations

– 72 aggregate operations

Micro-benchmark

System          In milliseconds/10K    Fixed cost in ms
MonetDB/SQL     0.34 N                 44
MySQL           25.1 N                 238
PostgreSQL      10.6 N                 1230
Commercial 1    39.0 N                 800
Commercial 2    17 N                   150

• Keeping the query result in a new table is often too expensive

select * into tmp from tapestry where attr1>=0 and attr1 <=@range

create table tmp( attr0 int, attr1 int);
insert into tmp select * from tapestry where attr1>=0 and attr1 <=@range;

Multi-column tapestry

Experiments ran on Athlon 1.4, Linux

[Plot: time in ms against #joins, with one curve for a commercial system and one for MonetDB/SQL.]

• A column store should be designed from scratch to benefit from its characteristics

• Simulation of a column store on top of an n-ary system using the Volcano model does not work

Try to maximize performance

[Diagram: Paste / Present / Potency, with the components Execution Paradigm, Database Structures and Query optimizer.]

• Applications have different characteristics

• Platforms have different characteristics

• The actual state of computation is crucial

• A generic all-encompassing optimizer cost-model does not work

Try to avoid the search space trap

[Figure: MonetDB software stack, with boxes labelled SQL, XQuery, MAL (twice), MonetDB Server and MonetDB Kernel.]

Operational optimizer:

– Exploit everything you know at runtime

– Re-organize if necessary

Try to disambiguate decisions

Strategic optimizer:

– Exploit the semantics of the language

– Rely on heuristics

[Figure: the MonetDB software stack with a Tactical Optimizer box added next to the MonetDB Server and MonetDB Kernel. Two MAL fragments are shown on the slide.]

y1:bat[:oid,:dbl] := bpm.take("sys_photoobjall_ra");
y2 := bpm.new(:oid,:oid);
barrier rs := bpm.newIterator(y1,A0,A1);
    t1 := algebra.uselect(rs,A0,A1);
    bpm.addSegment(y2,t1);
    redo rs := bpm.hasMoreElements(y1,A0,A1);
exit rs;

x1:bat[:oid,:dbl] := sql.bind("sys","photoobjall","ra",0);
x14 := algebra.uselect(x1,A0,A1);

Tactical MAL optimizer:

– No changes in front-ends and no direct human guidance

– Minimal changes in the engine

Code Inliner. Constant Expression Evaluator. Accumulator Evaluations.

Strength Reduction. Common Term Optimizer.

Join Path Optimizer. Ranges Propagation. Operator Cost Reduction. Foreign Key handling. Aggregate Groups.

Code Parallelizer. Replication Manager.

Result Recycler.

MAL Compiler. Dynamic Query Scheduler. Memo-based Execution.

Vector Execution.

Alias Removal. Dead Code Removal. Garbage Collector.

Execution paradigms

• The MonetDB kernel is set up to accommodate different execution engines

• The MonetDB assembler program is

– Interpreted in the order presented

– Interpreted in a dataflow driven manner

– Compiled into a C program

– Vectorised processing

• X100 project

No data from persistent store to the memory trash

MonetDB/x100

Combine Volcano model with vector processing.

All vectors together should fit the CPU cache

Vectors are compressed

Optimizer should tune this, given the query characteristics.

[Figure: X100 architecture, with boxes labelled X100 query engine, CPU cache, RAM, ColumnBM (buffer manager) and networked ColumnBM-s.]
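The vector-at-a-time idea sits between tuple-at-a-time pipelining and full materialization, and can be sketched as follows. The function name, the tiny `vector_size` and the filter/multiply query are assumptions for illustration; this is not X100's implementation, and the compression mentioned above is left out.

```python
def vectorized_tax(salary, age, vector_size=4):
    """Evaluate salary * 0.19 for rows with age > 25, one vector at a time."""
    out = []
    for start in range(0, len(age), vector_size):
        age_vec = age[start:start + vector_size]        # one small vector per column
        sal_vec = salary[start:start + vector_size]
        keep = [i for i, a in enumerate(age_vec) if a > 25]   # selection primitive
        out.extend(sal_vec[i] * 0.19 for i in keep)           # map primitive
    return out

salary = [100] * 10
age = [20, 30] * 5
small_vectors = vectorized_tax(salary, age, vector_size=3)
one_big_vector = vectorized_tax(salary, age, vector_size=100)
```

The result does not depend on the vector size; the size only decides how much intermediate data is live at once, which is the knob the slide says the optimizer should tune.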

• Varying the vector size on TPC-H query 1

[Plot: effect of the vector size. Labels on the plot: mysql, oracle, db2; X100; MonetDB; low IPC, overhead; RAM bandwidth bound.]

• Vectorized-Volcano processing can be used for both multi-core and distributed processing

• The architecture and the parameters are influenced heavily by

– Hardware characteristics

– Data distribution to compress columns

Try to maximize performance

[Diagram: Paste / Present / Potency, with the database structures Cracking; B-tree, Hash Indices; Materialized Views.]

• Indices in database systems focus on:

– All tuples are equally important for fast retrieval

– There are ample resources to maintain indices

• MonetDB cracks the database into pieces based on actual query load

Find a trusted fortune teller

Cracking algorithms

Physical reorganization happens per column based on selection predicates.

Split a piece of a column in two new pieces

[Figure: a column piece split into the two pieces A<10 and A>=10.]

Cracking algorithms

Physical reorganization happens per column

Split a piece of a column in two new pieces

Split a piece of a column in three new pieces

[Figure: a piece split in two (A<10, A>=10) and a piece split in three (A<5, 5<A<10, A>=10).]

Cracking example

[Animation over slides 54-82: successive snapshots of one column A, listed on the first snapshot as 3, 8, 6, 2, 12, 13, 4, 17, 15, next to a copy that is reorganized step by step.]

• select A>5 and A<10: the copy is cracked into pieces labelled <=5, >5 and <10, and >=10.

• Improve data access for future queries

• select A>3 and A<14: the existing pieces are refined further; the snapshots add the piece labels <=3 and >3.

Page 83: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Cracking example

3

8

6

2

15

13

4

17

12

select A>5 and A<10

3

8

6

2

15

13

4

17

12

<= 5

>= 10

> 5

Improve data access for

future queries

select A>3 and A<14

3

8

6

2

15

13

4

17

12

> 3

>= 10

> 5

<=3

Page 84: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Cracking example

3

8

6

2

15

13

4

17

12

select A>5 and A<10

3

8

6

2

15

13

4

17

12

<= 5

>= 10

> 5

Improve data access for

future queries

select A>3 and A<14

3

8

6

2

15

13

4

17

12

> 3

>= 10

> 5

<=3

Page 85: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Cracking example

3

8

6

2

15

13

4

17

12

select A>5 and A<10

3

8

6

2

15

13

4

17

12

<= 5

>= 10

> 5

Improve data access for

future queries

select A>3 and A<14

3

8

6

2

15

13

4

1712

> 3

>= 10

> 5

<=3

Page 86: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Cracking example

3

8

6

2

15

13

4

17

12

select A>5 and A<10

3

8

6

2

15

13

4

17

12

<= 5

>= 10

> 5

Improve data access for

future queries

select A>3 and A<14

3

8

6

2

15

13

4

1712

> 3

>= 10

> 5

<=3

Page 87: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Cracking example

3

8

6

2

15

13

4

17

12

select A>5 and A<10

3

8

6

2

15

13

4

17

12

<= 5

>= 10

> 5

Improve data access for

future queries

select A>3 and A<14

3

8

6

2

15

13

4

17

12

> 3

>= 10

> 5

<=3

Page 88: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Cracking example

3

8

6

2

15

13

4

17

12

select A>5 and A<10

3

8

6

2

15

13

4

17

12

<= 5

>= 10

> 5

Improve data access for

future queries

select A>3 and A<14

3

8

6

2

15

13

4

17

12

> 3

>= 10

> 5

<=3

Page 89: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Cracking example

3

8

6

2

15

13

4

17

12

select A>5 and A<10

3

8

6

2

15

13

4

17

12

<= 5

>= 10

> 5

Improve data access for

future queries

select A>3 and A<14

3

8

6

2

15

13

4

17

12

> 3

>= 10

> 5

<=3

Page 90: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Cracking example

3

8

6

2

15

13

4

17

12

select A>5 and A<10

3

8

6

2

15

13

4

17

12

<= 5

>= 10

> 5

Improve data access for

future queries

select A>3 and A<14

3

8

6

2

15

13

4

17

12

> 3

>= 14

> 5

<=3

>=10

Page 91: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Cracking example

3

8

6

2

15

13

4

17

12

select A>5 and A<10

3

8

6

2

15

13

4

17

12

<= 5

>= 10

> 5

Improve data access for

future queries

select A>3 and A<14

3

8

6

2

15

13

4

17

12

> 3

>= 14

> 5

<=3

>=10

Page 92: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Cracking example

Column A: 3, 8, 6, 2, 15, 13, 4, 17, 12

select A>5 and A<10

Copy of column A, cracked into three pieces: <= 5, > 5, >= 10

Improve data access for future queries

select A>3 and A<14

The same copy, cracked further into five pieces: <= 3, > 3, > 5, >= 10, >= 14

The more we crack the more we learn
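The first cracking step in the example can be sketched in Python as a single-pass three-way partition. This is only an illustration under assumptions: the cracker column is a plain in-memory list, and the name `crack_in_three` is hypothetical rather than taken from the slides.

```python
def crack_in_three(column, low, high):
    """Reorganize `column` in place into three contiguous pieces:
    values <= low, values with low < v < high, values >= high.
    Returns (start, end) so that column[start:end] is the middle piece."""
    lt, i, gt = 0, 0, len(column) - 1
    while i <= gt:
        v = column[i]
        if v <= low:                      # belongs to the first piece
            column[lt], column[i] = column[i], column[lt]
            lt += 1
            i += 1
        elif v >= high:                   # belongs to the last piece
            column[i], column[gt] = column[gt], column[i]
            gt -= 1
        else:                             # qualifies for the query
            i += 1
    return lt, gt + 1


# The slide's column and first query: select A>5 and A<10
cracker_column = [3, 8, 6, 2, 15, 13, 4, 17, 12]
start, end = crack_in_three(cracker_column, 5, 10)
result = cracker_column[start:end]        # contiguous slice with the answer
```

The query result is a contiguous slice of the reorganized copy, and the two boundaries can be remembered so that later queries start from smaller pieces.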

Page 93: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Design

The first time a range query is posed on an attribute A, a cracking DBMS makes a copy of column A, called the cracker column of A.

A cracker column is continuously physically reorganized based on the queries that touch attribute A, such that each result lies in a contiguous space.

For each cracker column, there is a cracker index

Cracker Index

Cracker Column
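One way to picture the cracker column together with its cracker index is the Python sketch below. It rests on assumptions: the index is kept as two parallel sorted lists (`bounds`, `starts`), positions are 0-based, and `select(low, high)` uses a half-open range, so the slide's `A>5 and A<10` becomes `select(6, 10)` for integer values. The class and method names are hypothetical.

```python
import bisect


class CrackerColumn:
    """A copy of a base column plus a cracker index of piece boundaries."""

    def __init__(self, base):
        self.values = list(base)  # cracker column: a copy of the base column
        self.bounds = []          # sorted pivot values
        self.starts = []          # starts[k]: first position with v >= bounds[k]

    def _crack(self, pivot):
        """Make sure a boundary exists at `pivot`; return its position."""
        k = bisect.bisect_left(self.bounds, pivot)
        if k < len(self.bounds) and self.bounds[k] == pivot:
            return self.starts[k]         # known boundary: index lookup only
        # Reorganize only the single piece that contains `pivot`.
        lo = self.starts[k - 1] if k > 0 else 0
        hi = self.starts[k] if k < len(self.starts) else len(self.values)
        i, j = lo, hi - 1
        while i <= j:
            if self.values[i] < pivot:
                i += 1
            else:
                self.values[i], self.values[j] = self.values[j], self.values[i]
                j -= 1
        self.bounds.insert(k, pivot)      # the index learns a new boundary
        self.starts.insert(k, i)
        return i

    def select(self, low, high):
        """Return the contiguous slice holding low <= v < high."""
        a = self._crack(low)
        b = self._crack(high)
        return self.values[a:b]


col = CrackerColumn([3, 8, 6, 2, 15, 13, 4, 17, 12])
first = col.select(6, 10)    # select A>5 and A<10 on integers
second = col.select(4, 14)   # select A>3 and A<14 on integers
```

After the two queries the index holds four boundaries, so the column is split into five pieces, and the second query only had to reorganize the two outer pieces left by the first one.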

Page 94: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

A simple range query

Try to avoid useless investments

Page 95: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

TPC-H query 6

Try to avoid useless investments

Page 96: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

• Cracking is easy in a column store and is part of the critical execution path

• Cracking works under high volume updates

Try to avoid useless investments

Page 97: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Updates

Base columns are updated as usual

We need to update the cracker column and the cracker index

• Efficiently

• Maintain the self-organization properties

Two issues:

• When

• How

Page 98: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

When to propagate updates in cracking

Follow the workload to maintain self-organization

Updates become part of query processing

When an update arrives, it is not applied immediately

For each cracker column there is a pending insertions column and a pending deletions column

Pending updates are applied only when a query needs the specific values

Page 99: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Update-aware select

We extended the cracker select operator to apply the needed updates before cracking

The select operator:

1. Search the pending insertions column

2. Search the pending deletions column

3. If Step 1 or 2 finds tuples, run an update algorithm

4. Search the cracker index

5. Physically reorganize the cracker column

6. Update the cracker index

7. Return a slice of the cracker column
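The seven steps can be sketched in Python as follows. This is a hedged illustration, not the operator from the slides: all names are hypothetical, ranges are half-open (`low <= v < high`), positions are 0-based, and the "update algorithm" of step 3 is a deliberately naive shift-based merge chosen only to keep the sketch short.

```python
import bisect


class UpdateAwareColumn:
    def __init__(self, base):
        self.values = list(base)   # cracker column
        self.bounds = []           # cracker index: sorted pivots
        self.starts = []           # starts[k]: first position with v >= bounds[k]
        self.pending_ins = []      # pending insertions column
        self.pending_del = []      # pending deletions column

    def _piece(self, v):
        """Return (k, lo, hi): value v belongs to the piece values[lo:hi]."""
        k = bisect.bisect_right(self.bounds, v)
        lo = self.starts[k - 1] if k > 0 else 0
        hi = self.starts[k] if k < len(self.starts) else len(self.values)
        return k, lo, hi

    def _crack(self, pivot):
        k = bisect.bisect_left(self.bounds, pivot)
        if k < len(self.bounds) and self.bounds[k] == pivot:
            return self.starts[k]
        _, i, hi = self._piece(pivot)
        j = hi - 1
        while i <= j:                      # partition one piece around pivot
            if self.values[i] < pivot:
                i += 1
            else:
                self.values[i], self.values[j] = self.values[j], self.values[i]
                j -= 1
        self.bounds.insert(k, pivot)
        self.starts.insert(k, i)
        return i

    def _merge_insert(self, v):
        k, lo, _ = self._piece(v)
        self.values.insert(lo, v)          # naive: shifts everything after lo
        for m in range(k, len(self.starts)):
            self.starts[m] += 1

    def _merge_delete(self, v):
        k, lo, hi = self._piece(v)
        for p in range(lo, hi):            # pieces are unordered: scan the piece
            if self.values[p] == v:
                del self.values[p]
                for m in range(k, len(self.starts)):
                    self.starts[m] -= 1
                return

    def select(self, low, high):
        ins = [v for v in self.pending_ins if low <= v < high]    # step 1
        dels = [v for v in self.pending_del if low <= v < high]   # step 2
        for v in ins:                                             # step 3
            self._merge_insert(v)
            self.pending_ins.remove(v)
        for v in dels:
            self._merge_delete(v)
            self.pending_del.remove(v)
        a = self._crack(low)               # steps 4-6: look up, reorganize,
        b = self._crack(high)              # and extend the cracker index
        return self.values[a:b]            # step 7


col = UpdateAwareColumn([7, 2, 10, 29, 25, 31, 57, 42, 53])
col.select(23, 36)                         # an earlier query cracks the column
col.pending_ins = [5, 9, 16, 35]
col.pending_del = [10]
answer = col.select(7, 15)                 # merges only 9 and the delete of 10
```

Pending updates outside the queried range stay in their pending columns until a later query asks for them.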

Page 100: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Merging

Cracker column: 7, 2, 10, 29, 25, 31, 57, 42, 53

Cracker index:

• Start position: 1, values: >1

• Start position: 4, values: >12

• Start position: 7, values: >35

Insert a new tuple with value 9

The new tuple belongs to the blue piece (the piece with values >1)

Page 101: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Merging

Cracker column: 7, 2, 10, 29, 25, 31, 57, 42, 53

New tuple: 9

Cracker index after the insertion:

• Start position: 1, values: >1

• Start position: 5, values: >12

• Start position: 8, values: >35

Insert a new tuple with value 9

The new tuple belongs to the blue piece (the piece with values >1)

Pieces in the cracker column are ordered

Tuples inside a piece are not ordered

Shifting is not a viable solution

Page 102: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Merging by Hopping

Cracker column: 7, 2, 10, 29, 25, 31, (free slot), 42, 53, 57

New tuple: 9

Cracker index:

• Start position: 1, values: >1

• Start position: 4, values: >12

• Start position: 8, values: >35

Insert a new tuple with value 9

We need to make enough room to fit the new tuples
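Because tuples inside a piece are unordered, room can be made by moving one tuple per piece instead of shifting whole pieces. The Python sketch below illustrates that idea under assumptions: `hop_insert` is a hypothetical name, `starts` holds 0-based start positions (the slides count from 1), and one free slot is appended at the end of the column.

```python
def hop_insert(values, starts, piece, new_value):
    """Insert new_value into piece number `piece` of a cracked column.
    values: the cracker column; starts[k]: 0-based start position of piece k."""
    values.append(None)                 # one free slot at the end of the column
    free = len(values) - 1
    # Walk backwards over the pieces that follow the target piece.
    for k in range(len(starts) - 1, piece, -1):
        values[free] = values[starts[k]]   # first tuple hops to the piece's end
        free = starts[k]                   # its old slot is now the free one
        starts[k] += 1                     # the piece starts one position later
    values[free] = new_value            # free slot sits at the end of the target


values = [7, 2, 10, 29, 25, 31, 57, 42, 53]
starts = [0, 3, 6]                      # slide positions 1, 4, 7
hop_insert(values, starts, 0, 9)        # 9 belongs to the first piece
```

Each following piece moves a single tuple, so the cost depends on the number of pieces rather than on the number of tuples.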

Page 103: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

Merge Gradually

A query merges only the qualifying values, i.e., only the values that it needs for a correct and complete result

We avoid the large peaks, but the average cost increases significantly

[Chart: Merge Completely vs. Merge Gradually]

Page 104: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

The Ripple

Touch only the pieces that are relevant for the current query

Page 107: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

The Ripple

Cracker column: 7, 2, 10, 29, 25, 31, 57, 42, 53

Cracker index:

• Start position: 1, values: >1

• Start position: 4, values: >22

• Start position: 7, values: >35

Pending insertions: 5, 9, 16, 35

Select 7<= A< 15

Touch only the pieces that are relevant for the current query

Page 118: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

The Ripple

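Cracker column: 7, 2, 10, 9, 25, 31, 57, 42, 53

Cracker index:

• Start position: 1, values: >1

• Start position: 5, values: >22

• Start position: 7, values: >35

Pending insertions: 5, 16, 35, 29

The qualifying pending value 9 takes the slot of 29, the first tuple of the next piece, and 29 moves to the pending insertions

Touch only the pieces that are relevant for the current query

Immediately make room for the new tuples

Avoid shifting down non-interesting pieces

Select 7<= A< 15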
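The ripple step shown above can be sketched in Python as below. This is an illustration under assumptions, not the authors' code: `ripple_insert` and its parameters are hypothetical, positions are 0-based, `stop` is the index of the first piece the query does not need, and that piece is assumed to be non-empty.

```python
def ripple_insert(values, starts, pending, piece, stop, new_value):
    """Insert new_value into piece `piece`, touching only pieces before `stop`.
    Room is made by displacing one tuple of piece `stop` into `pending`."""
    if stop < len(starts):
        end = starts[stop + 1] if stop + 1 < len(starts) else len(values)
        assert starts[stop] < end, "sketch assumes a non-empty piece after the range"
        free = starts[stop]
        pending.append(values[free])    # displaced tuple becomes a pending insertion
        starts[stop] += 1
    else:
        values.append(None)             # no piece after the range: grow the column
        free = len(values) - 1
    # Hop backwards through the pieces inside the query range only.
    for k in range(stop - 1, piece, -1):
        values[free] = values[starts[k]]
        free = starts[k]
        starts[k] += 1
    values[free] = new_value


values = [7, 2, 10, 29, 25, 31, 57, 42, 53]
starts = [0, 3, 6]                      # slide positions 1, 4, 7
pending = [5, 16, 35]                   # 9 qualifies for 7 <= A < 15 and was taken out
ripple_insert(values, starts, pending, piece=0, stop=1, new_value=9)
```

The last piece (values >35) is never touched, which is the point of the slide's "avoid shifting down non-interesting pieces".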

Page 120: Distributed Query Processing. Agenda Recap of query optimization Transformation rules for P&D systems Memoization Queries in heterogeneous systems Query.

The Ripple

Maintain high performance through the whole query sequence in a self-organizing way

[Chart: Merge Gradually, Merge Completely, Merge Ripple]