
CSC 261/461 – Database Systems
Lecture 18

Spring 2017
MW 3:25 pm – 4:40 pm

January 18 – May 3
Dewey 1101

Announcements

• Project 3 is out (MongoDB): Due on: 04/12
• Project 2 Part 2 (Optional): Due on: 04/08
• Project 1 Milestone 3: Due on: 04/05
• Two Polls
  – Final Exam
    • We will have the Final on May 08
    • Although
  – Group vs. Individual Projects
    • (Mini) Projects 3 and 4 will both be group projects

CSC 261, Spring 2017, UR

Topics for Today

• Query Optimization (Chapter 19)


We will cover

1. Logical Optimization

2. Physical Optimization


Logical vs. Physical Optimization

• Logical optimization:
  – Find equivalent plans that are more efficient
  – Intuition: Minimize # of tuples at each step by changing the order of RA operators

• Physical optimization:
  – Find algorithm with lowest IO cost to execute our plan
  – Intuition: Calculate based on physical parameters (buffer size, etc.) and estimates of data size (histograms)

SQL Query → Relational Algebra (RA) Plan → Optimized RA Plan → Execution

1. LOGICAL OPTIMIZATION


What you will learn about in this section

1. Optimization of RA Plans


RDBMS Architecture

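How does a SQL engine work?

SQL Query → Relational Algebra (RA) Plan → Optimized RA Plan → Execution

• SQL Query: Declarative query (from user)
• Relational Algebra (RA) Plan: Translate to relational algebra expression
• Optimized RA Plan: Find logically equivalent, but more efficient, RA expression
• Execution: Execute each operator of the optimized plan!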

RDBMS Architecture


Relational Algebra allows us to translate declarative (SQL) queries into precise and optimizable expressions!

Recall: Logical Equivalence of RA Plans

• Given relations R(A,B) and S(B,C):

– Here, projection & selection commute:
  • σ_{A=5}(Π_{A}(R)) = Π_{A}(σ_{A=5}(R))

– What about here?
  • σ_{A=5}(Π_{B}(R)) ?= Π_{B}(σ_{A=5}(R))

We’ll look at this in more depth later in the lecture…
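The two cases above can be checked on a toy relation. This is only a sketch: the data, the helper names `select` and `project`, and the constant 5 in the selection are made up for illustration, and relations are modeled as lists of dicts with set semantics.

```python
# Toy relation R(A, B) as a list of dicts; the data is made up for illustration.
R = [{"A": 5, "B": 1}, {"A": 5, "B": 2}, {"A": 7, "B": 1}]

def select(rows, attr, value):
    """sigma_{attr = value}: keep rows whose attr equals value."""
    return [r for r in rows if r[attr] == value]

def project(rows, attrs):
    """Pi_{attrs}: keep only the listed attributes and drop duplicates."""
    seen = []
    for r in rows:
        t = {a: r[a] for a in attrs}
        if t not in seen:
            seen.append(t)
    return seen

# Case 1: selection on A commutes with projection onto A.
lhs = select(project(R, ["A"]), "A", 5)
rhs = project(select(R, "A", 5), ["A"])
same_on_A = lhs == rhs

# Case 2: projecting onto B first discards A, so the later selection on A
# has nothing to test; in this model that surfaces as a KeyError.
try:
    select(project(R, ["B"]), "A", 5)
    select_after_project_B_ok = True
except KeyError:
    select_after_project_B_ok = False

# The other order is well defined: select on A, then project onto B.
selection_then_projection_B = project(select(R, "A", 5), ["B"])
```

In this model the second pair of expressions is not interchangeable: once A has been projected away, the selection on A can no longer be applied.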

RDBMS Architecture


We’ll now look at how to optimize these plans

Note: We can visualize the plan as a tree

Π_{B}(R(A,B) ⋈ S(B,C))

        Π_{B}
          |
          ⋈
         / \
   R(A,B)   S(B,C)

Bottom-up tree traversal = order of operation execution!
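A minimal sketch of the bottom-up idea, assuming a hypothetical `PlanNode` class and toy data (none of these names come from the slides): each node evaluates its children first and then applies its own operator, so the recorded trace is exactly a post-order traversal of the tree.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class PlanNode:
    name: str
    op: Callable[..., list]                      # combines the children's outputs
    children: List["PlanNode"] = field(default_factory=list)

def execute(node, trace):
    """Post-order traversal: evaluate the children first, then this operator."""
    inputs = [execute(child, trace) for child in node.children]
    trace.append(node.name)
    return node.op(*inputs)

def natural_join_on_B(r, s):
    """Join two lists of dicts on their shared attribute B."""
    return [{**x, **y} for x in r for y in s if x["B"] == y["B"]]

def project_B(rows):
    """Projection onto B, returned as a sorted list of distinct B values."""
    return sorted({row["B"] for row in rows})

# Leaves produce the (made-up) base relations; inner nodes are RA operators.
R = PlanNode("R(A,B)", lambda: [{"A": 1, "B": 10}, {"A": 2, "B": 20}])
S = PlanNode("S(B,C)", lambda: [{"B": 10, "C": "x"}, {"B": 30, "C": "y"}])
join = PlanNode("join", natural_join_on_B, [R, S])
root = PlanNode("project_B", project_B, [join])

trace = []
result = execute(root, trace)
```

After running this, `trace` lists the leaves before the join and the join before the projection, which is the execution order the tree picture implies.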

A simple plan

        Π_{B}
          |
          ⋈
         / \
   R(A,B)   S(B,C)

What SQL query does this correspond to?

Are there any logically equivalent RA expressions?

“Pushing down” projection

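[Figure: two plan trees over R(A,B) and S(B,C). The left tree applies Π_{B} after the join; the right tree also applies Π_{B} below the join.]

Why might we prefer this plan?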

Takeaways

• This process is called logical optimization

• The optimizer searches among many equivalent plans for “good plans”

• Relational algebra is an important abstraction.

RA commutators

• The basic commutators:
  – Push projection through (1) selection, (2) join
  – Push selection through (3) selection, (4) projection, (5) join
  – Also: Joins can be re-ordered!

• Note that this is not an exhaustive set of operations

This simple set of tools allows us to greatly improve the execution time of queries by optimizing RA plans!

Optimizing the SFW RA Plan

[Figure: plan tree over R(A,B), S(B,C), and T(C,D) with a selection on A and the projection Π_{A,D}]

Logical Optimization

• Heuristically, we want selections and projections to occur as early as possible in the plan
  – Terminology: “push down selections” and “push down projections.”

• Intuition: We will have fewer tuples in a plan.
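The "fewer tuples" intuition can be made concrete with a small experiment. The sketch below uses made-up data for R(A,B), S(B,C), and T(C,D) and a naive nested-loop `join` helper (both are assumptions, not from the slides); it compares how many intermediate tuples are produced when the selection A < 10 is applied last versus first.

```python
def join(left, right, attr):
    """Nested-loop natural join on a single shared attribute."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

# Made-up data: 100 R tuples, of which only A = 0..9 satisfy A < 10.
R = [{"A": a, "B": a % 5} for a in range(100)]
S = [{"B": b, "C": b} for b in range(5)]
T = [{"C": c, "D": c * 100} for c in range(5)]

# Plan 1: join everything, then select.
rs_late = join(R, S, "B")
rst_late = join(rs_late, T, "C")
late = [t for t in rst_late if t["A"] < 10]

# Plan 2: push the selection down to R, then join.
r_small = [r for r in R if r["A"] < 10]
rs_early = join(r_small, S, "B")
early = join(rs_early, T, "C")

def project_AD(rows):
    """Projection onto (A, D), sorted so the two plans can be compared."""
    return sorted((t["A"], t["D"]) for t in rows)

# Tuples materialized before the final result in each plan.
intermediate_late = len(rs_late) + len(rst_late)
intermediate_early = len(r_small) + len(rs_early)
```

Both plans return the same (A, D) pairs, but on this data the pushed-down plan materializes far fewer intermediate tuples.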

R(A,B) S(B,C) T(C,D)

SELECT R.A, S.D
FROM R, S, T
WHERE R.B = S.B
  AND S.C = T.C
  AND R.A < 10;

[Figure: plan trees for this query before and after pushing down the selection on A; the rewritten plan is Π_{A,D}(T ⋈ σ_{A<10}(R) ⋈ S).]

Optimizing RA Plan

Push down selection on A so it occurs earlier

Π_{A,D}(T ⋈ σ_{A<10}(R) ⋈ S)




Optimizing RA Plan

Push down projection so it occurs earlier

Π_{A,D}(T ⋈ Π_{A,C}(σ_{A<10}(R) ⋈ S))




Optimizing RA Plan

We eliminate B earlier!

2. PHYSICAL OPTIMIZATION


What you will learn about in this section

1. Index Selection

2. Histograms


Index Selection

Input:
– Schema of the database
– Workload description: set of (query template, frequency) pairs

Goal: Select a set of indexes that minimizes execution time of the workload.
– Cost / benefit balance: Each additional index may help with some queries, but requires updating

This is an optimization problem!

Example

Workload description:

SELECT pname
FROM Product
WHERE year = ? AND category = ? AND manufacturer = ?

Frequency: 10,000,000

SELECT pname
FROM Product
WHERE year = ? AND category = ?

Frequency: 10,000,000

Which indexes might we choose?

Example

Workload description:

SELECT pname
FROM Product
WHERE year = ? AND category = ? AND manufacturer = ?

Frequency: 100

SELECT pname
FROM Product
WHERE year = ? AND category = ?

Frequency: 10,000,000

Now which indexes might we choose? Worth keeping an index with manufacturer in its search key around?

Estimating index cost?

• Note that to frame this as an optimization problem, we first need an estimate of the cost of an index lookup

• Need to be able to estimate the costs of different indexes / index types…

We will see this mainly depends on getting estimates of result set size!

Ex: Clustered vs. Unclustered

Cost to do a range query for M entries over N-page file (P per page):

– Clustered:
  • To traverse: log_f(1.5N)
  • To scan: 1 random IO + (M−1)/P sequential IO

– Unclustered:
  • To traverse: log_f(1.5N)
  • To scan: ~ M random IO

Suppose we are using a B+ Tree index with:
• Fanout f
• Fill factor 2/3

Plugging in some numbers

To simplify:
• Random IO = ~10 ms
• Sequential IO = free

• Clustered:
  – To traverse: log_f(1.5N)
  – To scan: 1 random IO + (M−1)/P sequential IO
  – ~1 random IO = 10 ms

• Unclustered:
  – To traverse: log_f(1.5N)
  – To scan: ~ M random IO
  – ~M random IO = M × 10 ms

• If M = 1, then there is no difference!
• If M = 100,000 records, then the difference is ~17 min (100,000 × 10 ms = 1,000 s) vs. 10 ms!

If only we had good estimates of M…
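The arithmetic behind these numbers can be sketched as follows. The function names and the value P = 100 entries per page are assumptions for illustration; the 10 ms random IO and free sequential IO are the simplifications stated on the slide.

```python
import math

RANDOM_IO_MS = 10.0      # slide's simplification: one random IO ~ 10 ms
SEQUENTIAL_IO_MS = 0.0   # slide's simplification: sequential IO is free

def traverse_ios(n_pages, fanout):
    """Index levels to descend: log_f(1.5 N), assuming fill factor 2/3."""
    return math.log(1.5 * n_pages, fanout)

def clustered_scan_ms(m, per_page):
    """1 random IO to reach the first page, then (M-1)/P sequential IOs."""
    return RANDOM_IO_MS + ((m - 1) / per_page) * SEQUENTIAL_IO_MS

def unclustered_scan_ms(m):
    """Roughly one random IO per matching entry."""
    return m * RANDOM_IO_MS

P = 100  # entries per page: an assumed value, not from the slides

for m in (1, 100_000):
    c = clustered_scan_ms(m, P)
    u = unclustered_scan_ms(m)
    print(f"M={m}: clustered {c:.0f} ms, unclustered {u / 1000:.2f} s")
```

The traversal cost is the same for both index types, so only the scan term separates them, and that term is driven entirely by M.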

HISTOGRAMS & IO COST ESTIMATION


IO Cost Estimation via Histograms

• For index selection:
  – What is the cost of an index lookup?

• Also for deciding which algorithm to use:
  – Ex: To execute R ⋈ S, which join algorithm should the DBMS use?

  – What if we want to compute σ_{A>10}(R) ⋈ σ_{B=1}(S)?

• In general, we will need some way to estimate intermediate result set sizes

Histograms provide a way to efficiently store estimates of these quantities

Histograms

• A histogram is a set of value ranges (“buckets”) and the frequencies with which values in those buckets occur

• How to choose the buckets?
  – Equiwidth & Equidepth

• Turns out high-frequency values are very important

[Figure: bar chart of Frequency (axis 0 to 10) against Values 1 to 15]

How do we compute how many values between 8 and 10? (Yes, it’s obvious)

Problem: counts take up too much space!

Example

Full vs. Uniform Counts

[Figure: full counts vs. uniform counts, frequency axis 0 to 10, values 1 to 15]

How much space do the full counts (bucket_size = 1) take?

How much space do the uniform counts (bucket_size = ALL) take?

Fundamental Tradeoffs

• Want high resolution (like the full counts)

• Want low space (like uniform)

• Histograms are a compromise!

So how do we compute the “bucket” sizes?

Equi-width

[Figure: equi-width histogram, frequency axis 0 to 10, values 1 to 15]

All buckets roughly the same width

Equidepth

[Figure: equidepth histogram, frequency axis 0 to 10, values 1 to 15]

All buckets contain roughly the same number of items (total frequency)
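The two bucketing strategies can be sketched as below. Everything here is an illustrative assumption rather than the slides' method: the frequencies are made up (they are not the plotted data), values are treated as integers, the equidepth builder is a simple greedy pass, and range estimates assume values are spread uniformly inside each bucket.

```python
def equi_width(freq, n_buckets):
    """Split the sorted value domain into buckets covering equally many values."""
    values = sorted(freq)
    width = -(-len(values) // n_buckets)  # ceiling division
    buckets = []
    for i in range(0, len(values), width):
        chunk = values[i:i + width]
        buckets.append((chunk[0], chunk[-1], sum(freq[v] for v in chunk)))
    return buckets

def equi_depth(freq, n_buckets):
    """Greedily close a bucket once it holds about total / n_buckets items."""
    values = sorted(freq)
    target = sum(freq.values()) / n_buckets
    buckets, lo, count = [], values[0], 0
    for i, v in enumerate(values):
        count += freq[v]
        last = i == len(values) - 1
        if last or (count >= target and len(buckets) < n_buckets - 1):
            buckets.append((lo, v, count))
            if not last:
                lo, count = values[i + 1], 0
    return buckets

def estimate_range(buckets, lo, hi):
    """Estimate the count in [lo, hi], assuming uniformity inside each bucket."""
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total

# Made-up frequencies for values 1..15 (not the data plotted on the slides).
freq = {1: 8, 2: 4, 3: 2, 4: 3, 5: 1, 6: 2, 7: 9, 8: 1,
        9: 2, 10: 5, 11: 1, 12: 3, 13: 2, 14: 1, 15: 1}

width_hist = equi_width(freq, 3)   # buckets as (low, high, count)
depth_hist = equi_depth(freq, 3)
exact = sum(freq[v] for v in range(8, 11))
width_estimate = estimate_range(width_hist, 8, 10)
depth_estimate = estimate_range(depth_hist, 8, 10)
```

Each histogram stores three (low, high, count) triples instead of fifteen exact counts; comparing `width_estimate` and `depth_estimate` against `exact` shows the resolution given up for that space saving.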

Histograms

• Simple, intuitive and popular

• Parameters: # of buckets and type

• Can extend to many attributes (multidimensional)

Maintaining Histograms

• Histograms require that we update them!
  – Typically, you must run/schedule a command to update statistics on the database
  – Out of date histograms can be terrible!

Nasty example

[Figure: histogram over values 1 to 15, frequency axis 0 to 10]

1. We insert many tuples with value > 16
2. We do not update the histogram
3. We ask for values > 20?
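The three steps above can be replayed in a small sketch. The bucket boundaries, counts, and the 10,000 inserted tuples with value 25 are made up for illustration, and `estimate_greater_than` is a hypothetical helper that assumes uniformity within each bucket.

```python
# Histogram built before the inserts: (low, high, count) buckets over values 1..15.
# Bucket boundaries and counts are made up for illustration.
stale_histogram = [(1, 5, 18), (6, 10, 19), (11, 15, 8)]

def estimate_greater_than(buckets, threshold):
    """Estimate how many values exceed threshold, assuming uniformity within buckets."""
    total = 0.0
    for lo, hi, count in buckets:
        if lo > threshold:
            total += count
        elif hi > threshold:
            total += count * (hi - threshold) / (hi - lo + 1)
    return total

# Step 1: many tuples with value > 16 arrive (here 10,000 tuples with value 25).
table = [3] * 18 + [8] * 19 + [12] * 8 + [25] * 10_000

# Step 2: the histogram is NOT refreshed, so the optimizer still sees the old buckets.
estimated = estimate_greater_than(stale_histogram, 20)

# Step 3: ask for values > 20.
actual = sum(1 for v in table if v > 20)
```

The stale histogram knows of no values above 15, so its estimate is zero while the true answer is every newly inserted tuple.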

Compressed Histograms

• One popular approach:
  1. Store the most frequent values and their counts explicitly
  2. Keep an equiwidth or equidepth one for the rest of the values
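A minimal sketch of this two-part structure, under stated assumptions: the function names, the choice of equi-width buckets for the remainder, the per-bucket distinct-value count, and the data are all made up for illustration.

```python
from collections import Counter

def build_compressed(values, top_k, n_buckets):
    """Exact counts for the top_k most frequent values, equi-width buckets for the rest."""
    counts = Counter(values)
    frequent = dict(counts.most_common(top_k))
    rest = sorted(v for v in counts if v not in frequent)
    buckets = []
    if rest:
        width = -(-len(rest) // n_buckets)  # ceiling division
        for i in range(0, len(rest), width):
            chunk = rest[i:i + width]
            # (low, high, number of distinct values, total count)
            buckets.append((chunk[0], chunk[-1], len(chunk),
                            sum(counts[v] for v in chunk)))
    return frequent, buckets

def estimate_equal(hist, value):
    """Estimate the number of tuples with the given value."""
    frequent, buckets = hist
    if value in frequent:
        return float(frequent[value])       # stored exactly
    for lo, hi, n_distinct, count in buckets:
        if lo <= value <= hi:
            return count / n_distinct       # uniform over the bucket's distinct values
    return 0.0

# Made-up skewed data: value 7 dominates, value 1 is second, the rest are rare.
data = [7] * 900 + [1] * 50 + [2, 3, 4, 5, 6, 8, 9, 10] * 5
hist = build_compressed(data, top_k=2, n_buckets=2)
```

Storing the heavy hitters exactly keeps them from distorting the buckets they would otherwise fall into, which is why high-frequency values matter so much for estimate quality.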

People continue to try all manner of fanciness here: wavelets, graphical models, entropy models, …

Acknowledgement

• Some of the slides in this presentation are taken from the slides provided by the authors.

• Many of these slides are taken from the cs145 course offered by Stanford University.

