CSC 261/461 –Database Systems Lecture 18•Project 3 is out (MongoDB): Due on: 04/12 •Project 2...
-
CSC 261/461 – Database Systems
Lecture 18
Spring 2017
MW 3:25 pm – 4:40 pm
January 18 – May 3
Dewey 1101
-
Announcements
• Project 3 is out (MongoDB): Due on: 04/12
• Project 2 Part 2 (Optional): Due on: 04/08
• Project 1 Milestone 3: Due on: 04/05
• Two Polls
– Final Exam
• We will have the Final on May 08
• Although
– Group vs. Individual Projects
• (Mini) Projects 3 and 4 will both be group projects
CSC261,Spring2017,UR
-
Topics for Today
• Query Optimization (Chapter 19)
-
We will cover
1. Logical Optimization
2. Physical Optimization
-
Logical vs. Physical Optimization
• Logical optimization:
– Find equivalent plans that are more efficient
– Intuition: Minimize # of tuples at each step by changing the order of RA operators
• Physical optimization:
– Find algorithm with lowest IO cost to execute our plan
– Intuition: Calculate based on physical parameters (buffer size, etc.) and estimates of data size (histograms)
SQL Query → Relational Algebra (RA) Plan → Optimized RA Plan → Execution
-
1. LOGICAL OPTIMIZATION
-
What you will learn about in this section
1. Optimization of RA Plans
-
RDBMS Architecture
How does a SQL engine work?
SQL Query → Relational Algebra (RA) Plan → Optimized RA Plan → Execution
• SQL Query: Declarative query (from user)
• Relational Algebra (RA) Plan: Translate to relational algebra expression
• Optimized RA Plan: Find logically equivalent, but more efficient, RA expression
• Execution: Execute each operator of the optimized plan!
-
RDBMS Architecture
How does a SQL engine work?
SQL Query → Relational Algebra (RA) Plan → Optimized RA Plan → Execution
Relational Algebra allows us to translate declarative (SQL) queries into precise and optimizable expressions!
-
Recall: Logical Equivalence of RA Plans
• Given relations R(A,B) and S(B,C):
– Here, projection & selection commute:
• σ_{A=5}(Π_A(R)) = Π_A(σ_{A=5}(R))
– What about here?
• σ_{A=5}(Π_B(R)) ?= Π_B(σ_{A=5}(R))
We’ll look at this in more depth later in the lecture…
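A minimal sketch of the two cases above, assuming a toy in-memory model where a relation is a list of dicts. The helper names `select` and `project` and the sample data are made up for illustration and are not from the lecture.

```python
# Relations are modeled as lists of dicts (duplicates kept); the data is made up.
R = [{"A": 5, "B": 1}, {"A": 5, "B": 2}, {"A": 7, "B": 1}]

def select(rel, pred):
    """sigma: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]

def project(rel, attrs):
    """Pi: keep only the listed attributes of every tuple."""
    return [{a: t[a] for a in attrs} for t in rel]

def a_is_5(t):
    return t["A"] == 5

# Case 1: selecting on A commutes with projecting onto A.
lhs = select(project(R, ["A"]), a_is_5)
rhs = project(select(R, a_is_5), ["A"])
print(lhs == rhs)

# Case 2: projecting onto B first throws A away, so the selection on A
# can no longer be evaluated afterwards.
try:
    select(project(R, ["B"]), a_is_5)
    commutes = True
except KeyError:
    commutes = False
print(commutes)
```

In this toy model the second ordering fails because the selection refers to an attribute that the projection has already removed.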
-
RDBMS Architecture
How does a SQL engine work?
SQL Query → Relational Algebra (RA) Plan → Optimized RA Plan → Execution
We’ll look at how to then optimize these plans now
-
Note: We can visualize the plan as a tree
Π_B(R(A,B) ⋈ S(B,C))
[Figure: plan tree with Π_B at the root and R(A,B), S(B,C) as leaves]
Bottom-up tree traversal = order of operation execution!
-
A simple plan
[Figure: plan tree with Π_B at the root and R(A,B), S(B,C) as leaves]
What SQL query does this correspond to?
Are there any logically equivalent RA expressions?
-
“Pushing down” projection
Plan 1: Π_B(R(A,B) ⋈ S(B,C))
Plan 2: Π_B(Π_B(R(A,B)) ⋈ S(B,C))
Why might we prefer this plan?
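One way to see why the pushed-down plan can be preferable is to count intermediate tuples. The sketch below assumes set semantics, made-up data, and that the extra projection is applied to R before the join. The helper `join_on_b` is hypothetical.

```python
# Relations as sets of tuples (set semantics); the data is made up.
R = {(a, b) for a in range(100) for b in range(3)}   # R(A, B): 300 tuples
S = {(b, c) for b in range(3) for c in range(2)}     # S(B, C): 6 tuples

def join_on_b(r_rows, s_rows):
    """Natural join on B: r_rows end with B, s_rows start with B."""
    return {r + s[1:] for r in r_rows for s in s_rows if r[-1] == s[0]}

# Plan 1: join first, then project onto B.
joined = join_on_b(R, S)                  # tuples (A, B, C)
plan1 = {(row[1],) for row in joined}

# Plan 2: project R onto B first, then join, then project onto B.
r_b = {(b,) for (_, b) in R}              # only 3 distinct tuples survive
joined_small = join_on_b(r_b, S)          # tuples (B, C)
plan2 = {(row[0],) for row in joined_small}

print(plan1 == plan2, len(joined), len(joined_small))
```

Both plans return the same B values, but the join in the second plan produces far fewer intermediate tuples on this data.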
-
Takeaways
• This process is called logical optimization
• We search among many equivalent plans for “good plans”
• Relational algebra is an important abstraction.
-
RA commutators
• The basic commutators:
– Push projection through (1) selection, (2) join
– Push selection through (3) selection, (4) projection, (5) join
– Also: Joins can be re-ordered!
• Note that this is not an exhaustive set of operations
This simple set of tools allows us to greatly improve the execution time of queries by optimizing RA plans!
-
Optimizing the SFW RA Plan
-
[Figure: RA plan tree with root Π_{A,D}, a selection σ on A, and leaves R(A,B), S(B,C), T(C,D)]
-
Logical Optimization
• Heuristically, we want selections and projections to occur as early as possible in the plan
– Terminology: “push down selections” and “push down projections.”
• Intuition: We will have fewer tuples in a plan.
-
[Figure: RA plan tree with root Π_{A,D}, a selection σ on A, and leaves R(A,B), S(B,C), T(C,D)]
-
Optimizing RA Plan
R(A,B) S(B,C) T(C,D)
SELECT R.A, T.D
FROM R, S, T
WHERE R.B = S.B
AND S.C = T.C
AND R.A < 10;
Π_{A,D}(T ⋈ (σ_{A<10}(R) ⋈ S))
Push down selection on A so it occurs earlier
-
Optimizing RA Plan
R(A,B) S(B,C) T(C,D)
SELECT R.A, T.D
FROM R, S, T
WHERE R.B = S.B
AND S.C = T.C
AND R.A < 10;
Π_{A,D}(T ⋈ (σ_{A<10}(R) ⋈ S))
Push down projection so it occurs earlier
-
Optimizing RA Plan
R(A,B) S(B,C) T(C,D)
SELECT R.A, T.D
FROM R, S, T
WHERE R.B = S.B
AND S.C = T.C
AND R.A < 10;
Π_{A,D}(T ⋈ Π_{A,C}(σ_{A<10}(R) ⋈ S))
We eliminate B earlier!
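The effect of the two push-downs can be checked on a small example. This is only a sketch: the relation instances are made up, and `natural_join`, `select`, and `project` are hypothetical helpers over lists of dicts, not part of any DBMS.

```python
# Made-up instances of R(A,B), S(B,C), T(C,D).
R = [{"A": a, "B": a % 5} for a in range(100)]
S = [{"B": b, "C": b % 2} for b in range(5)]
T = [{"C": c, "D": 10 * c} for c in range(2)]

def natural_join(left, right):
    """Join on every attribute name the two inputs share."""
    out = []
    for lt in left:
        for rt in right:
            shared = lt.keys() & rt.keys()
            if all(lt[k] == rt[k] for k in shared):
                out.append({**lt, **rt})
    return out

def select(rel, pred):
    return [t for t in rel if pred(t)]

def project(rel, attrs):
    """Projection with set semantics: returns a set of value tuples."""
    return {tuple(t[a] for a in attrs) for t in rel}

# Unoptimized: join everything, then select on A, then project.
rs = natural_join(R, S)
rst = natural_join(T, rs)
naive = project(select(rst, lambda t: t["A"] < 10), ["A", "D"])

# Optimized: select on A first, and drop B before joining with T.
r_small = select(R, lambda t: t["A"] < 10)
rs_small = natural_join(r_small, S)
ac = [{"A": a, "C": c} for a, c in project(rs_small, ["A", "C"])]
optimized = project(natural_join(T, ac), ["A", "D"])

print(naive == optimized, len(rs), len(rs_small))
```

On this data the first join shrinks from 100 tuples to 10 once the selection is applied before it, while the final answer is unchanged.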
-
2. PHYSICAL OPTIMIZATION
-
What you will learn about in this section
1. Index Selection
2. Histograms
-
Index Selection
Input:
– Schema of the database
– Workload description: set of (query template, frequency) pairs
Goal: Select a set of indexes that minimizes execution time of the workload.
– Cost / benefit balance: Each additional index may help with some queries, but requires updating
This is an optimization problem!
-
Example
Workload description:
Query 1 (Frequency 10,000,000):
SELECT pname
FROM Product
WHERE year = ? AND category = ? AND manufacturer = ?
Query 2 (Frequency 10,000,000):
SELECT pname
FROM Product
WHERE year = ? AND category = ?
Which indexes might we choose?
-
Example
Workload description:
Query 1 (Frequency 100):
SELECT pname
FROM Product
WHERE year = ? AND category = ? AND manufacturer = ?
Query 2 (Frequency 10,000,000):
SELECT pname
FROM Product
WHERE year = ? AND category = ?
Now which indexes might we choose? Worth keeping an index with manufacturer in its search key around?
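To experiment with candidate indexes for this workload, one option is to ask a real engine which plan it would pick. The sketch below uses Python's built-in `sqlite3` module with an empty in-memory `Product` table. The column types, the index name `idx_year_category`, and the bound parameter values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (pname TEXT, year INTEGER, category TEXT, manufacturer TEXT)"
)

def plan(sql, params):
    """Return SQLite's plan description for a parameterized query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

q_two_cols = "SELECT pname FROM Product WHERE year = ? AND category = ?"
q_three_cols = q_two_cols + " AND manufacturer = ?"

# Without any index, the only option is a full table scan.
before = plan(q_two_cols, (2017, "camera"))

# One composite index on (year, category) can serve both query templates.
conn.execute("CREATE INDEX idx_year_category ON Product (year, category)")
after_two = plan(q_two_cols, (2017, "camera"))
after_three = plan(q_three_cols, (2017, "camera", "Acme"))

print(before)
print(after_two)
print(after_three)
```

The index on (year, category) is usable by both templates, which is one reason an additional index that includes manufacturer may not pay for its update cost when that query is rare.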
-
Estimating index cost?
• Note that to frame as optimization problem, we first need an estimate of the cost of an index lookup
• Need to be able to estimate the costs of different indexes / index types…
We will see this mainly depends on getting estimates of result set size!
-
Ex: Clustered vs. Unclustered
Suppose we are using a B+ Tree index with:
• Fanout f
• Fill factor 2/3
Cost to do a range query for M entries over N-page file (P per page):
– Clustered:
• To traverse: log_f(1.5N)
• To scan: 1 random IO + (M−1)/P sequential IO
– Unclustered:
• To traverse: log_f(1.5N)
• To scan: ~M random IO
-
Plugging in some numbers
To simplify:
• Random IO = ~10 ms
• Sequential IO = free
• Clustered:
– To traverse: log_f(1.5N)
– To scan: 1 random IO + (M−1)/P sequential IO, i.e. ~1 random IO = 10 ms
• Unclustered:
– To traverse: log_f(1.5N)
– To scan: ~M random IO = M × 10 ms
• If M = 1, then there is no difference!
• If M = 100,000 records, then the difference is ~17 min (100,000 × 10 ms = 1,000 s) vs. 10 ms!
If only we had good estimates of M…
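The scan costs can be written as a small function. This is a sketch under the slide's simplifications (random IO of about 10 ms, sequential IO free). The function name `scan_cost_ms`, the rounding up of (M−1)/P, and the choice P = 100 are assumptions. The traversal cost log_f(1.5N) is the same in both cases, so it is left out of the comparison.

```python
import math

RANDOM_IO_MS = 10.0      # simplification from the slide: random IO ~ 10 ms
SEQUENTIAL_IO_MS = 0.0   # simplification from the slide: sequential IO is free

def scan_cost_ms(m_entries, per_page, clustered):
    """Cost of scanning M matching entries once the first one is located."""
    if clustered:
        # 1 random IO, then (M - 1) / P sequential page reads (rounded up here).
        sequential_ios = math.ceil((m_entries - 1) / per_page)
        return RANDOM_IO_MS + sequential_ios * SEQUENTIAL_IO_MS
    # Unclustered: roughly one random IO per matching entry.
    return m_entries * RANDOM_IO_MS

for m in (1, 100_000):
    clustered_ms = scan_cost_ms(m, per_page=100, clustered=True)
    unclustered_ms = scan_cost_ms(m, per_page=100, clustered=False)
    print(m, clustered_ms, unclustered_ms)
```

For M = 100,000 the unclustered scan comes to 1,000,000 ms, or 1,000 seconds, against 10 ms for the clustered one, which is why a good estimate of M matters.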
-
HISTOGRAMS & IO COST ESTIMATION
-
IO Cost Estimation via Histograms
• For index selection:
– What is the cost of an index lookup?
• Also for deciding which algorithm to use:
– Ex: To execute R ⋈ S, which join algorithm should DBMS use?
– What if we want to compute σ_{A>10}(R) ⋈ σ_{B=1}(S)?
• In general, we will need some way to estimate intermediate result set sizes
Histograms provide a way to efficiently store estimates of these quantities
-
Histograms
• A histogram is a set of value ranges (“buckets”) together with the frequency of the values that fall in each bucket
• How to choose the buckets?
– Equiwidth & Equidepth
• Turns out high-frequency values are very important
-
Example
[Figure: bar chart of Frequency (0 to 10) against Values 1 to 15]
How do we compute how many values between 8 and 10? (Yes, it’s obvious)
Problem: counts take up too much space!
-
Full vs. Uniform Counts
[Figure: bar chart of frequency (0 to 10) against values 1 to 15, comparing full and uniform counts]
How much space do the full counts (bucket_size = 1) take?
How much space do the uniform counts (bucket_size = ALL) take?
-
Fundamental Tradeoffs
• Want high resolution (like the full counts)
• Want low space (like uniform)
• Histograms are a compromise!
So how do we compute the “bucket” sizes?
-
Equi-width
[Figure: bar chart of frequency (0 to 10) against values 1 to 15, grouped into equi-width buckets]
All buckets roughly the same width
-
Equidepth
[Figure: bar chart of frequency (0 to 10) against values 1 to 15, grouped into equi-depth buckets]
All buckets contain roughly the same number of items (total frequency)
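The two bucketing schemes can be sketched in a few lines. The code assumes integer values in 1 to 15 with made-up frequencies, a domain that divides evenly into equi-width buckets, and values spread uniformly inside each bucket when estimating. The function names are hypothetical.

```python
# Made-up frequencies for values 1..15 (32 items in total).
freq = {1: 4, 2: 4, 3: 8, 4: 2, 5: 2, 6: 2, 7: 2}
freq.update({v: 1 for v in range(8, 16)})
values = [v for v, n in freq.items() for _ in range(n)]

def equi_width(vals, lo, hi, n_buckets):
    """Buckets of equal width; assumes the domain divides evenly."""
    width = (hi - lo + 1) // n_buckets
    buckets = []
    for i in range(n_buckets):
        b_lo = lo + i * width
        b_hi = b_lo + width - 1
        buckets.append((b_lo, b_hi, sum(1 for v in vals if b_lo <= v <= b_hi)))
    return buckets

def equi_depth(vals, n_buckets):
    """Buckets holding about len(vals) / n_buckets items each."""
    ordered = sorted(vals)
    depth = len(ordered) / n_buckets
    buckets = []
    for i in range(n_buckets):
        chunk = ordered[round(i * depth):round((i + 1) * depth)]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

def estimate_range(buckets, lo, hi):
    """Estimate the count in [lo, hi], assuming uniformity inside a bucket."""
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total

width_hist = equi_width(values, 1, 15, 3)
depth_hist = equi_depth(values, 4)
true_count = sum(1 for v in values if 8 <= v <= 10)
print(width_hist, estimate_range(width_hist, 8, 10))
print(depth_hist, estimate_range(depth_hist, 8, 10))
print(true_count)
```

Both histograms store only a few numbers per bucket, and on this data the equi-depth estimate for the range 8 to 10 is closer to the true count. Note that this simple equi-depth split can place equal values in adjacent buckets on other data.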
-
Histograms
• Simple, intuitive and popular
• Parameters: # of buckets and type
• Can extend to many attributes (multidimensional)
-
Maintaining Histograms
• Histograms require that we update them!
– Typically, you must run/schedule a command to update statistics on the database
– Out of date histograms can be terrible!
-
Nasty example
[Figure: bar chart of frequency (0 to 10) against values 1 to 15]
1. We insert many tuples with value > 16
2. We do not update the histogram
3. We ask for values > 20?
-
Compressed Histograms
• One popular approach:
1. Store the most frequent values and their counts explicitly
2. Keep an equiwidth or equidepth one for the rest of the values
People continue to try all manner of fanciness here: wavelets, graphical models, entropy models, …
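A minimal sketch of the compressed idea, assuming made-up data and, for brevity, a single uniform bucket for the remaining values instead of a full equiwidth or equidepth histogram. The names `build_compressed` and `estimate_equal` are hypothetical.

```python
from collections import Counter

def build_compressed(values, k):
    """Exact counts for the k most frequent values, one bucket for the rest."""
    counts = Counter(values)
    frequent = dict(counts.most_common(k))
    rest = [v for v in values if v not in frequent]
    return frequent, len(rest), len(set(rest))

def estimate_equal(hist, v):
    """Estimate how many items equal v."""
    frequent, rest_total, rest_distinct = hist
    if v in frequent:
        return frequent[v]                # stored exactly
    if rest_distinct == 0:
        return 0.0
    return rest_total / rest_distinct     # uniform assumption for the rest

# Made-up data: the value 3 is a heavy hitter.
values = [3] * 8 + [1] * 4 + [2] * 4 + list(range(4, 16))
hist = build_compressed(values, k=1)
print(estimate_equal(hist, 3), estimate_equal(hist, 10))
```

The heavy hitter is answered exactly, while every other value gets the average count of the remaining values, which keeps the summary small.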
-
Acknowledgement
• Some of the slides in this presentation are taken from the slides provided by the authors.
• Many of these slides are taken from the cs145 course offered by Stanford University.