CSC 261/461 –Database Systems Lecture 18•Project 3 is out (MongoDB): Due on: 04/12 •Project 2...

44
CSC 261/461 Database Systems Lecture 18 Spring 2017 MW 3:25 pm – 4:40 pm January 18 – May 3 Dewey 1101

Transcript of CSC 261/461 –Database Systems Lecture 18•Project 3 is out (MongoDB): Due on: 04/12 •Project 2...

  • CSC 261/461 – Database SystemsLecture 18

    Spring 2017MW 3:25 pm – 4:40 pm

    January 18 – May 3Dewey 1101

  • Announcements

    • Project 3 is out (MongoDB): Due on: 04/12• Project 2 Part 2 (Optional): Due on: 04/08• Project 1 Milestone 3: Due on: 04/05• Two Polls– Final Exam• We will have the Final on May 08

    • Although

    – Group vs. Individual Projects• (Mini) Project 3 and 4 both will be group project

    CSC261,Spring2017,UR

  • Topics for Today

    • Query Optimization (Chapter 19)

    CSC261,Spring2017,UR

  • We will cover

    1. LogicalOptimization

    2. PhysicalOptimization

    4

  • Logical vs. Physical Optimization

    • Logical optimization:– Find equivalent plans that are more efficient– Intuition: Minimize # of tuples at each step by

    changing the order of RA operators

    • Physical optimization:– Find algorithm with lowest IO cost to execute

    our plan– Intuition: Calculate based on physical

    parameters (buffer size, etc.) and estimates of data size (histograms)

    Execution

    SQLQuery

    RelationalAlgebra(RA)Plan

    OptimizedRAPlan

  • 1. LOGICAL OPTIMIZATION

    6

  • What you will learn about in this section

    1. OptimizationofRAPlans

    7

  • RDBMS Architecture

    How does a SQL engine work ?

    SQLQuery

    RelationalAlgebra(RA)Plan

    OptimizedRAPlan Execution

    Declarativequery(fromuser)

    Translatetorelationalalgebraexpresson

    Findlogicallyequivalent- butmoreefficient- RAexpression

    Executeeachoperatoroftheoptimizedplan!

  • RDBMS Architecture

    How does a SQL engine work ?

    SQLQuery

    RelationalAlgebra(RA)Plan

    OptimizedRAPlan Execution

    RelationalAlgebraallowsustotranslatedeclarative(SQL)queriesintopreciseandoptimizable expressions!

  • Recall: Logical Equivalence of RA Plans

    • Given relations R(A,B) and S(B,C):

    – Here, projection & selection commute: • 𝜎"#$(Π"(𝑅)) = Π"(𝜎"#$(𝑅))

    –What about here?• 𝜎"#$(Π*(𝑅))?= Π*(𝜎"#$(𝑅))

    We’lllookatthisinmoredepthlaterinthelecture…

  • RDBMS Architecture

    How does a SQL engine work ?

    SQLQuery

    RelationalAlgebra(RA)Plan

    OptimizedRAPlan Execution

    We’lllookathowtothenoptimizetheseplansnow

  • Note: We can visualize the plan as a tree

    Π*

    R(A,B) S(B,C)

    Π*(𝑅 𝐴, 𝐵 ⋈ 𝑆 𝐵, 𝐶 )

    Bottom-uptreetraversal=orderofoperationexecution!

  • A simple plan

    Π*

    R(A,B) S(B,C)

    WhatSQLquerydoesthiscorrespondto?

    ArethereanylogicallyequivalentRAexpressions?

  • “Pushing down” projection

    Π*

    R(A,B) S(B,C)

    Π*

    R(A,B) S(B,C)

    Π*

    Whymightwepreferthisplan?

  • Takeaways

    • This process is called logical optimization

    • Many equivalent plans used to search for “good plans”

    • Relational algebra is an important abstraction.

  • RA commutators

    • The basic commutators:– Push projection through (1) selection, (2) join– Push selection through (3) selection, (4) projection, (5) join– Also: Joins can be re-ordered!

    • Note that this is not an exhaustive set of operations

    ThissimplesetoftoolsallowsustogreatlyimprovetheexecutiontimeofqueriesbyoptimizingRAplans!

  • Optimizing the SFW RA Plan

  • Π",3

    R(A,B) S(B,C)

    T(C,D)

    sA

  • Logical Optimization

    • Heuristically, we want selections and projections to occur as early as possible in the plan – Terminology: “push down selections” and “push down

    projections.”

    • Intuition: We will have fewer tuples in a plan.

  • Π",3

    R(A,B) S(B,C)

    T(C,D)

    sA

  • Π",3

    R(A,B)

    S(B,C)

    T(C,D)

    Π",3 𝑇 ⋈ 𝜎"456(𝑅) ⋈ 𝑆

    SELECT R.A,S.DFROM R,S,TWHERE R.B = S.B

    AND S.C = T.CAND R.A < 10;

    R(A,B) S(B,C) T(C,D)

    Optimizing RA Plan

    PushdownselectiononAsoitoccursearlier

    sA

  • Π",3

    R(A,B)

    S(B,C)

    T(C,D)

    Π",3 𝑇 ⋈ 𝜎"456(𝑅) ⋈ 𝑆

    SELECT R.A,S.DFROM R,S,TWHERE R.B = S.B

    AND S.C = T.CAND R.A < 10;

    R(A,B) S(B,C) T(C,D)

    Optimizing RA Plan

    Pushdownprojectionsoitoccursearlier

    sA

  • Π",3

    R(A,B)

    S(B,C)

    T(C,D)

    Π",3 𝑇 ⋈ Π",8 𝜎"456(𝑅) ⋈ 𝑆

    SELECT R.A,S.DFROM R,S,TWHERE R.B = S.B

    AND S.C = T.CAND R.A < 10;

    R(A,B) S(B,C) T(C,D)

    Optimizing RA Plan

    WeeliminateBearlier!

    sA

  • 2. PHYSICAL OPTIMIZATION

    24

  • What you will learn about in this section

    1. IndexSelection

    2. Histograms

    25

  • Index Selection

    Input:– Schema of the database– Workload description: set of (query template, frequency) pairs

    Goal: Select a set of indexes that minimize execution time of the workload.– Cost / benefit balance: Each additional index may help with some

    queries, but requires updating

    Thisisanoptimizationproblem!

  • Example

    SELECT pname, FROM ProductWHERE year = ? AND Category = ? AND manufacturer = ?

    SELECT pnameFROM ProductWHERE year = ? AND category = ?

    Frequency10,000,000Workloaddescription:

    Frequency10,000,000

    Whichindexesmightwechoose?

  • Example

    SELECT pnameFROM ProductWHERE year = ? AND Category =? AND manufacturer = ?

    SELECT pnameFROM ProductWHERE year = ? AND category =?

    Frequency10,000,00

    0

    Workloaddescription:

    Frequency100

    Nowwhichindexesmightwechoose?Worthkeepinganindexwithmanufacturerinitssearchkeyaround?

  • Estimating index cost?

    • Note that to frame as optimization problem, we first need an estimate of the cost of an index lookup

    • Need to be able to estimate the costs of different indexes / index types…

    Wewillseethismainlydependsongettingestimatesofresultsetsize!

  • Ex: Clustered vs. Unclustered

    Cost to do a range query for M entries over N-page file (P per page):

    – Clustered: • To traverse: Logf(1.5N)

    • To scan: 1 random IO + :;5<

    sequential IO

    – Unclustered: • To traverse: Logf(1.5N)• To scan: ~ M random IO

    SupposeweareusingaB+Treeindexwith:• Fanout f• Fillfactor2/3

  • Plugging in some numbers

    • Clustered: – To traverse: LogF(1.5N)– To scan: 1 random IO + :;5

    <sequential IO

    • Unclustered: – To traverse: LogF(1.5N)– To scan: ~ M random IO

    • If M = 1, then there is no difference!• If M = 100,000 records, then difference is ~10min. Vs. 10ms!

    Tosimplify:• RandomIO=~10ms• SequentialIO=free

    ~1randomIO=10ms

    ~M randomIO=M*10ms

    IfonlywehadgoodestimatesofM…

  • HISTOGRAMS & IO COST ESTIMATION

    32

  • IO Cost Estimation via Histograms

    • For index selection:– What is the cost of an index lookup?

    • Also for deciding which algorithm to use:– Ex: To execute R ⋈ 𝑆, which join algorithm should DBMS use?

    – What if we want to compute 𝝈𝑨@𝟏𝟎(𝐑) ⋈ 𝝈𝑩#𝟏(𝑺)?

    • In general, we will need some way to estimate intermediate result set sizes

    Histogramsprovideawaytoefficientlystoreestimatesofthesequantities

  • Histograms

    • A histogram is a set of value ranges (“buckets”) and the frequencies of values in those buckets occurring

    • How to choose the buckets?– Equiwidth & Equidepth

    • Turns out high-frequency values are very important

  • 0

    5

    10

    1 2 3 4 5 6 7 8 9 101112131415Values

    Frequency

    Howdowecomputehowmanyvaluesbetween8and10?(Yes,it’sobvious)

    Problem:countstakeuptoomuchspace!

    Example

  • Full vs. Uniform Counts

    0

    2

    4

    6

    8

    10

    1 2 3 4 5 6 7 8 9 101112131415

    Howmuchspacedothefullcounts(bucket_size=1)take?

    Howmuchspacedotheuniformcounts(bucket_size=ALL)take?

  • Fundamental Tradeoffs

    • Want high resolution (like the full counts)

    • Want low space (like uniform)

    • Histograms are a compromise!

    Sohowdowecomputethe“bucket”sizes?

  • Equi-width

    0

    5

    10

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    Allbucketsroughlythesamewidth

  • Equidepth

    02468

    10

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    Allbucketscontainroughlythesamenumberofitems(totalfrequency)

  • Histograms

    • Simple, intuitive and popular

    • Parameters: # of buckets and type

    • Can extend to many attributes (multidimensional)

  • Maintaining Histograms

    • Histograms require that we update them!– Typically, you must run/schedule a command to update statistics

    on the database–Out of date histograms can be terrible!

  • Nasty example

    0

    5

    10

    1 2 3 4 5 6 7 8 9 101112131415

    1.weinsertmanytuples withvalue>162.wedonotupdatethehistogram3.weaskforvalues>20?

  • Compressed Histograms

    • One popular approach: 1. Store the most frequent values and their counts explicitly2. Keep an equiwidth or equidepth one for the rest of the values

    Peoplecontinuetotryallmanneroffancinessherewavelets,graphicalmodels,entropymodels,…

  • Acknowledgement

    • Some of the slides in this presentation are taken from the slides provided by the authors.

    • Many of these slides are taken from cs145 course offered byStanford University.

    CSC261,Spring2017,UR