Towards Universally Robust Plans – Perspective Seminar (transcript)
Developing Software Tools in Academia
Deepali Nemade
Anshuman Dutt, Database Systems Lab
In the real world
Software Types
Software
• System Software: operating systems, device drivers, server management
• Embedded Software: BIOS, washing machine control, routing utilities
• Programming Software: compilers, debuggers, interpreters, integrated development environments
• Application Software: word processors, image/video editing, video games, simulation software, databases, mathematical software, medical software, computer-aided design, educational software, industrial automation, decision-making software
Free software
Proprietary software
Database software
• A general-purpose DBMS is a software system designed to allow the definition, creation, querying, update, and administration of databases.
– Oracle
– SAP
– Microsoft Access
– IBM DB2
– Microsoft SQL Server
– HP NonStop SQL/MX
– PostgreSQL
– MySQL
– SQLite
Free software
Proprietary software
DBMS architecture
Query Optimizer
• The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.
• There is a trade-off between
– the amount of time spent figuring out the best query plan
– the quality of the plan
• Hence, the goal is to provide
– a "good enough" plan (performance comparable to the best)
– in a reasonable time
• The role of query optimizers has become especially critical in recent times due to the high degree of query complexity
– data warehousing and mining over databases
Query Plan Selection
• Core technique
[Diagram: Query (Q) → Query Optimizer (dynamic programming) → minimum-cost plan P(Q); the optimizer draws on the DB catalogs, a cost model, and the plan search space]
Need for careful plan selection
• The cost difference between the best plan choice and a random choice can be enormous (orders of magnitude!)
• Only a small fraction of plans in the (exponential) search space are really good
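To see why the search space is exponential, it helps to count the candidate join trees for an n-relation query. The sketch below (not part of the original slides) uses the standard counting formula for ordered bushy join trees:

```python
from math import factorial

def bushy_join_trees(n):
    """Number of distinct bushy join trees over n relations,
    counting both operand orders of every join: (2n-2)! / (n-1)!"""
    return factorial(2 * n - 2) // factorial(n - 1)

# The count explodes long before typical warehouse query sizes.
for n in (2, 4, 8, 16):
    print(n, bushy_join_trees(n))
```

Even before cost estimation, the optimizer therefore cannot afford to enumerate this space naively, which is what motivates dynamic programming over join subsets.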
Tools Developed
• PICASSO
– A tool developed at DSL – Best Demo at VLDB 2010
– Showed that current query optimizers have become much more complex than expected (VLDB 2005, VLDB 2008)
– This complexity of behavior is not actually required; we showed the possibility of simplification (VLDB 2007)
• CODD
– Another tool developed at DSL
– Testing of query optimizers does not need data, only statistics, so we can create DATALESS DATABASES (DBTest 2012)
– Analyzes the behavior of the optimizer for a given data instance
– Creates futuristic testing scenarios (impractical otherwise)
Picasso: Drawing Out the Artistic Talents of DB Query Optimizers
Mr. Query Optimizer
See, I am a painter too !!
Parametric query
Select *
From authors, publishers, sources ….
Where language = ‘ENG’ and year between ‘1997’ and ‘1999’
and lastname like ‘Autier’ and title like ‘melanoma’ …
Parametric query optimization (PQO)
• It attempts to identify several execution plans
– each of which is optimal for a subset of all possible values of the run-time parameters
– together termed the Parametric Optimal Set of Plans (POSP)
• At run time, when the parameter values are known
– a simple lookup using the parameters identifies the appropriate plan from the POSP
– avoids full-scale query optimization for each query instance
– hence saves optimization time
Expectations: a small number of plans in the POSP, and plan choices that do not change frequently.
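The run-time lookup step can be sketched as follows. The regions, plan names, and selectivity boundaries below are purely illustrative (the plan ids echo the P1/P3/P5 of the later sample diagram), not an actual POSP:

```python
# Hypothetical POSP: each entry maps a rectangular region of the 2-D
# selectivity space (ORDERS, CUSTOMER) to the plan that is optimal there.
POSP = [
    ((0.0, 0.2), (0.0, 1.0), "P1"),
    ((0.2, 1.0), (0.0, 0.5), "P3"),
    ((0.2, 1.0), (0.5, 1.0), "P5"),
]

def lookup_plan(sel_orders, sel_customer):
    """Run-time plan choice: a simple region lookup, no optimizer call."""
    for (lo1, hi1), (lo2, hi2), plan in POSP:
        if lo1 <= sel_orders < hi1 and lo2 <= sel_customer < hi2:
            return plan
    return None  # outside all regions: fall back to full optimization
```

With the POSP precomputed offline, each query instance costs only a table scan over a handful of regions instead of a full dynamic-programming optimization.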
Query Template [Q7 of TPC-H]
select supp_nation, cust_nation, l_year, sum(volume) as revenue
from (select n1.n_name as supp_nation, n2.n_name as cust_nation,
             extract(year from l_shipdate) as l_year,
             l_extendedprice * (1 - l_discount) as volume
      from supplier, lineitem, orders, customer, nation n1, nation n2
      where s_suppkey = l_suppkey and o_orderkey = l_orderkey
        and c_custkey = o_custkey and s_nationkey = n1.n_nationkey
        and c_nationkey = n2.n_nationkey
        and ((n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
          or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE'))
        and l_shipdate between date '1995-01-01' and date '1996-12-31'
        and o_totalprice ≤ C1 and c_acctbal ≤ C2) as shipping
group by supp_nation, cust_nation, l_year
order by supp_nation, cust_nation, l_year
The query determines the value of goods shipped between nations in a time period. The constant C1 determines the selectivity of the ORDERS relation; C2 determines the selectivity of the CUSTOMER relation.
Parametric Space
[Figure: 2-D parametric space with selectivity on each axis]
Plan Diagram Generation Process
[Figure: grid of query points over the ORDERS.o_totalprice and CUSTOMER.c_acctbal selectivities]
Sample Plan Diagram [QT7,OptB]
[Legend: Plan P1, Plan P3, Plan P5, …]
Sample Cost Diagram [QT7,OptB]
MinCost: 6.08E3 MaxCost: 3.24E4
The Picasso Connection
Plan diagrams are often similar to cubist paintings! [Pablo Picasso was a founder of the cubist genre]
Woman with a Guitar, Georges Braque, 1913
Complex Plan Diagram [QT8,OptA*]
Extremely fine-grained coverage (P76 ≈ 0.01%)
Highly irregular plan boundaries
Intricate, complex patterns
# of plans: 76
Increases to 90 plans with a 300×300 grid!
Picasso Architecture
[Architecture: the Picasso client sends query templates to the Picasso server, which issues STATS and EXPLAIN queries to the database engine, repeated several times across the grid; the engine-specific plan and statistics information (column statistics, plan trees) flows back to the client for visualization]
Overview
Picasso is a Java tool that, given a multi-dimensional SQL query template and a choice of database engine, automatically generates plan and cost diagrams.
– Fires queries at a user-specified granularity (10, 30, 100, 300, or 1000 queries per dimension)
– Visualization: 2-D plan diagrams (slices if n > 2); 3-D cost and cardinality diagrams; also plan trees and plan differences
– >60,000 lines of code (2004-12) with ~100 classes
– Uses Java3D, VisAD, JGraph, Swing
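The query-firing step enumerates a uniform grid over the selectivity space, one optimizer (EXPLAIN) call per grid point. A minimal sketch; placing points at cell centres is an assumption about how the tool chooses them, not a documented detail:

```python
def selectivity_grid(resolution):
    """Uniform grid of (x, y) selectivity points at which a
    Picasso-style tool fires one EXPLAIN query per point."""
    step = 1.0 / resolution
    # cell centres keep every selectivity strictly inside (0, 1)
    return [((i + 0.5) * step, (j + 0.5) * step)
            for i in range(resolution)
            for j in range(resolution)]

grid = selectivity_grid(10)   # 10 queries per dimension -> 100 points
```

This also explains the diagram-generation cost: a 300×300 grid means 90,000 optimizer invocations for a single query template.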
Tool Status
• Operational on DB2/Oracle/SQLServer/Sybase/PostgreSQL
• Copyrighted by IISc in May 2006
• Released as free software in Nov 2006 by the Associate Director of IISc
• Release of version 1.0 in May 2007, version 2.0 in Feb 2009, and version 3.0 in 2013
• In use at academic and industrial labs worldwide – CMU, Purdue, Duke, TU Munich, NU Singapore, IIT-B, …
– IBM, Microsoft, Oracle, Sybase, HP, …
• Received the Best Software award at the Very Large Data Bases (VLDB) conference, 2010
Why do they care?
• Excited the interest of the industrial and academic communities
– exposes serious problems and anomalies in current optimizer design
– useful to optimizer evaluators, debuggers, and designers
– for database administrators: a response-time fault profiler
– a testbed for database researchers
– an educational aid for students
Why do we care ?
• Not software development for its own sake
• Development of the tool has thrown up many core CS research problems involving theory, algorithms, statistics, tree matching, …
– We will discuss one of these next
Plan Diagram Reduction
Can the plan diagram be recolored with a smaller set of colors (i.e., some plans are "swallowed" by others), such that the following guarantee holds:
No query point in the original diagram has its cost increased, post-swallowing, by more than λ percent (user-defined).
Analogy: Sri Lanka agrees to be annexed by India if it is assured that the cost of living of each Lankan citizen will not increase by more than λ percent.
Complex Plan Diagram [QT8, OptA*] → Reduced Plan Diagram [λ = 10%]
Reduced to 5 plans from 76!
Comparatively smoother contours
Note:
• A 10% threshold is well within the confidence intervals of the cost estimates of modern optimizers
• The degradation value is an upper bound; the average degradation is much lower in practice
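The swallowing idea can be sketched as a greedy recoloring pass. This is a minimal illustration of the λ-bounded guarantee, not Picasso's actual reduction algorithm, and it assumes an estimated cost is available for every (point, plan) pair:

```python
def reduce_plan_diagram(diagram, costs, lam=0.10):
    """
    diagram: dict mapping a grid point to the id of its optimal plan
    costs:   dict mapping (point, plan) to the estimated cost of
             running that plan at that point
    A plan may be swallowed only if, at every point it owns, the
    swallower's cost stays within (1 + lam) of the original cost,
    so no point degrades by more than lam.
    """
    plans = sorted(set(diagram.values()),
                   key=lambda p: sum(pl == p for pl in diagram.values()))
    for victim in plans:                    # smallest plan areas first
        points = [pt for pt, pl in diagram.items() if pl == victim]
        if not points:
            continue
        for eater in plans:
            if eater == victim:
                continue
            if all(costs[(pt, eater)] <= (1 + lam) * costs[(pt, victim)]
                   for pt in points):
                for pt in points:
                    diagram[pt] = eater     # recolor: victim swallowed
                break
    return diagram
```

Because each swallowed point is individually checked against the (1 + λ) bound, the per-point guarantee from the slide holds by construction, while the number of distinct colors can drop sharply.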
Demo Diagrams
• Plan Diagram
• Cost Diagram
• Reduced Plan-Diagram
• Plan-tree Diagram (qualitative / quantitative)
• Plan-difference Diagram
Picasso Demo
CODD: COnstructing Dataless Databases
Average Joe and Mr. CODD
Why do we need such a tool?
• We want to construct futuristic metadata scenarios for effective testing, e.g., test the query optimizer with metadata corresponding to a 100 GB data scenario.
[Cartoon: Average Joe: "Let's create a 100 GB data scenario… Impossible!" Mr. CODD: "I can directly create the metadata."]
CODD Overview
[Architecture: the CODD Metadata Processor offers a vendor-neutral interface and reads/writes the metadata store of the database engine]
Modes:
1. Metadata Construction
2. Metadata Retention
3. Metadata Porting
4. Metadata Scaling
Supported engines: DB2, Oracle, SQL Server
• Provides an interface for ab-initio creation of metadata
Metadata Construction
[Construct Mode: the DB tester, through the CODD interface, (1) creates relation schemas, (2) inputs metadata values, and (3) fills the catalog tables – producing metadata without any data]
Metadata Validation – Construct Mode
• Can the user input arbitrary values?
– No. The input metadata values must be
• legal (valid type and range)
• consistent with the other metadata values
• Verification approach
– Construct a directed acyclic constraint graph CG(V, E)
• V represents the set of metadata entities and their structural constraints
• E represents the consistency constraints
– Run a topological sort on CG to obtain CGlinear. Complexity: O(|V| + |E|)
– Guide the user through the linear ordering and ensure that the constraints are met along it
– e.g. Column cardinality (Colcard): integer type; equal to -1 or Colcard > 0; and Colcard <= Card
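The validation flow can be sketched with Python's standard graphlib. The entities, dependency edges, and checks below are a hypothetical miniature of the DB2 constraint graph, chosen only to show the topological ordering and the Colcard example above:

```python
from graphlib import TopologicalSorter

# Hypothetical mini constraint graph: each metadata entity lists
# the entities it must be consistent with (its predecessors).
DEPENDS_ON = {
    "Card": [],
    "NPages": ["Card"],
    "Colcard": ["Card"],
    "High2Key": ["Colcard"],
    "Low2Key": ["Colcard"],
}

# Legality + consistency checks, applied along the linear ordering;
# m holds the already-accepted predecessor values.
CHECKS = {
    "Card": lambda v, m: v == -1 or v >= 0,
    "Colcard": lambda v, m: v == -1 or 0 < v <= m["Card"],
}

def validate(values):
    """Force the input through the linear ordering of the graph."""
    order = TopologicalSorter(DEPENDS_ON).static_order()
    accepted = {}
    for entity in order:
        v = values[entity]
        check = CHECKS.get(entity, lambda v, m: True)
        if not check(v, accepted):
            raise ValueError(f"{entity}={v} violates its constraints")
        accepted[entity] = v
    return accepted
```

Because the topological sort puts Card before Colcard, the Colcard <= Card check always sees an already-validated Card, which is exactly why CODD walks the user through CGlinear rather than accepting fields in arbitrary order.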
Metadata Validation
[Figure: DB2 directed acyclic constraint graph. Relation-level metadata: Card (1) – integer type, >= 0 or -1 – NPages (2), FPages (3), Overflow (4). Column-level metadata: Colcard (5), NumNULLS (6), AvgColLenChar (7), Low2Key (8), High2Key (9), Frequency Value Distribution (10), Quantile Value Distribution (11). Index-level metadata: IndCard (12), ClusterFactor (13), Page_Fetch_Pairs (14), NumRIDs (15), NLeaf (16), NLevels (17), Num_Empty_Leafs (18), Density (19). The numbers signify the validation order; edges encode legality and consistency constraints; the distribution super nodes carry additional constraints.]
Metadata Validation
DB2 Constraint Graph – Super Nodes Expanded
[Figure: the Frequency Value Distribution and Quantile Value Distribution super nodes each expand into sequences of (ColValue, ValCount, DistCount) entries, constrained by High2Key, Low2Key, ColCard, and Card]
Space overheads due to stored data
• What if we have only the data and not the metadata?
Drop Mode
[Mr. CODD: "I will remove the data from the database."] The metadata is retained, so there are no space overheads during the testing process.
Porting Mode
• Supports transfer of metadata statistics from one engine to another
[Figure: metadata is read from the catalogs of DB Engine 1 and written to the catalogs of DB Engine 2]
Scale Mode
• Database engine testing has always involved testing on scaled database instances
– the TPC-H benchmark provides datasets from 1 GB to 100 TB
• Can we directly produce a scaled version of the metadata?
• After scaling, the data remains the same; only the metadata is scaled
[Figure: CODD transforms metadata corresponding to a 1 GB database into metadata corresponding to a 100 GB database]
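A minimal sketch of size-based metadata scaling. The field names, the sample LINEITEM numbers, and the choice to scale distinct values linearly (capped at the row count) are illustrative assumptions, not CODD's actual catalog schema or scaling model:

```python
def scale_metadata(stats, factor):
    """Scale size-dependent statistics; leave value ranges untouched."""
    scaled = dict(stats)
    for key in ("card", "npages"):       # row and page counts grow with data
        scaled[key] = stats[key] * factor
    # distinct values can never exceed the row count
    scaled["colcard"] = min(stats["colcard"] * factor, scaled["card"])
    return scaled

# Illustrative 1 GB LINEITEM-like statistics, scaled to a 100 GB scenario.
lineitem = {"card": 6_000_000, "npages": 100_000,
            "colcard": 1_500_000, "high2key": 104950, "low2key": 901}
scaled = scale_metadata(lineitem, 100)
```

The key point the sketch makes is that scaling touches only these catalog numbers; no 100 GB of data is ever generated or loaded.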
CODD in Action
• The following experiment shows how we can easily assess, using CODD, the optimizer's altered behavior in response to futuristic scenarios
[Figure: QT9 plan diagrams, baseline (32 plans) and scaled (77 plans)]
CODD in Action
• By iteratively executing CODD on a popular commercial query optimizer, with the database size increasing in each iteration, it was discovered that the cardinality estimation module "saturated" when the input data size exceeded 10^19 bytes – no mention of this threshold was found in the publicly available documentation of the system
Overview
• CODD is a Java-based graphical tool that supports ab-initio creation, retention, scaling, and porting of metadata statistics
– Operational on DB2/Oracle/SQLServer/Sybase/PostgreSQL
– Around 40,000 LOC
– Released in 2012
– Accepted at DBTest 2012
– Awarded at IBM-ICARE 2012
Construct Mode Demo
• Database engine: DB2
• The TPC-H schema is created (data is not loaded)
• The relation PART is chosen for construction from scratch
Thank you for your attention!
Questions ?