Towards Universally Robust Plans – Perspective Seminar (transcript)
Developing Software Tools in Academia
Deepali Nemade
Anshuman Dutt, Database Systems Lab
In the real world
Software Types
Software
• System Software: operating systems, device drivers, server management
• Embedded Software: BIOS, washing machine control, routing utilities
• Programming Software: compilers, debuggers, interpreters, integrated development environments
• Application Software: word processors, image/video editing, video games, simulation software, databases, mathematical software, medical software, computer-aided design, educational software, industrial automation, decision-making software
Free software
Proprietary software
Database software
• A general-purpose DBMS is a software system designed to allow the definition, creation, querying, update, and administration of databases.
– Oracle
– SAP
– Microsoft Access
– IBM DB2
– Microsoft SQL Server
– HP NonStop SQL/MX
– PostgreSQL
– MySQL
– SQLite
Free software
Proprietary software
DBMS architecture
Query Optimizer
• The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.
• There is a trade-off between
– the amount of time spent figuring out the best query plan
– the quality of the plan
• Hence, the goal is to provide
– a "good enough" plan (performance comparable to the best)
– in a reasonable time
• The role of query optimizers has become especially critical in recent times due to the high degree of query complexity
– data warehousing and mining over databases
Query Plan Selection
• Core technique
[Diagram: Query (Q) → Query Optimizer (dynamic programming) → minimum-cost plan P(Q); the optimizer draws on the DB catalogs, a cost model, and the plan search space]
Need for careful plan selection
• The cost difference between the best plan choice and a random choice can be enormous (orders of magnitude!)
• Only a small fraction of plans in the (exponential) search space are really good
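To see why the search space is exponential, it helps to count the candidate join trees for an n-relation query. The sketch below (not part of the original slides) uses the standard counting formula for ordered bushy join trees:

```python
from math import factorial

def bushy_join_trees(n):
    """Number of distinct bushy join trees over n relations,
    counting both operand orders of every join: (2n-2)! / (n-1)!"""
    return factorial(2 * n - 2) // factorial(n - 1)

# The count explodes long before typical warehouse query sizes.
for n in (2, 4, 8, 16):
    print(n, bushy_join_trees(n))
```

Even before cost estimation, the optimizer therefore cannot afford to enumerate this space naively, which is what motivates dynamic programming over join subsets.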
Tools Developed
• PICASSO
– A tool developed at DSL – Best Demo at VLDB 2010
– Showed that current query optimizers have become much more complex than expected (VLDB 2005, VLDB 2008)
– This complexity of behavior is not actually required; we showed the possibility of simplification (VLDB 2007)
• CODD
– Another tool developed at DSL
– Testing of query optimizers does not need data, only statistics, so we can create DATALESS DATABASES (DBTest 2012)
– Analyzes the behavior of the optimizer for a given data instance
– Creates futuristic testing scenarios (impractical otherwise)
Picasso: Drawing Out the Artistic Talents of DB Query Optimizers
Mr. Query Optimizer
See, I am a painter too !!
Parametric query
Select *
From authors, publishers, sources ….
Where language = ‘ENG’ and year between ‘1997’ and ‘1999’
and lastname like ‘Autier’ and title like ‘melanoma’ …
Parametric query optimization (PQO)
• It attempts to identify several execution plans
– each of which is optimal for a subset of all possible values of the run-time parameters
– together termed the Parametric Optimal Set of Plans (POSP)
• At run time, when the parameter values are known
– a simple lookup using the parameters identifies the appropriate plan from the POSP
– avoids full-scale query optimization for each query instance
– hence saves optimization time
Expectations: a small number of plans in the POSP, and plan choices that do not change frequently.
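The run-time lookup step can be sketched as follows. The regions, plan names, and selectivity boundaries below are purely illustrative (the plan ids echo the P1/P3/P5 of the later sample diagram), not an actual POSP:

```python
# Hypothetical POSP: each entry maps a rectangular region of the 2-D
# selectivity space (ORDERS, CUSTOMER) to the plan that is optimal there.
POSP = [
    ((0.0, 0.2), (0.0, 1.0), "P1"),
    ((0.2, 1.0), (0.0, 0.5), "P3"),
    ((0.2, 1.0), (0.5, 1.0), "P5"),
]

def lookup_plan(sel_orders, sel_customer):
    """Run-time plan choice: a simple region lookup, no optimizer call."""
    for (lo1, hi1), (lo2, hi2), plan in POSP:
        if lo1 <= sel_orders < hi1 and lo2 <= sel_customer < hi2:
            return plan
    return None  # outside all regions: fall back to full optimization
```

With the POSP precomputed offline, each query instance costs only a table scan over a handful of regions instead of a full dynamic-programming optimization.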
Query Template [Q7 of TPC-H]
select supp_nation, cust_nation, l_year, sum(volume) as revenue
from (select n1.n_name as supp_nation, n2.n_name as cust_nation,
             extract(year from l_shipdate) as l_year,
             l_extendedprice * (1 - l_discount) as volume
      from supplier, lineitem, orders, customer, nation n1, nation n2
      where s_suppkey = l_suppkey and o_orderkey = l_orderkey
        and c_custkey = o_custkey and s_nationkey = n1.n_nationkey
        and c_nationkey = n2.n_nationkey
        and ((n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
          or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE'))
        and l_shipdate between date '1995-01-01' and date '1996-12-31'
        and o_totalprice ≤ C1 and c_acctbal ≤ C2) as shipping
group by supp_nation, cust_nation, l_year
order by supp_nation, cust_nation, l_year
The query determines the value of goods shipped between nations in a time period. The constant C1 determines the selectivity of the ORDERS relation; C2 determines the selectivity of the CUSTOMER relation.
Parametric Space
[Figure: 2-D parametric space with selectivity on each axis]
Plan Diagram Generation Process
[Figure: grid of query points over the ORDERS.o_totalprice and CUSTOMER.c_acctbal selectivities]
Sample Plan Diagram [QT7,OptB]
[Legend: Plan P1, Plan P3, Plan P5, …]
Sample Cost Diagram [QT7,OptB]
MinCost: 6.08E3 MaxCost: 3.24E4
The Picasso Connection
Plan diagrams are often similar to cubist paintings! [Pablo Picasso was a founder of the cubist genre]
Woman with a Guitar, Georges Braque, 1913
Complex Plan Diagram [QT8,OptA*]
Extremely fine-grained coverage (P76 ≈ 0.01%)
Highly irregular plan boundaries
Intricate, complex patterns
# of plans: 76
Increases to 90 plans with a 300×300 grid!
Picasso Architecture
[Architecture: the Picasso client sends query templates to the Picasso server, which issues STATS and EXPLAIN queries to the database engine, repeated several times across the grid; the engine-specific plan and statistics information (column statistics, plan trees) flows back to the client for visualization]
Overview
Picasso is a Java tool that, given a multi-dimensional SQL query template and a choice of database engine, automatically generates plan and cost diagrams.
– Fires queries at a user-specified granularity (10, 30, 100, 300, or 1000 queries per dimension)
– Visualization: 2-D plan diagrams (slices if n > 2); 3-D cost and cardinality diagrams; also plan trees and plan differences
– >60,000 lines of code (2004-12) with ~100 classes
– Uses Java3D, VisAD, JGraph, Swing
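The query-firing step enumerates a uniform grid over the selectivity space, one optimizer (EXPLAIN) call per grid point. A minimal sketch; placing points at cell centres is an assumption about how the tool chooses them, not a documented detail:

```python
def selectivity_grid(resolution):
    """Uniform grid of (x, y) selectivity points at which a
    Picasso-style tool fires one EXPLAIN query per point."""
    step = 1.0 / resolution
    # cell centres keep every selectivity strictly inside (0, 1)
    return [((i + 0.5) * step, (j + 0.5) * step)
            for i in range(resolution)
            for j in range(resolution)]

grid = selectivity_grid(10)   # 10 queries per dimension -> 100 points
```

This also explains the diagram-generation cost: a 300×300 grid means 90,000 optimizer invocations for a single query template.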
Tool Status
• Operational on DB2/Oracle/SQLServer/Sybase/PostgreSQL
• Copyrighted by IISc in May 2006
• Released as free software in Nov 2006 by the Associate Director of IISc
• Release of version 1.0 in May 2007, version 2.0 in Feb 2009, and version 3.0 in 2013
• In use at academic and industrial labs worldwide – CMU, Purdue, Duke, TU Munich, NU Singapore, IIT-B, …
– IBM, Microsoft, Oracle, Sybase, HP, …
• Received the Best Software award at the Very Large Data Bases (VLDB) conference, 2010
Why do they care?
• Excited the interest of the industrial and academic communities
– exposes serious problems and anomalies in current optimizer design
– useful to optimizer evaluators, debuggers, and designers
– for database administrators: a response-time fault profiler
– a testbed for database researchers
– an educational aid for students
Why do we care ?
• Not software development for its own sake
• Development of the tool has thrown up many core CS research problems involving theory, algorithms, statistics, tree matching, …
– We will discuss one of these next
Plan Diagram Reduction
Can the plan diagram be recolored with a smaller set of colors (i.e., some plans are "swallowed" by others), such that the following guarantee holds:
No query point in the original diagram has its cost increased, post-swallowing, by more than λ percent (user-defined).
Analogy: Sri Lanka agrees to be annexed by India if it is assured that the cost of living of each Lankan citizen will not increase by more than λ percent.
Complex Plan Diagram [QT8, OptA*] → Reduced Plan Diagram [λ = 10%]
Reduced to 5 plans from 76!
Comparatively smoother contours
Note:
• A 10% threshold is well within the confidence intervals of the cost estimates of modern optimizers
• The degradation value is an upper bound; the average degradation is much lower in practice
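The swallowing idea can be sketched as a greedy recoloring pass. This is a minimal illustration of the λ-bounded guarantee, not Picasso's actual reduction algorithm, and it assumes an estimated cost is available for every (point, plan) pair:

```python
def reduce_plan_diagram(diagram, costs, lam=0.10):
    """
    diagram: dict mapping a grid point to the id of its optimal plan
    costs:   dict mapping (point, plan) to the estimated cost of
             running that plan at that point
    A plan may be swallowed only if, at every point it owns, the
    swallower's cost stays within (1 + lam) of the original cost,
    so no point degrades by more than lam.
    """
    plans = sorted(set(diagram.values()),
                   key=lambda p: sum(pl == p for pl in diagram.values()))
    for victim in plans:                    # smallest plan areas first
        points = [pt for pt, pl in diagram.items() if pl == victim]
        if not points:
            continue
        for eater in plans:
            if eater == victim:
                continue
            if all(costs[(pt, eater)] <= (1 + lam) * costs[(pt, victim)]
                   for pt in points):
                for pt in points:
                    diagram[pt] = eater     # recolor: victim swallowed
                break
    return diagram
```

Because each swallowed point is individually checked against the (1 + λ) bound, the per-point guarantee from the slide holds by construction, while the number of distinct colors can drop sharply.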
Demo Diagrams
• Plan Diagram
• Cost Diagram
• Reduced Plan-Diagram
• Plan-tree Diagram (qualitative / quantitative)
• Plan-difference Diagram
Picasso Demo
CODD: COnstructing Dataless Databases
Average Joe and Mr. CODD
Why do we need such a tool?
• We want to construct futuristic metadata scenarios for effective testing, e.g., test the query optimizer with metadata corresponding to a 100 GB data scenario.
[Cartoon: Average Joe: "Let's create a 100 GB data scenario… Impossible!" Mr. CODD: "I can directly create the metadata."]
CODD Overview
[Architecture: the CODD Metadata Processor offers a vendor-neutral interface and reads/writes the metadata store of the database engine]
Modes:
1. Metadata Construction
2. Metadata Retention
3. Metadata Porting
4. Metadata Scaling
Supported engines: DB2, Oracle, SQL Server
• Provides an interface for ab-initio creation of metadata
Metadata Construction
[Construct Mode: the DB tester, through the CODD interface, (1) creates relation schemas, (2) inputs metadata values, and (3) fills the catalog tables – producing metadata without any data]
Metadata Validation – Construct Mode
• Can the user input arbitrary values?
– No. The input metadata values must be
• legal (valid type and range)
• consistent with the other metadata values
• Verification approach
– Construct a directed acyclic constraint graph CG(V, E)
• V represents the set of metadata entities and their structural constraints
• E represents the consistency constraints
– Run a topological sort on CG to obtain CGlinear. Complexity: O(|V| + |E|)
– Guide the user through the linear ordering and ensure that the constraints are met along it
– e.g. Column cardinality (Colcard): integer type; equal to -1 or Colcard > 0; and Colcard <= Card
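The validation flow can be sketched with Python's standard graphlib. The entities, dependency edges, and checks below are a hypothetical miniature of the DB2 constraint graph, chosen only to show the topological ordering and the Colcard example above:

```python
from graphlib import TopologicalSorter

# Hypothetical mini constraint graph: each metadata entity lists
# the entities it must be consistent with (its predecessors).
DEPENDS_ON = {
    "Card": [],
    "NPages": ["Card"],
    "Colcard": ["Card"],
    "High2Key": ["Colcard"],
    "Low2Key": ["Colcard"],
}

# Legality + consistency checks, applied along the linear ordering;
# m holds the already-accepted predecessor values.
CHECKS = {
    "Card": lambda v, m: v == -1 or v >= 0,
    "Colcard": lambda v, m: v == -1 or 0 < v <= m["Card"],
}

def validate(values):
    """Force the input through the linear ordering of the graph."""
    order = TopologicalSorter(DEPENDS_ON).static_order()
    accepted = {}
    for entity in order:
        v = values[entity]
        check = CHECKS.get(entity, lambda v, m: True)
        if not check(v, accepted):
            raise ValueError(f"{entity}={v} violates its constraints")
        accepted[entity] = v
    return accepted
```

Because the topological sort puts Card before Colcard, the Colcard <= Card check always sees an already-validated Card, which is exactly why CODD walks the user through CGlinear rather than accepting fields in arbitrary order.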
Metadata Validation
[Figure: DB2 directed acyclic constraint graph. Relation-level metadata: Card (1) – integer type, >= 0 or -1 – NPages (2), FPages (3), Overflow (4). Column-level metadata: Colcard (5), NumNULLS (6), AvgColLenChar (7), Low2Key (8), High2Key (9), Frequency Value Distribution (10), Quantile Value Distribution (11). Index-level metadata: IndCard (12), ClusterFactor (13), Page_Fetch_Pairs (14), NumRIDs (15), NLeaf (16), NLevels (17), Num_Empty_Leafs (18), Density (19). The numbers signify the validation order; edges encode legality and consistency constraints; the distribution super nodes carry additional constraints.]
Metadata Validation
DB2 Constraint Graph – Super Nodes Expanded
[Figure: the Frequency Value Distribution and Quantile Value Distribution super nodes each expand into sequences of (ColValue, ValCount, DistCount) entries, constrained by High2Key, Low2Key, ColCard, and Card]
Space overheads due to stored data
• What if we have only the data and not the metadata?
Drop Mode
[Mr. CODD: "I will remove the data from the database."] The metadata is retained, so there are no space overheads during the testing process.
Porting Mode
• Supports transfer of metadata statistics from one engine to another
[Figure: metadata is read from the catalogs of DB Engine 1 and written to the catalogs of DB Engine 2]
Scale Mode
• Database engine testing has always involved testing on scaled database instances
– the TPC-H benchmark provides datasets from 1 GB to 100 TB
• Can we directly produce a scaled version of the metadata?
• After scaling, the data remains the same; only the metadata is scaled
[Figure: CODD transforms metadata corresponding to a 1 GB database into metadata corresponding to a 100 GB database]
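A minimal sketch of size-based metadata scaling. The field names, the sample LINEITEM numbers, and the choice to scale distinct values linearly (capped at the row count) are illustrative assumptions, not CODD's actual catalog schema or scaling model:

```python
def scale_metadata(stats, factor):
    """Scale size-dependent statistics; leave value ranges untouched."""
    scaled = dict(stats)
    for key in ("card", "npages"):       # row and page counts grow with data
        scaled[key] = stats[key] * factor
    # distinct values can never exceed the row count
    scaled["colcard"] = min(stats["colcard"] * factor, scaled["card"])
    return scaled

# Illustrative 1 GB LINEITEM-like statistics, scaled to a 100 GB scenario.
lineitem = {"card": 6_000_000, "npages": 100_000,
            "colcard": 1_500_000, "high2key": 104950, "low2key": 901}
scaled = scale_metadata(lineitem, 100)
```

The key point the sketch makes is that scaling touches only these catalog numbers; no 100 GB of data is ever generated or loaded.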
CODD in Action
• The following experiment shows how we can easily assess, using CODD, the optimizer's altered behavior in response to futuristic scenarios
[Figure: QT9 plan diagrams, baseline (32 plans) and scaled (77 plans)]
CODD in Action
• By iteratively executing CODD on a popular commercial query optimizer, with the database size increasing in each iteration, it was discovered that the cardinality estimation module "saturated" when the input data size exceeded 10^19 bytes – no mention of this threshold was found in the publicly available documentation of the system
Overview
• CODD is a Java-based graphical tool that supports ab-initio creation, retention, scaling, and porting of metadata statistics
– Operational on DB2/Oracle/SQLServer/Sybase/PostgreSQL
– Around 40,000 LOC
– Released in 2012
– Accepted at DBTest 2012
– Awarded at IBM-ICARE 2012
Construct Mode Demo
• Database engine: DB2
• The TPC-H schema is created (data is not loaded)
• The relation PART is chosen for construction from scratch
Thank you for your attention!
Questions ?