Denodo Data Virtualization Platform Architecture: Performance (session 2 from Architect to Architect...
Architect-to-Architect Series
■ Series of five webinars over next 2 months
■ Deeper look into Denodo Platform
■ Architectural Overview
■ Performance (today’s session)
■ Scalability
■ Data Discovery and Governance
■ Security
Denodo Express
■ Denodo Express
■ Free to Download
■ Fully functioning Data Virtualization Platform
■ Single user, supports common data sources
■ Many of the same capabilities of Denodo Platform
■ Performance, Data Discovery, Governance, internal Security, Publishing, …
Performance – Optimizer, etc.
■ Optimizer
■ The Optimizer applies state-of-the-art optimization techniques to relational and non-relational sources.
■ Query Plan Generator
■ The Plan Generator is in charge of generating possible execution plans for the query and selecting the optimum one.
■ Execution Engine
■ Responsible for executing the selected query plan, executing the necessary sub-queries on the sources (or collecting data from cache as appropriate) and integrating the results to generate the global response.
Performance Optimization
■ Advanced Query Optimization
■ Cost and Source Constraint Based Query Plans
■ Query Delegation
■ Automatic Query Rewriting
■ Join Optimizations
■ Data Movement
■ Asynchronous Multi-threaded Processing
■ Server Throttling Mechanisms
■ Scalability
■ Caching
■ Multiple configuration modes – full or partial
Static vs. Dynamic Optimization
■ Static optimization
■ Takes place before query is executed
■ Rewrite query in more optimal way
■ Push-down delegation
■ Optimize query by – where possible – pushing down sub-trees to underlying data source
■ Delegate functions to underlying data source
■ Dynamic optimization
■ Use statistics and indices to estimate costs of alternative execution plans
■ Select Join methods and Join ordering
Cost-based Optimization
■ Objective – select best execution method for each operation
■ Estimate query costs based on:
■ View statistics
■ No. of rows, row size, for each field: max value, min value, no. of different values, …
■ View indices
■ Available indices, type of indices (clustered, hash, …)
■ Data source I/O information
■ Block size, blocks/read operation, data transfer rate, …
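The statistics listed above could feed a cost estimate along these lines. This is only a textbook-style sketch: the `ViewStats` class, the uniform-distribution assumption, and the sample figures are illustrative assumptions, not Denodo's actual cost model.

```python
from dataclasses import dataclass


@dataclass
class ViewStats:
    """Hypothetical container for per-view statistics."""
    rows: int        # number of rows in the view
    row_size: int    # average row size in bytes
    distinct: dict   # field name -> number of different values


def equality_selectivity(stats: ViewStats, field: str) -> float:
    """Fraction of rows expected to match `field = constant`,
    assuming values are uniformly distributed."""
    return 1.0 / stats.distinct[field]


def estimated_rows(stats: ViewStats, field: str) -> float:
    """Estimated number of rows that survive an equality filter."""
    return stats.rows * equality_selectivity(stats, field)


def transfer_seconds(rows: float, row_size: int, bytes_per_second: float) -> float:
    """Estimated time to move the filtered rows over the network."""
    return rows * row_size / bytes_per_second


# Made-up figures: 1M rows, 200 bytes each, 4 distinct regions.
sales = ViewStats(rows=1_000_000, row_size=200, distinct={"region": 4})
rows = estimated_rows(sales, "region")
secs = transfer_seconds(rows, sales.row_size, bytes_per_second=50_000_000)
```

With these made-up numbers, a filter on `region` is estimated to keep a quarter of the rows, and the transfer estimate follows from row count, row size, and transfer rate.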
Source Constraint Optimization
■ Denodo Platform optimization has to work across multiple diverse data source types
■ Not just relational databases
■ Not all data sources have same capabilities
■ Recognize and optimize for constraints in underlying data sources
■ e.g. a MySQL source can return rows ordered for a Merge join, but a delimited file cannot
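One way to picture source constraints is a capability table that the optimizer consults before proposing a join method. The source names, capability sets, and the `candidate_join_methods` helper below are hypothetical and only illustrate the idea.

```python
# Hypothetical capability table: which operations each source type can run itself.
CAPABILITIES = {
    "mysql": {"order_by", "join", "group_by"},
    "delimited_file": set(),   # can only be read as-is
}


def candidate_join_methods(left_source: str, right_source: str) -> list:
    """Return join methods worth costing for a pair of sources.

    A merge join is only offered when both sources can deliver
    their rows already sorted on the join key.
    """
    methods = ["hash", "nested"]
    both_sortable = (
        "order_by" in CAPABILITIES[left_source]
        and "order_by" in CAPABILITIES[right_source]
    )
    if both_sortable:
        methods.insert(0, "merge")
    return methods


db_to_db = candidate_join_methods("mysql", "mysql")
db_to_file = candidate_join_methods("mysql", "delimited_file")
```

Here the merge join is a candidate only for the database-to-database pair; the pair involving the delimited file falls back to the other methods.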
Query Delegation
■ Objective – Push the processing to the data
■ Utilize power and optimizations of underlying data sources
■ Especially relational databases and data warehouses
■ Minimize expensive data movement
■ Delegation mechanisms
■ Vendor specific SQL dialect
■ Function delegation
■ Configurable by data source
■ Delegate SQL operations
■ e.g. Join, Union, Group By, Order By, etc.
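A rough sketch of the delegation decision: a sub-tree can be pushed down when every leaf comes from the same source and that source supports each operation in the sub-tree. The `Node` class, the source names, and the `SUPPORTED` table are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    op: str                        # "scan", "join", "union", "group_by", ...
    source: Optional[str] = None   # set for scans only
    children: list = field(default_factory=list)


# Hypothetical table of operations each source can execute.
SUPPORTED = {
    "warehouse": {"scan", "join", "union", "group_by"},
    "csv_files": {"scan"},
}


def delegable_source(node: Node) -> Optional[str]:
    """Return the single source that can run the whole sub-tree, or None."""
    if node.op == "scan":
        return node.source
    sources = {delegable_source(child) for child in node.children}
    if len(sources) != 1 or None in sources:
        return None                # children live in different sources
    (source,) = sources
    return source if node.op in SUPPORTED[source] else None


same_source_join = Node("join", children=[Node("scan", "warehouse"), Node("scan", "warehouse")])
mixed_join = Node("join", children=[Node("scan", "warehouse"), Node("scan", "csv_files")])
file_join = Node("join", children=[Node("scan", "csv_files"), Node("scan", "csv_files")])
```

Only `same_source_join` can be delegated as a whole; the other two joins would have to run in the virtualization layer.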
Automatic Query Rewriting
■ Objective – Rewrite query in a more optimal manner before the query is executed
■ Static optimization technique
■ Typical optimizations:
■ Simplify partitioned unions
■ Remove redundant sub-views
■ Transform outer joins to inner joins
■ Static join reordering to maximize delegation
Simplify Partitioned Unions
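Select * from Sales_Product where region='NA'
[Diagram: query tree for Sales_Product. A join (|><|) sits on top of two unions (U). One union combines Sales_NA (North America) and Sales_EMEA (EMEA); the other combines Product_NA (North America) and Product_EMEA (EMEA). Each of the four branches is a selection (S): region='NA' on the two North America branches and region='EMEA' on the two EMEA branches.]
Join cannot be delegated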
Simplify Partitioned Unions (Cont’d)
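Select * from Sales_Product where region='NA'
[Diagram: the simplified query tree. Only the North America branches remain: Sales_NA and Product_NA, each under a selection (S) region='NA' and a union (U), joined (|><|) at the top.]
Join can be delegated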
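The pruning step on these two slides can be sketched as follows. The tuples, the data source names `ds_north_america` and `ds_emea`, and the assumption that both North America views live in the same data source are illustrative guesses based on the slide labels.

```python
# (view name, region covered by the partition, hypothetical data source)
SALES = [
    ("Sales_NA", "NA", "ds_north_america"),
    ("Sales_EMEA", "EMEA", "ds_emea"),
]
PRODUCT = [
    ("Product_NA", "NA", "ds_north_america"),
    ("Product_EMEA", "EMEA", "ds_emea"),
]


def prune(branches: list, region: str) -> list:
    """Drop union branches whose partition condition contradicts the filter."""
    return [b for b in branches if b[1] == region]


def join_is_delegable(left: list, right: list) -> bool:
    """The join can be pushed down only if all remaining branches share one source."""
    sources = {b[2] for b in left + right}
    return len(sources) == 1


before = join_is_delegable(SALES, PRODUCT)
after = join_is_delegable(prune(SALES, "NA"), prune(PRODUCT, "NA"))
remaining = [b[0] for b in prune(SALES, "NA") + prune(PRODUCT, "NA")]
```

Before pruning, the join spans two sources; after pruning on region='NA', only `Sales_NA` and `Product_NA` remain, and the join can run in a single source.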
Transform Outer Joins to Inner Joins
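As a.iinc_id = c.pinc_id ∴ c.pinc_id cannot be null
[Diagram: query tree over Internet_inc (DS1), internet_inc (DS2) and phone_inc (DS3), with view aliases a, b and c. It combines a left outer join and an inner join (|><|) under a selection (S) with the condition a.iinc_id = c.pinc_id.]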
Transform Outer Joins to Inner Joins (Cont’d)
The left outer join is equivalent to an inner join
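The equivalence can be checked on a small example. The sketch below uses SQLite purely as a stand-in database; the table contents and the `customer` column are made up, and only the column names `iinc_id` and `pinc_id` come from the slide. Because the filter a.iinc_id = c.pinc_id rejects rows where c.pinc_id is null, the rows that only the outer join would add are discarded anyway.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE internet_inc (iinc_id INTEGER, customer TEXT);
    CREATE TABLE phone_inc (pinc_id INTEGER, customer TEXT);
    INSERT INTO internet_inc VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO phone_inc VALUES (1, 'ann'), (5, 'bob');
""")

# Left outer join followed by a filter that cannot be true for NULL pinc_id.
outer_sql = """
    SELECT a.iinc_id, c.pinc_id
    FROM internet_inc a
    LEFT OUTER JOIN phone_inc c ON a.customer = c.customer
    WHERE a.iinc_id = c.pinc_id
    ORDER BY a.iinc_id
"""

# Same query with the outer join rewritten as an inner join.
inner_sql = """
    SELECT a.iinc_id, c.pinc_id
    FROM internet_inc a
    INNER JOIN phone_inc c ON a.customer = c.customer
    WHERE a.iinc_id = c.pinc_id
    ORDER BY a.iinc_id
"""

outer_rows = conn.execute(outer_sql).fetchall()
inner_rows = conn.execute(inner_sql).fetchall()
```

Both queries return the same rows, which is why the optimizer is free to use the inner join, a form that usually leaves more room for reordering and delegation.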
Join Optimizations
■ Multiple Join options:
■ Merge
■ Nested
■ Nested Parallel
■ Hash
■ Optimizer automatically selects based on statistics and source capabilities
■ e.g. when joining two large datasets that come from databases, Merge Join is preferred
■ e.g. if one dataset is significantly larger, use Nested Join
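For readers less familiar with the join methods named above, here is a generic sketch of a hash join and a merge join over lists of dictionaries. These are the standard textbook algorithms, not Denodo's implementation; the sample rows are made up.

```python
def hash_join(left: list, right: list, key: str) -> list:
    """Build a hash index on the right input, then probe it with each left row."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [(l, r) for l in left for r in index.get(l[key], [])]


def merge_join(left: list, right: list, key: str) -> list:
    """Join two inputs that are ALREADY sorted by `key` in a single pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair the current left row with the whole run of equal right keys.
            j_run = j
            while j_run < len(right) and right[j_run][key] == lk:
                out.append((left[i], right[j_run]))
                j_run += 1
            i += 1   # j stays at the start of the run for duplicate left keys
    return out


left_rows = [{"id": 1, "l": "a"}, {"id": 2, "l": "b"}, {"id": 2, "l": "c"}]
right_rows = [{"id": 2, "r": "x"}, {"id": 2, "r": "y"}, {"id": 3, "r": "z"}]
hashed = hash_join(left_rows, right_rows, "id")
merged = merge_join(left_rows, right_rows, "id")
```

The merge join needs sorted inputs, which is why it suits sources that can order their own results; the hash join has no such requirement but must hold one input in memory.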
Data Movement
■ Typically used when one dataset is significantly smaller and aggregations are performed on the joined data
1. Execute query in DS1 and fetch its data
2. Create a temporary table in DS2 and insert data from step 1
3. When step 2 is completed, execute the JOIN in DS2 and return the results to the DV layer
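The three steps above can be sketched with two in-memory SQLite databases standing in for DS1 and DS2. The table names, columns, and data are invented for the example; a real deployment would use the actual sources and their drivers.

```python
import sqlite3

ds1 = sqlite3.connect(":memory:")   # stand-in for DS1 (small dataset)
ds2 = sqlite3.connect(":memory:")   # stand-in for DS2 (large dataset)
ds1.executescript("""
    CREATE TABLE product (id INTEGER, category TEXT);
    INSERT INTO product VALUES (1, 'books'), (2, 'games');
""")
ds2.executescript("""
    CREATE TABLE sales (product_id INTEGER, amount INTEGER);
    INSERT INTO sales VALUES (1, 10), (1, 5), (2, 7);
""")

# Step 1: execute the query in DS1 and fetch its data.
small = ds1.execute("SELECT id, category FROM product").fetchall()

# Step 2: create a temporary table in DS2 and insert the data from step 1.
ds2.execute("CREATE TEMP TABLE tmp_product (id INTEGER, category TEXT)")
ds2.executemany("INSERT INTO tmp_product VALUES (?, ?)", small)

# Step 3: run the join and the aggregation inside DS2, return only the result.
result = ds2.execute("""
    SELECT p.category, SUM(s.amount)
    FROM sales s JOIN tmp_product p ON s.product_id = p.id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

Only the small table and the aggregated result cross the network; the large table never leaves DS2.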
Query Plans
■ Optimizer calculates cost of multiple plans and selects ‘best’ plan
■ Cost estimates:
1. Traverse query tree top-down looking for ‘interesting’ patterns
• e.g. ‘GROUP BY region’ can execute faster if rows arrive ordered by ‘region’
2. Estimate costs of sub-queries on data sources
• Use source statistics and constraints
3. Traverse tree bottom-up to calculate costs for each node
• Choose execution with minimum cost
• Remember ‘interesting’ patterns (overall cost vs. node cost)
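The point about overall cost versus node cost can be illustrated with a toy example built on the GROUP BY case above. All cost figures and plan names are made up; the sketch only shows why the cheapest plan for a node is not always part of the cheapest overall plan.

```python
# Two hypothetical ways to read the same data, with invented costs.
SCAN_ALTERNATIVES = [
    {"name": "scan", "cost": 100, "ordered_by": None},
    {"name": "scan + ORDER BY region", "cost": 120, "ordered_by": "region"},
]


def group_by_cost(child: dict, field: str) -> int:
    """Grouping is cheaper when rows already arrive ordered by the grouping field."""
    return 10 if child["ordered_by"] == field else 50


def best_plan(field: str) -> tuple:
    """Cost every alternative bottom-up and keep the cheapest total."""
    plans = [
        (child["cost"] + group_by_cost(child, field), child["name"])
        for child in SCAN_ALTERNATIVES
    ]
    return min(plans)


cheapest_scan = min(SCAN_ALTERNATIVES, key=lambda p: p["cost"])["name"]
total, chosen = best_plan("region")
```

The unordered scan is the cheapest node in isolation, but the ordered scan wins overall because it makes the GROUP BY cheaper. That is the reason for remembering 'interesting' orderings while traversing the tree.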
Other Optimization Techniques
■ Asynchronous Multi-threaded Processing
■ Execute multiple queries in parallel
■ Server Throttling Mechanisms
■ Controls to limit concurrency
■ Waiting queues for inbound connections
■ Connection pools for data sources
■ Swapping data to disk to handle large datasets
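A concurrency limit with a waiting queue can be sketched with a fixed-size thread pool. The limit of 2, the `run_query` function, and the sleep that stands in for real work are assumptions for the example, not Denodo settings.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 2                       # hypothetical concurrency limit
state = {"active": 0, "peak": 0}         # tracks how many queries run at once
lock = threading.Lock()


def run_query(query_id: int) -> int:
    """Pretend to execute a query while recording the concurrency level."""
    with lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.01)                     # stand-in for real query work
    with lock:
        state["active"] -= 1
    return query_id


# Six queries arrive; at most MAX_CONCURRENT run, the rest wait in the pool's queue.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    results = list(pool.map(run_query, range(6)))
```

All six queries complete, but the recorded peak never exceeds the configured limit.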
Caching
■ Caching – for slow sources and to protect operational data sources
■ Caching enabled at view level
■ Enables mixed mode query plans
■ Caching modes
■ Full – all data in cache
■ Partial – query-by-query
■ Manual refresh or automated refresh
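The two caching modes can be contrasted in a few lines. The `FullCache` and `PartialCache` classes, and the fake source they wrap, are hypothetical and greatly simplified; they only illustrate "all data in cache" versus "query-by-query".

```python
SOURCE_ROWS = [("NA", 10), ("EMEA", 20), ("NA", 5)]   # made-up source data
source_calls = []                                      # records every source hit


def load_all() -> list:
    source_calls.append("full load")
    return list(SOURCE_ROWS)


def fetch(region: str) -> list:
    source_calls.append(f"query {region}")
    return [row for row in SOURCE_ROWS if row[0] == region]


class FullCache:
    """Full mode: the whole view is loaded once and every query is answered from it."""
    def __init__(self):
        self.rows = load_all()

    def refresh(self):
        self.rows = load_all()

    def query(self, region: str) -> list:
        return [row for row in self.rows if row[0] == region]


class PartialCache:
    """Partial mode: results are cached query by query; misses go to the source."""
    def __init__(self):
        self.entries = {}

    def refresh(self):
        self.entries.clear()

    def query(self, region: str) -> list:
        if region not in self.entries:
            self.entries[region] = fetch(region)
        return self.entries[region]


full, partial = FullCache(), PartialCache()
full_na = full.query("NA")
first = partial.query("NA")    # miss: goes to the source
second = partial.query("NA")   # hit: served from the cache
```

In this run the source is hit once for the full load and once for the first partial query; the repeated partial query is served from the cache.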
Data Virtualization – Next Steps
Move forward at your own pace
■ Download Denodo Express – The fastest way to Data Virtualization
■ Denodo Community: Documents, Videos, Tutorials, more.
■ Attend Architect-to-Architect Series: Five In-depth Technology and Architecture Sessions on Data Virtualization
■ Performance
■ Scalability
■ Data Discovery and Governance
■ Security
Move forward with one of our Data Virtualization experts
■ Phone: (+1) 877-556-2531 (NA)
■ Phone: (+44) (0)20 7869 8053 (EMEA)
■ Email: [email protected] | www.denodo.com
Thank You!
Next Session: Session 3
Denodo Platform: Scalability