Vertica MPP Columnar DBMS
Uploaded by zvika-gutkin
Agenda
• What is Vertica?
• How does it work?
• How to use Vertica (the right way).
• Where it falls short.
• Examples.
MPP-Columnar DBMS
10x–100x the performance of a classic RDBMS.
Linear Scale
SQL
Commodity Hardware
Built-in fault tolerance
10x–100x the performance of a classic RDBMS
• Column-store architecture
• High compression rates
• Sorted columns
• Object segmentation/replication
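Why sorted columns and high compression rates go together can be seen with a toy run-length encoder. This is a plain Python sketch, not Vertica code; `rle_encode` and the sample values are made up for illustration.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

# Hypothetical low-cardinality column: 12 rows, 3 distinct values.
unsorted_column = [7, 3, 7, 5, 3, 7, 5, 3, 7, 5, 3, 7]
sorted_column = sorted(unsorted_column)

unsorted_runs = rle_encode(unsorted_column)
sorted_runs = rle_encode(sorted_column)

print(len(unsorted_runs), "runs when stored in arrival order")
print(len(sorted_runs), "runs when stored sorted:", sorted_runs)
```

In arrival order no two neighbours match, so every row becomes its own run; once sorted, the same column collapses to one run per distinct value.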
How Does It Work?
Tuple Mover
Delete
• Deleted rows are only marked as deleted.
• The marks are stored in a delete vector on disk.
• Queries merge the ROS with the delete vector to filter out deleted records.
• The data itself is removed asynchronously during mergeout.
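The delete flow can be modelled in a few lines. This is a toy sketch under simplifying assumptions (one in-memory list standing in for a ROS container, a set of row positions standing in for the delete vector); `ToyStore` and its methods are hypothetical names, not Vertica APIs.

```python
class ToyStore:
    """Toy model: immutable row storage plus a delete vector of row positions."""

    def __init__(self, rows):
        self.rows = list(rows)        # stands in for a ROS container
        self.delete_vector = set()    # positions marked as deleted

    def delete(self, predicate):
        # Deleting only records positions; the stored rows are untouched.
        for pos, row in enumerate(self.rows):
            if predicate(row):
                self.delete_vector.add(pos)

    def scan(self):
        # A query merges storage with the delete vector to hide deleted rows.
        return [row for pos, row in enumerate(self.rows)
                if pos not in self.delete_vector]

    def mergeout(self):
        # Later, a background step rewrites storage without the deleted rows.
        self.rows = self.scan()
        self.delete_vector.clear()


store = ToyStore([("a", 1), ("b", 2), ("c", 3)])
store.delete(lambda row: row[1] == 2)
print(len(store.rows), store.scan())   # deleted row is hidden but still stored
store.mergeout()
print(len(store.rows), store.scan())   # deleted row is physically gone
```

The point of the model: a delete is cheap at write time, every query pays a small filtering cost until mergeout runs, and only mergeout reclaims space.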
Projections
• Physical structure of the (logical) table
• Stored sorted and compressed
• Internal maintenance
• At least one (super) projection per table
• Projection types:
– Super projection
– Query-specific projection
– Pre-join projection
– Buddy projection
Projections
What's Important…
• Choose the right columns (general vs. specific).
• Choose the right sort order.
• Choose the right encoding.
• Choose the right column to partition by.
• Choose the right column to segment by.
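To see what choosing a segmentation column means in practice, here is a minimal sketch of hash segmentation. It assumes a hypothetical 3-node cluster and uses CRC32 as a stand-in hash; Vertica's actual hash function and placement logic are not shown in this deck.

```python
import zlib
from collections import defaultdict

def node_for(key, node_count):
    """Pick a node from a stable hash of the segmentation key."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count

def segment(rows, key_index, node_count):
    """Distribute rows across nodes by hashing one column."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_index], node_count)].append(row)
    return nodes

# Hypothetical (account_id, session_id) rows.
rows = [(101, "s1"), (102, "s2"), (101, "s3"), (103, "s4"), (102, "s5")]
placement = segment(rows, key_index=0, node_count=3)

for node, node_rows in sorted(placement.items()):
    print(node, node_rows)
```

All rows that share a key land on the same node, so a group by or join on that key can run locally without moving rows between nodes.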
Where It Falls Short…
• Lack of features.
• Good for specific types of queries:
– Keep queries simple.
– Use the right columns.
– Use ORDER BY to help the optimizer pick the right projection.
– Check the join column: best if both tables are sorted by it.
– Check the join column: best if the tables are segmented by it.
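One reason a join column that both tables are sorted by helps: sorted inputs can be joined in a single forward pass, with no hash table to build. The sketch below is a generic merge join, not Vertica's implementation; it assumes the right-hand (dimension) keys are unique, and all names and rows are hypothetical.

```python
def merge_join(left, right, left_key, right_key):
    """Join two inputs that are already sorted on their join keys.

    Simplifying assumption: keys on the right side are unique.
    """
    joined = []
    j = 0
    for left_row in left:
        # Advance the right cursor; it never moves backwards.
        while j < len(right) and right_key(right[j]) < left_key(left_row):
            j += 1
        if j < len(right) and right_key(right[j]) == left_key(left_row):
            joined.append(left_row + right[j][1:])
    return joined

# Hypothetical fact rows (date_id, session) and dimension rows (date_id, week).
fact = [(1, "s1"), (1, "s2"), (2, "s3"), (4, "s4")]
dim = [(1, "w35"), (2, "w35"), (3, "w36"), (4, "w36")]

result = merge_join(fact, dim, left_key=lambda r: r[0], right_key=lambda r: r[0])
print(result)
```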
Choose the Right sort order Example
select
a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
count(distinct a11.VS_LP_SESSION_ID) AS Visits,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from lp_15744040.FACT_VISIT_ROOM a11
group by
a11.LP_ACCOUNT_ID;
First projection…
table_name       projection_name      projection_column_name  column_position  sort_position
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VS_LP_SESSION_ID        0                0
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  LP_ACCOUNT_ID           1                1
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VS_LP_VISITOR_ID        2                2
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VISIT_FROM_DT_TRUNC     3                3
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  ACCOUNT_ID              4                4
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  ROOM_ID                 5                5
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VISIT_FROM_DT_ACTUAL    6                6
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VISIT_TO_DT_ACTUAL      7                7
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  HOT_LEAD_IND            8                8
Access Path:
+-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a11.LP_ACCOUNT_ID
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 7M, Rows: 10K] (PATH ID: 2)
| | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3)
| | | Projection: lp_15744040.FACT_VISIT_ROOM_bad
| | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
Second projection…
table_name       projection_name       projection_column_name  column_position  sort_position
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  LP_ACCOUNT_ID           0                0
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VS_LP_SESSION_ID        1                1
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VS_LP_VISITOR_ID        2                2
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_FROM_DT_TRUNC     3                3
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  ACCOUNT_ID              4                4
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  ROOM_ID                 5                5
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_FROM_DT_ACTUAL    6                6
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_TO_DT_ACTUAL      7                7
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  HOT_LEAD_IND            8                8
Access Path:
+-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a11.LP_ACCOUNT_ID
| +---> GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 2)
| | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3)
| | | Projection: lp_15744040.FACT_VISIT_ROOM_fix1
| | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
Results …
Elapsed time, first projection (GROUPBY HASH (SORT OUTPUT)):
Time: First fetch (7 rows): 264527.916 ms. All rows formatted: 264527.978 ms
Elapsed time, second projection (GROUPBY PIPELINED):
Time: First fetch (7 rows): 38913.909 ms. All rows formatted: 38913.965 ms
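The gap between GROUPBY HASH and GROUPBY PIPELINED comes down to how much state the aggregation must hold. The sketch below contrasts the two strategies for a count(distinct) per account; it is a generic illustration in Python with made-up rows, not a reproduction of Vertica's operators.

```python
from collections import defaultdict

def count_distinct_hash(rows):
    """Input order is arbitrary, so every group's set stays in memory."""
    seen = defaultdict(set)
    for account_id, session_id in rows:
        seen[account_id].add(session_id)
    return {account_id: len(sessions) for account_id, sessions in seen.items()}

def count_distinct_pipelined(sorted_rows):
    """Input is sorted by (account_id, session_id): one pass, constant state."""
    results = []
    current_account, previous_session, count = None, None, 0
    for account_id, session_id in sorted_rows:
        if account_id != current_account:
            if current_account is not None:
                results.append((current_account, count))  # group is complete
            current_account, previous_session, count = account_id, None, 0
        if session_id != previous_session:
            count += 1
            previous_session = session_id
    if current_account is not None:
        results.append((current_account, count))
    return results

# Hypothetical (LP_ACCOUNT_ID, VS_LP_SESSION_ID) pairs in arrival order.
rows = [(2, "b"), (1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "a"), (2, "e")]

print(count_distinct_hash(rows))
print(count_distinct_pipelined(sorted(rows)))
```

Here `sorted(rows)` plays the role of the second projection: when the data is already stored in group-by order, the sort is free and each group can be emitted as soon as the next one starts.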
Join Example
select a12.DT_WEEK AS DT_WEEK,
a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
count(distinct a11.VS_LP_SESSION_ID) AS Visits,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT a11
join zzz.DIM_DATE_TIME a12
on (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
where (a11.LP_ACCOUNT_ID in ('57386690')
and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by a12.DT_WEEK,
a11.LP_ACCOUNT_ID
Filter: LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC
Group By: DT_WEEK, LP_ACCOUNT_ID
Join: VISIT_FROM_DT_TRUNC, DATE_TIME_ID
Select: DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID
Full Explain Plan…
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
| | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | | Execute on: All Nodes
| | | +-- Outer -> STORAGE ACCESS for a11 [Cost: 421K, Rows: 372M (NO STATISTICS)] (PATH ID: 4)
| | | | Projection: zzz.FACT_VISIT_b0
| | | | Materialize: a11.VISIT_FROM_DT_TRUNC
| | | | Filter: (a11.LP_ACCOUNT_ID = '57386690')
| | | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
| | | | Execute on: All Nodes
| | | +-- Inner -> STORAGE ACCESS for a12 [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
| | | | Projection: zzz.DIM_DATE_TIME_node0004
| | | | Materialize: a12.DATE_TIME_ID, a12.DT_WEEK
| | | | Filter: ((a12.DATE_TIME_ID >= '2011-09-01 15:28:00'::timestamp) AND (a12.DATE_TIME_ID <= '2011-12-31 12:52:50'::timestamp))
| | | | Execute on: All Nodes
Explain Plan (excerpt)…
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)
| | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)
| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | | Execute on: All Nodes
Time: First fetch (6 rows): 56654.894 ms. All rows formatted: 56654.988 ms
Solution One - Functions
select week(a11.VISIT_FROM_DT_TRUNC) AS DT_WEEK,
a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
count(distinct a11.VS_LP_SESSION_ID) AS Visits,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT a11
where (a11.LP_ACCOUNT_ID in ('57386690')
and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by week(a11.VISIT_FROM_DT_TRUNC),
a11.LP_ACCOUNT_ID;
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 127, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: <SVAR>, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 126, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)
| | Group By: (date_part('week', a11.VISIT_FROM_DT_TRUNC))::int, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 125, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | Projection: zzz.FACT_VISIT_b0
Time: First fetch (6 rows): 33453.997 ms. All rows formatted: 33454.154 ms
Saved the Join Time
Solution Two - PreJoin Projection
Pros
• Eliminates join overhead
• Maintained by Vertica
Cons
• Not flexible
• Causes overhead on load
• Needs primary/foreign keys
• Maintenance restrictions
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 12K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT visit_date_time_prejoin8_b0.VS_LP_SESSION_ID)
| Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 11K, Rows: 10K] (PATH ID: 2)
| | Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID, visit_date_time_prejoin8_b0.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for <No Alias> [Cost: 8K, Rows: 1M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin8_b0
Solution Two - PreJoin Projection
order by LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC, DT_WEEK, HOT_LEAD_IND, DATE_TIME_ID, VS_LP_SESSION_ID
Time: First fetch (6 rows): 35312.331 ms. All rows formatted: 35312.421 ms
Saved the Join Time
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 542K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT visit_date_time_prejoin_z6.VS_LP_SESSION_ID)
| Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY PIPELINED [Cost: 542K, Rows: 10K] (PATH ID: 2)
| | Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.VS_LP_SESSION_ID, visit_date_time_prejoin_z6.LP_ACCOUNT_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for <No Alias> [Cost: 501K, Rows: 15M] (PATH ID: 3)
| | | Projection: lp_15744040.visit_date_time_prejoin_z6
| |
Solution Two - PreJoin Projection
Sorted by DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID
Time: First fetch (6 rows): 3680.853 ms. All rows formatted: 3680.969 ms
Saved the Join Time and Group by hash Time
Solution Three - Denormalize
select DT_WEEK,
a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,
count(distinct a11.VS_LP_SESSION_ID) AS Visits,
(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1
from zzz.FACT_VISIT_Z1 a11
where (a11.LP_ACCOUNT_ID in ('57386690')
and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')
group by DT_WEEK,
a11.LP_ACCOUNT_ID;
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| | Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 2M, Rows: 372M (NO STATISTICS)] (PATH ID: 3)
| | | Projection: zzz.FACT_VISIT_Z1_super
Time: First fetch (6 rows): 33885.178 ms. All rows formatted: 33885.253 ms
Saved the Join Time
Solution Three - Denormalize
• Changing the projection sort order
Access Path:
+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 588K, Rows: 10K] (PATH ID: 1)
| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)
| Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID
| Execute on: All Nodes
| +---> GROUPBY PIPELINED [Cost: 587K, Rows: 10K] (PATH ID: 2)
| | Group By: a11.DT_WEEK, a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for a11 [Cost: 531K, Rows: 20M] (PATH ID: 3)
| | | Projection: zzz.fact_visit_z1_pipe
| | | Materialize: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID
| | | Filter: (a11.LP_ACCOUNT_ID = '57386690')
| | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))
| | | Execute on: All Nodes
Time: First fetch (6 rows): 4313.497 ms. All rows formatted: 4313.600 ms
Saved the Join Time and Group by hash Time
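To put the join-example timings side by side, the short script below computes each variant's speedup over the original join query. The millisecond values are the "first fetch" times reported above; the labels are shorthand introduced here, and the comparison assumes all runs were measured on comparable data and hardware, which the deck does not state.

```python
# Elapsed "first fetch" times in milliseconds, copied from the results above.
baseline_ms = 56654.894  # original query with the hash join

timings_ms = {
    "week() function instead of join": 33453.997,
    "pre-join projection, first sort order": 35312.331,
    "pre-join projection, sorted for the group by": 3680.853,
    "denormalized table, super projection": 33885.178,
    "denormalized table, sorted for the group by": 4313.497,
}

speedups = {name: baseline_ms / ms for name, ms in timings_ms.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster than the join")
```

Removing the join alone gives well under a 2x gain in every variant; the order-of-magnitude gains appear only when the sort order also lets the group by run pipelined.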
Let's sum it up…
• Keep it simple.
• Keep it sorted.
• Keep it joinless.
Questions?
Thank You