Vertica mpp columnar dbms

Vertica

Zvika Gutkin

DB Expert

[email protected]

Agenda

• What is Vertica.

• How does it work.

• How To Use Vertica … (The Right Way ).

• Where It Falls Short.

• Examples …

MPP-Columnar DBMS

10x –100x performance of classic RDBMS.

Linear Scale

SQL

Commodity Hardware

Built-in fault tolerance

10x –100x performance of classic RDBMS

• Column store architecture

• High Compression rates

• Sorted columns

• Objects Segmentation/Replication.

How Does It Work ?

Tuple Mover

Delete

• Deleted rows are only marked as deleted.

• Stored in delete vector on disk.

• Query merge the ROS and Deleted vector to remove deleted records.

• Data is removed asynchronously during mergeout.

Projections

• Physical structure of the table (logical)

• Stored sorted and compressed

• Internal maintenance

• At least one (super) projection.

• Projection Types: – Super projection

– Query specific projection

– Pre join projection

– Buddy projection

Projections

What‘s Important ….

• Choose the right columns (General Vs Specific).

• Choose the right sort order .

• Choose the right encoding .

• Choose the right column to partition by .

• Choose the right column to segment by .

Where It Falls Short …

• Lack of Features .

• Good for specific types of queries .

– Keep Queries Simple .

– Use the right columns

– Use Order By to help optimizer pick the right projection

– Check the join column – Best if both tables order by it .

– Check the join column – best if segmented by it.

Choose the Right sort order Example

select

a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,

count(distinct a11.VS_LP_SESSION_ID) AS Visits,

(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1

from lp_15744040.FACT_VISIT_ROOM a11

group by

a11.LP_ACCOUNT_ID;

First projection …. table_name projection_name projection_column_name column_position sort_position

FACT_VISIT_ROOM FACT_VISIT_ROOM_bad VS_LP_SESSION_ID 0 0

FACT_VISIT_ROOM FACT_VISIT_ROOM_bad LP_ACCOUNT_ID 1 1

FACT_VISIT_ROOM FACT_VISIT_ROOM_bad VS_LP_VISITOR_ID 2 2

FACT_VISIT_ROOM FACT_VISIT_ROOM_bad VISIT_FROM_DT_TRUNC 3 3

FACT_VISIT_ROOM FACT_VISIT_ROOM_bad ACCOUNT_ID 4 4

FACT_VISIT_ROOM FACT_VISIT_ROOM_bad ROOM_ID 5 5

FACT_VISIT_ROOM FACT_VISIT_ROOM_bad VISIT_FROM_DT_ACTUAL 6 6

FACT_VISIT_ROOM FACT_VISIT_ROOM_bad VISIT_TO_DT_ACTUAL 7 7

FACT_VISIT_ROOM FACT_VISIT_ROOM_bad HOT_LEAD_IND 8 8

Access Path: +-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1) | Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID) | Group By: a11.LP_ACCOUNT_ID | +---> GROUPBY HASH (SORT OUTPUT) [Cost: 7M, Rows: 10K] (PATH ID: 2) | | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID | | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3) | | | Projection: lp_15744040.FACT_VISIT_ROOM_bad | | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

Second projection … table_name projection_name projection_column_name column_position sort_position

FACT_VISIT_ROOM FACT_VISIT_ROOM_fix1 LP_ACCOUNT_ID 0 0

FACT_VISIT_ROOM FACT_VISIT_ROOM_fix1 VS_LP_SESSION_ID 1 1

FACT_VISIT_ROOM FACT_VISIT_ROOM_fix1 VS_LP_VISITOR_ID 2 2

FACT_VISIT_ROOM FACT_VISIT_ROOM_fix1 VISIT_FROM_DT_TRUNC 3 3

FACT_VISIT_ROOM FACT_VISIT_ROOM_fix1 ACCOUNT_ID 4 4

FACT_VISIT_ROOM FACT_VISIT_ROOM_fix1 ROOM_ID 5 5

FACT_VISIT_ROOM FACT_VISIT_ROOM_fix1 VISIT_FROM_DT_ACTUAL 6 6

FACT_VISIT_ROOM FACT_VISIT_ROOM_fix1 VISIT_TO_DT_ACTUAL 7 7

FACT_VISIT_ROOM FACT_VISIT_ROOM_fix1 HOT_LEAD_IND 8 8

Access Path: +-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1) | Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID) | Group By: a11.LP_ACCOUNT_ID | +---> GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 2) | | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID | | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3) | | | Projection: lp_15744040.FACT_VISIT_ROOM_fix1 | | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

Results …

Elapsed Time First projection GROUPBY HASH (SORT OUTPUT)

Time: First fetch (7 rows): 264527.916 ms. All rows formatted: 264527.978 ms

Elapsed Time Second projection GROUPBY PIPELINED


Join Example select a12.DT_WEEK AS DT_WEEK,




from zzz.FACT_VISIT a11

join zzz.DIM_DATE_TIME a12

on (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)

where (a11.LP_ACCOUNT_ID in ('57386690')

and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')

group by a12.DT_WEEK,

a11.LP_ACCOUNT_ID

Filter : LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC Group By : DT_WEEK , LP_ACCOUNT_ID Join: VISIT_FROM_DT_TRUNC , DATE_TIME_ID Select : DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID

Full Explain Plan… Access Path:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)

| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)

| Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID

| Execute on: All Nodes

| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)

| | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

| | Execute on: All Nodes

| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)

| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)

| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID

| | | Execute on: All Nodes

| | | +-- Outer -> STORAGE ACCESS for a11 [Cost: 421K, Rows: 372M (NO STATISTICS)] (PATH ID: 4)

| | | | Projection: zzz.FACT_VISIT_b0

| | | | Materialize: a11.VISIT_FROM_DT_TRUNC

| | | | Filter: (a11.LP_ACCOUNT_ID = '57386690')

| | | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))

| | | | Execute on: All Nodes

| | | +-- Inner -> STORAGE ACCESS for a12 [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 5)

| | | | Projection: zzz.DIM_DATE_TIME_node0004

| | | | Materialize: a12.DATE_TIME_ID, a12.DT_WEEK

| | | | Filter: ((a12.DATE_TIME_ID >= '2011-09-01 15:28:00'::timestamp) AND (a12.DATE_TIME_ID <= '2011-12-31 12:52:50'::timestamp))

| | | | Execute on: All Nodes

Explain Plan (substract)… Access Path:l

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)

| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)

| Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID


| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)

| | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID


| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISlTICS)] (PATH ID: 3)

| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)

| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID

| | | Execute on: All Nodes


Solution one - Functions select week(a11.VISIT_FROM_DT_TRUNC) AS DT_WEEK,




from zzz.FACT_VISIT a11



group by week(a11.VISIT_FROM_DT_TRUNC),

a11.LP_ACCOUNT_ID;

Access Path: +-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 127, Rows: 1 (STALE STATISTICS)] (PATH ID: 1) | Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID) | Group By: <SVAR>, a11.LP_ACCOUNT_ID | Execute on: All Nodes | +---> GROUPBY HASH (SORT OUTPUT) [Cost: 126, Rows: 1 (STALE STATISTICS)] (PATH ID: 2) | | Group By: (date_part('week', a11.VISIT_FROM_DT_TRUNC))::int, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID | | Execute on: All Nodes | | +---> STORAGE ACCESS for a11 [Cost: 125, Rows: 1 (STALE STATISTICS)] (PATH ID: 3) | | | Projection: zzz.FACT_VISIT_b0 Time: First fetch (6 rows): 33453.997 ms. All rows formatted: 33454.154 ms

Saved the Join Time

Solution Two- PreJoin Projection

Pros

• Eliminate Join overhead

• Maintain By Vertica

Cons

• Not Flexible

• Cause Overhead on Load

• Need Primary/Foreign Key

• Maintenance Restrictions

Access Path:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 12K, Rows: 10K] (PATH ID: 1)

| Aggregates: count(DISTINCT visit_date_time_prejoin8_b0.VS_LP_SESSION_ID)

| Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID


| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 11K, Rows: 10K] (PATH ID: 2)

| | Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID, visit_date_time_prejoin8_b0.VS_LP_SESSION_ID


| | +---> STORAGE ACCESS for <No Alias> [Cost: 8K, Rows: 1M] (PATH ID: 3)

| | | Projection: lp_15744040.visit_date_time_prejoin8_b0

Solution Two- PreJoin Projection order by LP_ACCOUNT_ID,VISIT_FROM_DT_TRUNC,DT_WEEK,HOT_LEAD_IND,DATE_TIME_ID,VS_LP_SESSION_ID

Time: First fetch (6 rows): 35312.331 ms. All rows formatted: 35312.421 ms Saved the Join Time

Access Path:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 542K, Rows: 10K] (PATH ID: 1)

| Aggregates: count(DISTINCT visit_date_time_prejoin_z6.VS_LP_SESSION_ID)

| Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.LP_ACCOUNT_ID


| +---> GROUPBY PIPELINED [Cost: 542K, Rows: 10K] (PATH ID: 2)

| | Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.VS_LP_SESSION_ID, visit_date_time_prejoin_z6.LP_ACCOUNT_ID


| | +---> STORAGE ACCESS for <No Alias> [Cost: 501K, Rows: 15M] (PATH ID: 3)

| | | Projection: lp_15744040.visit_date_time_prejoin_z6

| |

Solution Two- PreJoin Projection Sorted By DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID

Time: First fetch (6 rows): 3680.853 ms. All rows formatted: 3680.969 ms Saved the Join Time and Group by hash Time

Solution Three - Denormalize select DT_WEEK,




from zzz.FACT_VISIT_Z1 a11



group by DT_WEEK,

a11.LP_ACCOUNT_ID;

Access Path: +-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 1) | Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID) | Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID | Execute on: All Nodes | +---> GROUPBY HASH (SORT OUTPUT) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 2) | | Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID | | Execute on: All Nodes | | +---> STORAGE ACCESS for a11 [Cost: 2M, Rows: 372M (NO STATISTICS)] (PATH ID: 3) | | | Projection: zzz.FACT_VISIT_Z1_super

Time: First etch (6 rows): 33885.178 ms. All rows formatted: 33885.253 ms Saved the Join Time

• Changing the projection sort order

Solution Three - Denormalize

Access Path: +-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 588K, Rows: 10K] (PATH ID: 1) | Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID) | Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID | Execute on: All Nodes | +---> GROUPBY PIPELINED [Cost: 587K, Rows: 10K] (PATH ID: 2) | | Group By: a11.DT_WEEK, a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID | | Execute on: All Nodes | | +---> STORAGE ACCESS for a11 [Cost: 531K, Rows: 20M] (PATH ID: 3) | | | Projection: zzz.fact_visit_z1_pipe | | | Materialize: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID | | | Filter: (a11.LP_ACCOUNT_ID = '57386690') | | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp)) | | | Execute on: All Nodes Time: First fetch (6 rows): 4313.497 ms. All rows formatted: 4313.600 ms

Saved the Join Time and Group by hash Time

•Keep it simple

•Keep it sorted.

•Keep it joinless

Let’s sum it up…

Questions ?

Thank You

Vertica mpp columnar dbms

Documents

Transcript of Vertica mpp columnar dbms