Vertica mpp columnar dbms

Vertica Zvika Gutkin DB Expert [email protected]

Introduction to the Vertica database.

Transcript of Vertica mpp columnar dbms

Page 1: Vertica mpp columnar dbms

Vertica

Zvika Gutkin

DB Expert

[email protected]

Page 2: Vertica mpp columnar dbms

Agenda

• What is Vertica.

• How does it work.

• How to use Vertica … (the right way).

• Where It Falls Short.

• Examples …

Page 3: Vertica mpp columnar dbms

MPP-Columnar DBMS

10x–100x the performance of a classic RDBMS.

Linear Scale

SQL

Commodity Hardware

Built-in fault tolerance

Page 4: Vertica mpp columnar dbms

10x–100x the performance of a classic RDBMS

• Column store architecture

• High Compression rates

• Sorted columns

• Objects Segmentation/Replication.
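The link between sorted columns and high compression rates in the bullets above can be illustrated with run-length encoding. This is a minimal sketch, not Vertica's actual encoder; the column values and the `rle_encode` helper are made up for illustration.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

# Hypothetical account-id column: once in arrival order, once sorted.
unsorted_column = [3, 1, 3, 2, 1, 3, 2, 1, 3, 1]
sorted_column = sorted(unsorted_column)

unsorted_runs = rle_encode(unsorted_column)
sorted_runs = rle_encode(sorted_column)

print(len(unsorted_runs))  # one run per value: nothing is saved
print(sorted_runs)         # one run per distinct value
```

On the sorted column the encoder stores one pair per distinct value, which is why a low-cardinality column placed early in the sort order compresses so well.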

Page 5: Vertica mpp columnar dbms

How Does It Work ?

Page 6: Vertica mpp columnar dbms

Tuple Mover

Page 7: Vertica mpp columnar dbms

Delete

• Deleted rows are only marked as deleted.

• Stored in delete vector on disk.

• Queries merge the ROS and the delete vector to filter out deleted records.

• Data is removed asynchronously during mergeout.
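The delete flow described above can be sketched as a toy model. The `RosContainer` class and its method names are hypothetical and only mirror the four bullets: rows are marked, the marks live in a delete vector, scans filter on it, and a later mergeout rewrites the storage.

```python
class RosContainer:
    """Toy read-optimized store: stored rows plus a delete vector."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.delete_vector = set()  # positions marked as deleted

    def delete_where(self, predicate):
        # DELETE only records positions; the stored rows are left untouched.
        for position, row in enumerate(self.rows):
            if predicate(row):
                self.delete_vector.add(position)

    def scan(self):
        # Queries merge the rows with the delete vector at read time.
        return [row for position, row in enumerate(self.rows)
                if position not in self.delete_vector]

    def mergeout(self):
        # Background step: rewrite storage without the deleted rows.
        self.rows = self.scan()
        self.delete_vector.clear()


container = RosContainer([("a", 1), ("b", 2), ("c", 3)])
container.delete_where(lambda row: row[0] == "b")
print(len(container.rows), container.scan())  # still 3 stored rows, 2 visible
```

Until `mergeout()` runs, every scan pays the cost of checking the delete vector, which is one reason heavy delete workloads are a poor fit for this design.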

Page 8: Vertica mpp columnar dbms

Projections

• The physical structure behind a (logical) table.

• Stored sorted and compressed

• Internal maintenance

• At least one (super) projection.

• Projection Types:

– Super projection

– Query specific projection

– Pre join projection

– Buddy projection

Page 9: Vertica mpp columnar dbms

Projections

Page 10: Vertica mpp columnar dbms

What's Important …

• Choose the right columns (general vs. specific).

• Choose the right sort order.

• Choose the right encoding.

• Choose the right column to partition by.

• Choose the right column to segment by.

Page 11: Vertica mpp columnar dbms

Where It Falls Short …

• Lack of features.

• Good for specific types of queries.

– Keep queries simple.

– Use the right columns.

– Use ORDER BY to help the optimizer pick the right projection.

– Check the join column: best if both tables are sorted by it.

– Check the join column: best if both tables are segmented by it.
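Why segmenting both tables by the join column helps can be sketched as follows. This is an assumption-laden toy: the three-node cluster, the modulo stand-in for the segmentation hash, and the `visits`/`accounts` data are all invented. The point is only that matching keys land on the same node, so each node can join its own slice without moving rows over the network.

```python
NODE_COUNT = 3  # hypothetical cluster size

def node_for(key):
    # Stand-in for a segmentation hash; any deterministic function works here.
    return key % NODE_COUNT

def segment(rows, key_index):
    """Distribute rows across nodes by hashing one column."""
    nodes = [[] for _ in range(NODE_COUNT)]
    for row in rows:
        nodes[node_for(row[key_index])].append(row)
    return nodes

def local_join(node_visits, node_accounts):
    """Join one node's slice of both tables; no other node is consulted."""
    lookup = dict(node_accounts)
    return [(account, session, lookup[account])
            for account, session in node_visits if account in lookup]

visits = [(101, "s1"), (102, "s2"), (101, "s3"), (104, "s4")]  # (account_id, session)
accounts = [(101, "gold"), (102, "free"), (104, "gold")]       # (account_id, plan)

visit_nodes = segment(visits, 0)      # both tables segmented by account_id
account_nodes = segment(accounts, 0)

joined = [row
          for n in range(NODE_COUNT)
          for row in local_join(visit_nodes[n], account_nodes[n])]
print(sorted(joined))
```

If the two tables were segmented by different columns, the per-node lookups would miss rows stored elsewhere, and the rows would have to be redistributed before joining.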

Page 12: Vertica mpp columnar dbms

Choose the Right sort order Example

select

a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,

count(distinct a11.VS_LP_SESSION_ID) AS Visits,

(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1

from lp_15744040.FACT_VISIT_ROOM a11

group by

a11.LP_ACCOUNT_ID;

Page 13: Vertica mpp columnar dbms

First projection …

table_name       projection_name      projection_column_name  column_position  sort_position
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VS_LP_SESSION_ID        0                0
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  LP_ACCOUNT_ID           1                1
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VS_LP_VISITOR_ID        2                2
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VISIT_FROM_DT_TRUNC     3                3
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  ACCOUNT_ID              4                4
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  ROOM_ID                 5                5
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VISIT_FROM_DT_ACTUAL    6                6
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  VISIT_TO_DT_ACTUAL      7                7
FACT_VISIT_ROOM  FACT_VISIT_ROOM_bad  HOT_LEAD_IND            8                8

Access Path:

+-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1)

| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)

| Group By: a11.LP_ACCOUNT_ID

| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 7M, Rows: 10K] (PATH ID: 2)

| | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

| | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3)

| | | Projection: lp_15744040.FACT_VISIT_ROOM_bad

| | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

Page 14: Vertica mpp columnar dbms

Second projection …

table_name       projection_name       projection_column_name  column_position  sort_position
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  LP_ACCOUNT_ID           0                0
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VS_LP_SESSION_ID        1                1
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VS_LP_VISITOR_ID        2                2
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_FROM_DT_TRUNC     3                3
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  ACCOUNT_ID              4                4
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  ROOM_ID                 5                5
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_FROM_DT_ACTUAL    6                6
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  VISIT_TO_DT_ACTUAL      7                7
FACT_VISIT_ROOM  FACT_VISIT_ROOM_fix1  HOT_LEAD_IND            8                8

Access Path:

+-GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 1)

| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)

| Group By: a11.LP_ACCOUNT_ID

| +---> GROUPBY PIPELINED [Cost: 7M, Rows: 10K] (PATH ID: 2)

| | Group By: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

| | +---> STORAGE ACCESS for a11 [Cost: 5M, Rows: 199M] (PATH ID: 3)

| | | Projection: lp_15744040.FACT_VISIT_ROOM_fix1

| | | Materialize: a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

Page 15: Vertica mpp columnar dbms

Results …

Elapsed Time First projection GROUPBY HASH (SORT OUTPUT)

Time: First fetch (7 rows): 264527.916 ms. All rows formatted: 264527.978 ms

Elapsed Time Second projection GROUPBY PIPELINED

Time: First fetch (7 rows): 38913.909 ms. All rows formatted: 38913.965 ms
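The two plans differ only at PATH ID 2: GROUPBY HASH versus GROUPBY PIPELINED. A rough sketch of the difference, using invented helper names and sample rows rather than anything from Vertica's engine: the hash variant must keep every group in memory, while the pipelined variant streams over input that is already sorted by the grouping columns and only compares each row with the previous one.

```python
from itertools import groupby

def count_distinct_hash(rows):
    """Hash group-by: accepts any input order, holds every group in memory."""
    seen = {}
    for account, session in rows:
        seen.setdefault(account, set()).add(session)
    return {account: len(sessions) for account, sessions in seen.items()}

def count_distinct_pipelined(sorted_rows):
    """Pipelined group-by: requires input sorted by (account, session)."""
    result = {}
    for account, group in groupby(sorted_rows, key=lambda row: row[0]):
        distinct = 0
        previous = object()  # sentinel that equals no session id
        for _, session in group:
            if session != previous:  # a new run starts a new distinct value
                distinct += 1
                previous = session
        result[account] = distinct
    return result

# Hypothetical (LP_ACCOUNT_ID, VS_LP_SESSION_ID) pairs.
rows = [("acc2", "s9"), ("acc1", "s1"), ("acc1", "s2"),
        ("acc2", "s9"), ("acc1", "s1")]

print(count_distinct_hash(rows))
print(count_distinct_pipelined(sorted(rows)))
```

In the measurements above, the pipelined plan finished in 38913.909 ms against 264527.916 ms for the hash plan, roughly a 6.8x improvement from changing only the projection's sort order.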

Page 16: Vertica mpp columnar dbms

Join Example

select a12.DT_WEEK AS DT_WEEK,

a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,

count(distinct a11.VS_LP_SESSION_ID) AS Visits,

(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1

from zzz.FACT_VISIT a11

join zzz.DIM_DATE_TIME a12

on (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)

where (a11.LP_ACCOUNT_ID in ('57386690')

and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')

group by a12.DT_WEEK,

a11.LP_ACCOUNT_ID

Filter: LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC

Group By: DT_WEEK, LP_ACCOUNT_ID

Join: VISIT_FROM_DT_TRUNC, DATE_TIME_ID

Select: DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID

Page 17: Vertica mpp columnar dbms

Full Explain Plan…

Access Path:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)

| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)

| Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID

| Execute on: All Nodes

| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)

| | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

| | Execute on: All Nodes

| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)

| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)

| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID

| | | Execute on: All Nodes

| | | +-- Outer -> STORAGE ACCESS for a11 [Cost: 421K, Rows: 372M (NO STATISTICS)] (PATH ID: 4)

| | | | Projection: zzz.FACT_VISIT_b0

| | | | Materialize: a11.VISIT_FROM_DT_TRUNC

| | | | Filter: (a11.LP_ACCOUNT_ID = '57386690')

| | | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))

| | | | Execute on: All Nodes

| | | +-- Inner -> STORAGE ACCESS for a12 [Cost: 1K, Rows: 10K (NO STATISTICS)] (PATH ID: 5)

| | | | Projection: zzz.DIM_DATE_TIME_node0004

| | | | Materialize: a12.DATE_TIME_ID, a12.DT_WEEK

| | | | Filter: ((a12.DATE_TIME_ID >= '2011-09-01 15:28:00'::timestamp) AND (a12.DATE_TIME_ID <= '2011-12-31 12:52:50'::timestamp))

| | | | Execute on: All Nodes

Page 18: Vertica mpp columnar dbms

Explain Plan (excerpt)…

Access Path:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 14M, Rows: 5M (NO STATISTICS)] (PATH ID: 1)

| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)

| Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID

| Execute on: All Nodes

| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 6M, Rows: 100M (NO STATISTICS)] (PATH ID: 2)

| | Group By: a12.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

| | Execute on: All Nodes

| | +---> JOIN HASH [Cost: 944K, Rows: 372M (NO STATISTICS)] (PATH ID: 3)

| | | Join Cond: (a11.VISIT_FROM_DT_TRUNC = a12.DATE_TIME_ID)

| | | Materialize at Output: a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID

| | | Execute on: All Nodes

Time: First fetch (6 rows): 56654.894 ms. All rows formatted: 56654.988 ms

Page 19: Vertica mpp columnar dbms

Solution One - Functions

select week(a11.VISIT_FROM_DT_TRUNC) AS DT_WEEK,

a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,

count(distinct a11.VS_LP_SESSION_ID) AS Visits,

(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1

from zzz.FACT_VISIT a11

where (a11.LP_ACCOUNT_ID in ('57386690')

and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')

group by week(a11.VISIT_FROM_DT_TRUNC),

a11.LP_ACCOUNT_ID;

Access Path:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 127, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)

| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)

| Group By: <SVAR>, a11.LP_ACCOUNT_ID

| Execute on: All Nodes

| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 126, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)

| | Group By: (date_part('week', a11.VISIT_FROM_DT_TRUNC))::int, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

| | Execute on: All Nodes

| | +---> STORAGE ACCESS for a11 [Cost: 125, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)

| | | Projection: zzz.FACT_VISIT_b0

Time: First fetch (6 rows): 33453.997 ms. All rows formatted: 33454.154 ms

Saved the Join Time

Page 20: Vertica mpp columnar dbms

Solution Two - PreJoin Projection

Pros

• Eliminates join overhead

• Maintained by Vertica

Cons

• Not flexible

• Causes overhead on load

• Needs primary/foreign keys

• Maintenance restrictions

Page 21: Vertica mpp columnar dbms

Access Path:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 12K, Rows: 10K] (PATH ID: 1)

| Aggregates: count(DISTINCT visit_date_time_prejoin8_b0.VS_LP_SESSION_ID)

| Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID

| Execute on: All Nodes

| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 11K, Rows: 10K] (PATH ID: 2)

| | Group By: visit_date_time_prejoin8_b0.DT_WEEK, visit_date_time_prejoin8_b0.LP_ACCOUNT_ID, visit_date_time_prejoin8_b0.VS_LP_SESSION_ID

| | Execute on: All Nodes

| | +---> STORAGE ACCESS for <No Alias> [Cost: 8K, Rows: 1M] (PATH ID: 3)

| | | Projection: lp_15744040.visit_date_time_prejoin8_b0

Solution Two - PreJoin Projection, ordered by LP_ACCOUNT_ID, VISIT_FROM_DT_TRUNC, DT_WEEK, HOT_LEAD_IND, DATE_TIME_ID, VS_LP_SESSION_ID

Time: First fetch (6 rows): 35312.331 ms. All rows formatted: 35312.421 ms

Saved the Join Time

Page 22: Vertica mpp columnar dbms

Access Path:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 542K, Rows: 10K] (PATH ID: 1)

| Aggregates: count(DISTINCT visit_date_time_prejoin_z6.VS_LP_SESSION_ID)

| Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.LP_ACCOUNT_ID

| Execute on: All Nodes

| +---> GROUPBY PIPELINED [Cost: 542K, Rows: 10K] (PATH ID: 2)

| | Group By: visit_date_time_prejoin_z6.DT_WEEK, visit_date_time_prejoin_z6.VS_LP_SESSION_ID, visit_date_time_prejoin_z6.LP_ACCOUNT_ID

| | Execute on: All Nodes

| | +---> STORAGE ACCESS for <No Alias> [Cost: 501K, Rows: 15M] (PATH ID: 3)

| | | Projection: lp_15744040.visit_date_time_prejoin_z6

| |

Solution Two - PreJoin Projection, sorted by DT_WEEK, LP_ACCOUNT_ID, VS_LP_SESSION_ID

Time: First fetch (6 rows): 3680.853 ms. All rows formatted: 3680.969 ms

Saved the Join Time and Group by hash Time

Page 23: Vertica mpp columnar dbms

Solution Three - Denormalize

select DT_WEEK,

a11.LP_ACCOUNT_ID AS LP_ACCOUNT_ID,

count(distinct a11.VS_LP_SESSION_ID) AS Visits,

(count(distinct a11.VS_LP_SESSION_ID) * 1.0) AS WJXBFS1

from zzz.FACT_VISIT_Z1 a11

where (a11.LP_ACCOUNT_ID in ('57386690')

and a11.VISIT_FROM_DT_TRUNC between '2011-09-01 15:28:00' and '2011-12-31 12:52:50')

group by DT_WEEK,

a11.LP_ACCOUNT_ID;

Access Path:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 1)

| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)

| Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID

| Execute on: All Nodes

| +---> GROUPBY HASH (SORT OUTPUT) [Cost: 3M, Rows: 10K (NO STATISTICS)] (PATH ID: 2)

| | Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

| | Execute on: All Nodes

| | +---> STORAGE ACCESS for a11 [Cost: 2M, Rows: 372M (NO STATISTICS)] (PATH ID: 3)

| | | Projection: zzz.FACT_VISIT_Z1_super

Time: First fetch (6 rows): 33885.178 ms. All rows formatted: 33885.253 ms

Saved the Join Time

Page 24: Vertica mpp columnar dbms

Solution Three - Denormalize

• Changing the projection sort order

Access Path:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 588K, Rows: 10K] (PATH ID: 1)

| Aggregates: count(DISTINCT a11.VS_LP_SESSION_ID)

| Group By: a11.DT_WEEK, a11.LP_ACCOUNT_ID

| Execute on: All Nodes

| +---> GROUPBY PIPELINED [Cost: 587K, Rows: 10K] (PATH ID: 2)

| | Group By: a11.DT_WEEK, a11.VS_LP_SESSION_ID, a11.LP_ACCOUNT_ID

| | Execute on: All Nodes

| | +---> STORAGE ACCESS for a11 [Cost: 531K, Rows: 20M] (PATH ID: 3)

| | | Projection: zzz.fact_visit_z1_pipe

| | | Materialize: a11.DT_WEEK, a11.LP_ACCOUNT_ID, a11.VS_LP_SESSION_ID

| | | Filter: (a11.LP_ACCOUNT_ID = '57386690')

| | | Filter: ((a11.VISIT_FROM_DT_TRUNC >= '2011-09-01 15:28:00'::timestamp) AND (a11.VISIT_FROM_DT_TRUNC <= '2011-12-31 12:52:50'::timestamp))

| | | Execute on: All Nodes

Time: First fetch (6 rows): 4313.497 ms. All rows formatted: 4313.600 ms

Saved the Join Time and Group by hash Time
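The timings reported across the join example can be put side by side. The short script below only does the arithmetic on the first-fetch times quoted in the slides; the labels are paraphrases, and the baseline is the original hash-join query (56654.894 ms).

```python
BASELINE_MS = 56654.894  # original query: hash join + hash group-by

# First-fetch times copied from the slides, in milliseconds.
timings_ms = {
    "function instead of join": 33453.997,
    "prejoin projection, original sort order": 35312.331,
    "prejoin projection, sorted for the group-by": 3680.853,
    "denormalized table, super projection": 33885.178,
    "denormalized table, sorted for the group-by": 4313.497,
}

speedups = {name: BASELINE_MS / ms for name, ms in timings_ms.items()}

for name, factor in sorted(speedups.items(), key=lambda item: item[1]):
    print(f"{name}: {factor:.1f}x")
```

Removing the join alone gives well under 2x in every variant. The order-of-magnitude gains (about 13x and 15x) appear only when the sort order also lets the inner group-by run pipelined.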

Page 25: Vertica mpp columnar dbms

Let's sum it up…

• Keep it simple.

• Keep it sorted.

• Keep it joinless.

Page 26: Vertica mpp columnar dbms

Questions ?

Page 27: Vertica mpp columnar dbms

Thank You