TPC-H Column Store and MPP systems

TPC-H Performance MPP & Column Store

Description

A TPC-H focused presentation on the challenges introduced by the benchmark, market trends, new technologies, and more.

Transcript of TPC-H Column Store and MPP systems

Page 1: TPC-H Column Store and MPP systems

TPC-H Performance MPP & Column Store

Page 2: TPC-H Column Store and MPP systems

What is TPC-H?

The TPC Benchmark™ H (TPC-H) is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance while maintaining a sufficient degree of ease of implementation. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream and the query throughput when queries are submitted by multiple concurrent users.

Page 3: TPC-H Column Store and MPP systems

Overview

TPC-H Schema overview

TPC-H Performance measurements

Partner engagement

TPC-H where is it today

TPC-H challenges

Looking ahead

Q&A

Page 4: TPC-H Column Store and MPP systems

TPC-H Schema overview: Relationships between columns

Page 5: TPC-H Column Store and MPP systems

TPC-H Schema overview: MPP data distribution

Example distribution of key values across three nodes:

Table      Column      Node 1   Node 2   Node 3
LINEITEM   ORDERKEY    1        2        3
           PARTKEY     6        4        8
           SUPPKEY     3        18       5
ORDERS     ORDERKEY    1        2        3
           CUSTKEY     4        2        9
PARTSUPP   PARTKEY     1        2        3
           SUPPKEY     4        5        6
PART       PARTKEY     1        2        3
CUSTOMER   CUSTKEY     1        2        3
SUPPLIER   SUPPKEY     1..N     1..N     1..N
NATION     NATIONKEY   1..N     1..N     1..N
REGION     REGIONKEY   1..N     1..N     1..N

Joins on the distribution column are collocated; joins on any other column require over-the-network data movement.

Table      Distribution column
LINEITEM   L_ORDERKEY
ORDERS     O_ORDERKEY
PARTSUPP   PS_PARTKEY
PART       P_PARTKEY
CUSTOMER   C_CUSTKEY
SUPPLIER   REPLICATED
NATION     REPLICATED
REGION     REPLICATED
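The collocation above follows directly from hash distribution: rows of LINEITEM and ORDERS that share an ORDERKEY land on the same node, so that join needs no data movement, while joins on non-distribution columns do. A toy sketch of the placement rule (hypothetical hash and node count, not any particular engine's):

#include <cstdint>
#include <functional>

// Node placement for a hash-distributed table: rows with equal distribution
// keys always land on the same node, which is what makes
// LINEITEM x ORDERS on ORDERKEY a collocated join.
uint32_t node_for_key(uint64_t distribution_key, uint32_t num_nodes) {
    return static_cast<uint32_t>(std::hash<uint64_t>{}(distribution_key) % num_nodes);
}

// Example: l_orderkey == o_orderkey == 42 maps to the same node for both
// tables, whereas LINEITEM and PARTSUPP (distributed on PS_PARTKEY) do not
// line up and require a shuffle.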

Page 6: TPC-H Column Store and MPP systems

TPC-H Schema: Metrics

Power:

Run order: RF1 (inserts into LINEITEM and ORDERS), the 22 read-only queries, RF2 (deletes from LINEITEM & ORDERS)

Metric: query-per-hour rate, TPC-H Power@Size = 3600 * SF / Geomean(22 queries, RF1, RF2)

Because the metric uses the geometric mean of all query timings in a run, a performance improvement to any query improves the metric equally.
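Written out, per the TPC-H specification (QI(i,0) is the timing interval of query i and RI(j,0) of refresh function j in the power run, in seconds):

Power@Size = 3600 * SF / ( QI(1,0) * ... * QI(22,0) * RI(1,0) * RI(2,0) )^(1/24)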

Throughput:

Run order: S concurrent power-style query streams, each with different substitution parameters, plus S pairs of RF1 & RF2 streams; the refresh streams can run in parallel with the query streams or after them.

Metric: ratio of the total number of queries executed to the length of the measurement interval, TPC-H Throughput@Size = (S*22*3600)/Ts * SF (S = number of streams, Ts = measurement interval in seconds)

Here absolute runtime matters, so optimizing the longest-running query helps.
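The reported composite metric is the geometric mean of the two:

QphH@Size = sqrt( Power@Size * Throughput@Size )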

[Diagram: the power run executes query stream 00 (queries in the order 14, 2, 9, 20, 6, …, 5, 7, 12) between RF1 (inserts into LINEITEM & ORDERS) and RF2 (deletes from LINEITEM & ORDERS); the throughput run executes query streams 01..N in parallel with a refresh stream of N RF1/RF2 pairs.]

Scale Factor   Number of streams
100            5
300            6
1000           7
3000           8
10000          9
30000          10
100000         11

Page 7: TPC-H Column Store and MPP systems

Outline

TPC-H Schema overview

TPC-H Performance measurements

Partner engagement

TPC-H where is it today

TPC-H challenges

Looking ahead

Q&A

Page 8: TPC-H Column Store and MPP systems

TPC-H Performance measurements

Invest in tools to analyze plans; some consider plan analysis an art, but breaking the plan down into key metrics helps a lot.

Capture enough information in the execution plan to unveil performance issues:
Estimated vs. actual number of rows
Amount of data spilled per disk
Rows touched vs. rows qualified during the scan
Logical vs. physical reads
CPU & memory consumed per plan operator
Skew in the number of rows processed per thread per operator

Instrument the code to provide cycles per row for key scenarios: scan, aggregate, join.
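A minimal sketch of that kind of instrumentation (x86-only, using the GCC/Clang __rdtsc intrinsic from <x86intrin.h>; the kernel is hypothetical, and TSC/frequency-scaling caveats are ignored):

#include <cstdint>
#include <cstdio>
#include <vector>
#include <x86intrin.h>   // __rdtsc (GCC/Clang; MSVC provides it via <intrin.h>)

// A trivial scan kernel; the same timing wrapper can surround aggregate
// and join kernels to report cycles per row for each scenario.
int64_t scan_sum(const std::vector<int64_t>& column) {
    int64_t sum = 0;
    for (int64_t v : column) sum += v;
    return sum;
}

int main() {
    std::vector<int64_t> column(100000000, 1);
    uint64_t start = __rdtsc();
    int64_t sum = scan_sum(column);
    uint64_t cycles = __rdtsc() - start;
    std::printf("sum=%lld cycles/row=%.2f\n",
                static_cast<long long>(sum),
                static_cast<double>(cycles) / column.size());
    return 0;
}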

[Diagram: performance cycle: set performance goals, measure performance, start looking at SMP & MPP plans, check CPU & IO utilization, fix performance issues, repeat.]

Page 9: TPC-H Column Store and MPP systems

TPC-H Performance measurements

Scalability within a single server:
Vary the number of processors
Vary the scale factor: 100 GB, 300 GB
Identify queries that do not scale linearly
Capture:
  CPU & IO utilization per query with at least a 1-second sampling rate
  Hot functions and waits, if any
  CPI, ideally per function
  Execution plans
Get busy crunching the data

Scalability across multiple servers:
Vary the number of servers in the system
Vary the amount of data per server
Capture: CPU, disk & network IO, and distributed plans
Look for queries that have excessive cross-node traffic
Identify suboptimal plans where predicates/aggregates are not pushed down

[Diagram: SMP scaling, data scaling, and MPP scaling feed a more focused performance effort.]

Page 10: TPC-H Column Store and MPP systems

Outline

TPC-H Schema overview

TPC-H Performance measurements

Partner engagement

TPC-H where is it today

TPC-H challenges

Looking ahead

Q&A

Page 11: TPC-H Column Store and MPP systems

Partner engagements

Partner engagement can be considered one of the secret sauces of high-performing software.

Partners (HW/infrastructure) tend to have a vested interest in showcasing the performance and scalability of their products.

It allows software companies to leverage HW expertise and get access to low-level tools that are not publicly available (under NDA).

Partners occasionally provide HW for performance benchmarks, prototype evaluation, and release publications.

Partners can be a great assist for:
Providing low-level analysis
Collaborating on publications, benchmarks, proofs of concept, etc.
Providing HW for performance testing, evaluation, and improvement (large-scale experiments are expensive)

Page 12: TPC-H Column Store and MPP systems

Partner engagements

NVRAM: random-access memory that retains its information when power is turned off (non-volatile), in contrast to dynamic random-access memory (DRAM).

"Promises":
Latency within the same order of magnitude as DRAM
Cheaper than SSDs
10+ TB of NVRAM in a 2-socket system within the next 4 years
Still in the prototype phase
Could eliminate the need for spinning disks or SSDs altogether

In-memory databases are likely to be early adopters of such technology.

Good reading:
http://research.microsoft.com/en-us/events/trios/trios13-final5.pdf
http://www.hpl.hp.com/techreports/2013/HPL-2013-78R1.pdf

Page 13: TPC-H Column Store and MPP systems

Partner engagements: Diablo Technologies SSD in a DRAM slot. http://www.diablo-technologies.com/

Page 14: TPC-H Column Store and MPP systems

Partner engagements: Diablo Technologies SSD in a DRAM slot

DIMM capacities of 200 GB & 400 GB; the technology is rebranded by IBM and is VMware Ready. http://www.diablo-technologies.com/

Page 15: TPC-H Column Store and MPP systems

Outline

TPC-H Schema overview

TPC-H Performance measurements

Partner engagement

TPC-H where is it today

TPC-H challenges

Looking ahead

Q&A

Page 16: TPC-H Column Store and MPP systems

TPC-H where is it today

Why do benchmarks? To stimulate technological advancements.

Why TPC-H? It introduces a set of technological challenges whose resolution significantly improves the performance of the product.

As a benchmark, is it relevant to current DW applications?

Gartner Magic Quadrant reference: "Vectorwise delivered leading 1TB non-clustered TPC Benchmark H (TPC-H) results in 2012"

Big players are Oracle, Vectorwise, Microsoft, Exasol and ParAccel.

The most significant innovation came from:
Kickfire, acquired by Teradata: FPGA-based "Query Processor Module" with an instruction set tuned for database operations
ParAccel, acquired by Actian: shared-nothing architecture with a columnar orientation, adaptive compression, and a memory-centric design
Exasol: column-oriented storage and proprietary in-memory compression methods; the database also has automatic self-optimization (creates indexes and stats, distributes tables, etc.)

So where does it come in handy?
Identify system bottlenecks
Push performance-focused features into the product
The TPC-H schema is heavily used for ETL and virtualization benchmarks
It introduces lots of interesting challenges to the DBMS

What about TPC-DS? It has a more realistic ETL process and a snowflake schema, but no one has published a TPC-DS benchmark yet.

Page 17: TPC-H Column Store and MPP systems

TPC-H where is it today

Number of publications is on the decline

Number of TPC-H publications per year:

Year    1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013
Count   9     1     5     12    31    15    42    31    20    13    15    10    20    5     6

• First cloud-based benchmark? When will we see this?

Page 18: TPC-H Column Store and MPP systems

Outline

TPC-H Schema overview

TPC-H Performance measurements

Partner engagement

TPC-H where is it today

TPC-H challenges

Looking ahead

Q&A

Page 19: TPC-H Column Store and MPP systems

TPC-H challenges: Aggregation

Almost all TPC-H queries do aggregation.

Unless there is a sorted index (B-tree) on the group-by column, aggregating in a hash table makes more sense than ordered aggregation.

Correctly sizing the hash table dictates performance: if cardinality estimation underestimates the number of distinct values, lots of chaining occurs and the hash table can eventually spill to disk; if it overestimates, resources are not used optimally.

For a low distinct count, building a hash table per thread (local) and then doing a global aggregation improves performance (see the sketch below this list).

For a small group-by on strings, represent the group-by expressions as integers (an index into an array) instead of using a hash table (reduces the cache footprint).

For a group-by on a primary key (C_CUSTKEY) there is no need to include the other CUSTOMER columns in the hash table.

The main benefit of PK/FK constraints is aggregate optimization.

Queries sensitive to aggregation performance: 1, 3, 4, 10, 13, 18, 20, 21
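A minimal sketch of the local-then-global pattern (hypothetical types; a real engine would aggregate compressed column batches rather than std containers):

#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <unordered_map>
#include <vector>

// Per-thread (local) aggregation followed by a single global merge.
// Works well for a low number of distinct group keys: each local hash
// table stays small and cache resident, and threads never contend.
using GroupKey = int32_t;   // e.g. dictionary code for (l_returnflag, l_linestatus)

static void local_aggregate(const std::vector<GroupKey>& keys,
                            const std::vector<double>& values,
                            size_t begin, size_t end,
                            std::unordered_map<GroupKey, double>& local) {
    for (size_t i = begin; i < end; ++i)
        local[keys[i]] += values[i];
}

std::unordered_map<GroupKey, double>
parallel_sum_by_key(const std::vector<GroupKey>& keys,
                    const std::vector<double>& values,
                    unsigned num_threads) {
    std::vector<std::unordered_map<GroupKey, double>> partials(num_threads);
    std::vector<std::thread> workers;
    size_t chunk = keys.size() / num_threads + 1;
    for (unsigned t = 0; t < num_threads; ++t) {
        size_t begin = t * chunk;
        size_t end = std::min(keys.size(), begin + chunk);
        workers.emplace_back(local_aggregate, std::cref(keys), std::cref(values),
                             begin, end, std::ref(partials[t]));
    }
    for (auto& w : workers) w.join();

    // Global aggregation: merge the small per-thread tables.
    std::unordered_map<GroupKey, double> global;
    for (const auto& part : partials)
        for (const auto& [key, sum] : part)
            global[key] += sum;
    return global;
}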

Page 20: TPC-H Column Store and MPP systems

TPC-H challenges: Aggregation

Q1: Reduces 6 billion rows to 4. Sensitive to string matching. Benefits from doing local aggregation.

Q10: Group-by on most CUSTOMER columns. If a PK on C_CUSTKEY exists, C_CUSTKEY alone could be used for the aggregation. Further optimization: push down the aggregate on O_CUSTKEY and the TOP.

Q18: Group-by on L_ORDERKEY results in 1.5 billion rows (a 4x reduction). Local aggregation usually hurts performance. The hash table for the aggregation alone can take 25 GB of RAM.

Page 21: TPC-H Column Store and MPP systems

TPC-H challenges: Joins

Select a schema which leverages locality. Example: ORDERS x LINEITEM on L_ORDERKEY = O_ORDERKEY becomes collocated by hash-partitioning both tables on ORDERKEY.

Q5, Q9 and Q18 can spill and show bad performance if the correct plan is not picked.

Q9 causes over-the-network communication on MPP systems unless PARTSUPP, PART and SUPPLIER are replicated, which is not feasible for large scale factors.

TPC-H joins are highly selective, hence efficient Bloom filters are necessary (see the sketch below).

Simplistic guide: find the most selective filter/aggregation, and that is where you start.
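A minimal Bloom-filter sketch of the build/probe pattern (hypothetical sizing and hash mix): build it from the already-filtered build side of the join, then probe it during the LINEITEM scan so non-qualifying rows are dropped before the join, or before being shuffled across the network:

#include <cstdint>
#include <vector>

// Tiny Bloom filter over 64-bit join keys, two probes per key.
class BloomFilter {
    std::vector<uint64_t> bits_;
    uint64_t mask_;
public:
    explicit BloomFilter(uint64_t num_bits_pow2)       // must be a power of two >= 64
        : bits_(num_bits_pow2 / 64), mask_(num_bits_pow2 - 1) {}

    static uint64_t mix(uint64_t x) {                  // cheap integer hash
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
        return x;
    }
    void insert(uint64_t key) {
        uint64_t h1 = mix(key) & mask_;
        uint64_t h2 = mix(key * 0x9e3779b97f4a7c15ULL) & mask_;
        bits_[h1 >> 6] |= 1ULL << (h1 & 63);
        bits_[h2 >> 6] |= 1ULL << (h2 & 63);
    }
    bool may_contain(uint64_t key) const {
        uint64_t h1 = mix(key) & mask_;
        uint64_t h2 = mix(key * 0x9e3779b97f4a7c15ULL) & mask_;
        return ((bits_[h1 >> 6] >> (h1 & 63)) & 1) && ((bits_[h2 >> 6] >> (h2 & 63)) & 1);
    }
};

// Build on the filtered build side (e.g. qualifying O_ORDERKEY values),
// probe while scanning LINEITEM; only probable hits reach the real join.
std::vector<size_t> probe_scan(const BloomFilter& bf,
                               const std::vector<uint64_t>& l_orderkey) {
    std::vector<size_t> qualifying_rows;
    for (size_t i = 0; i < l_orderkey.size(); ++i)
        if (bf.may_contain(l_orderkey[i]))
            qualifying_rows.push_back(i);
    return qualifying_rows;
}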

Page 22: TPC-H Column Store and MPP systems

TPC-H challenges: Expression evaluation

Arithmetic operation performance:
Store decimals as integers and save some bits (19123 vs. 191.23)
Rebase some of the columns to use fewer bits
Keep data in the most compact form to best exploit SIMD instructions

Detecting common sub-expressions (see the sketch below):
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge

Expression filter push-down (Q7, Q19):
Q7: take the superset or UNION of the filters and push it down to the scan
Q19: take the union of the individual predicates

Column projection vs. expression evaluation:
Cardinality estimates should help decide whether to project columns A & B or (A * (1 - B)) before a filter on C
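A sketch of the first two ideas applied to the Q1 aggregates (hypothetical fixed-point scales, and a 128-bit accumulator which is a GCC/Clang extension): prices, discounts and taxes are kept as scaled integers, and l_extendedprice*(1-l_discount) is computed once so sum_disc_price and sum_charge share the common sub-expression:

#include <cstdint>

// l_extendedprice scaled by 10^2 (191.23 -> 19123),
// l_discount and l_tax scaled by 10^2 (0.06 -> 6).
struct Q1Accumulators {
    int64_t  sum_base_price = 0;   // scale 10^2
    int64_t  sum_disc_price = 0;   // scale 10^4
    __int128 sum_charge     = 0;   // scale 10^6; 128-bit to stay safe at SF-scale row counts
};

inline void accumulate(Q1Accumulators& acc,
                       int64_t extendedprice, int64_t discount, int64_t tax) {
    acc.sum_base_price += extendedprice;
    // Common sub-expression: disc_price = l_extendedprice * (1 - l_discount)
    int64_t disc_price = extendedprice * (100 - discount);
    acc.sum_disc_price += disc_price;
    // sum_charge reuses disc_price instead of recomputing the product.
    acc.sum_charge += static_cast<__int128>(disc_price) * (100 + tax);
}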

Page 23: TPC-H Column Store and MPP systems

TPC-H challenges: Correlated subqueries

Push down of predicates into the subquery when applicable

When subqueries are flattened, batch processing outperforms row-by-row execution

Buffer overlapped intermediate results

Partial query reuse

Challenging for MPP systems (don’t redistribute or shuffle the same data twice)

Page 24: TPC-H Column Store and MPP systems

TPC-H challenges: Parallelism and concurrency

Current 2-socket servers have 48+ cores, ½+ TB of RAM & 10+ GB/sec of disk IO bandwidth, which means that even within a single box the engine needs to provide meaningful scaling.

Further sub-partitioning data on a single server alleviates single-server scaling problems.

TPC-H queries tend to use lots of workspace memory for joins and aggregations.

Precise and dynamic memory allocation keeps queries from spilling to disk under high concurrency.

Page 25: TPC-H Column Store and MPP systems

TPC-H challenges: Scan performance

Disk read performance is crucial; validate that when the system is not CPU bound, the IO subsystem is used efficiently.

The ability to filter out pages or segments from the scan is crucial.

In-memory scan performance can be increased by decreasing the search scope and thereby the amount of data that needs to be streamed from main memory to the CPU.

Page 26: TPC-H Column Store and MPP systems

TPC-H challenges: Scan performance

Store dictionaries in sorted order or in a BST in order to:
Compress the filter or predicate and do numeric comparisons instead of decompressing and matching on strings
Quickly validate whether the value exists in the segment at all
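A sketch of that predicate rewrite against a sorted per-segment dictionary (hypothetical layout): binary-search the constant once per segment, then compare integer codes during the scan; a failed lookup also proves the value does not occur in the segment, so the segment can be skipped:

#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Sorted per-segment dictionary: code i encodes sorted_values[i].
struct SortedDictionary {
    std::vector<std::string> sorted_values;

    std::optional<uint32_t> lookup(const std::string& value) const {
        auto it = std::lower_bound(sorted_values.begin(), sorted_values.end(), value);
        if (it == sorted_values.end() || *it != value) return std::nullopt;
        return static_cast<uint32_t>(it - sorted_values.begin());
    }
};

// Equality filter on the encoded column: compare integer codes, never strings.
size_t count_equal(const SortedDictionary& dict,
                   const std::vector<uint32_t>& codes,
                   const std::string& predicate_value) {
    auto code = dict.lookup(predicate_value);
    if (!code) return 0;                       // value absent: skip the whole segment
    size_t matches = 0;
    for (uint32_t c : codes) matches += (c == *code);
    return matches;
}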

Page 27: TPC-H Column Store and MPP systems

TPC-H challenges: Scan performance

What do we do for highly selective filters? Implement paged indexes for the columns of interest.

Partition a column into pages and store bitmap indexes per compressed value, where the bits reflect which rows have the respective value; instead of scanning the entire segment for the matching rows, we only read the blocks that have matching values, i.e. bits set. http://db.disi.unitn.eu/pages/VLDBProgram/pdf/IMDM/paper2.pdf
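A simplified sketch of the idea (hypothetical structure, with the per-value bitmaps kept at page rather than row granularity): for each dictionary code, a bitmap records which pages of the segment contain that code, so a highly selective equality filter only decodes the pages whose bit is set:

#include <algorithm>
#include <cstdint>
#include <vector>

// page_bits[code] is a bitmap over pages: bit p is set iff `code` occurs in page p.
struct PagedValueIndex {
    size_t rows_per_page;
    std::vector<std::vector<uint64_t>> page_bits;

    bool code_in_page(uint32_t code, size_t page) const {
        return (page_bits[code][page >> 6] >> (page & 63)) & 1;
    }
};

// Equality filter on the encoded column: pages without the value are skipped entirely.
std::vector<size_t> filter_eq(const PagedValueIndex& idx,
                              const std::vector<uint32_t>& codes,   // encoded column segment
                              uint32_t code) {
    std::vector<size_t> rows;
    size_t num_pages = (codes.size() + idx.rows_per_page - 1) / idx.rows_per_page;
    for (size_t page = 0; page < num_pages; ++page) {
        if (!idx.code_in_page(code, page)) continue;    // whole page skipped
        size_t begin = page * idx.rows_per_page;
        size_t end = std::min(codes.size(), begin + idx.rows_per_page);
        for (size_t row = begin; row < end; ++row)
            if (codes[row] == code) rows.push_back(row);
    }
    return rows;
}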

Page 28: TPC-H Column Store and MPP systems

TPC-H challenges: Intermediate steps in MPP

In MPP, a single SQL statement results in multiple SQL statements that get executed locally on each node.

Some TPC-DS queries can result in 20+ SQL statements that need to be executed locally on each leaf node.

Streaming the data should result in better performance, but there are cases where this strategy fails.

Placing data on disk after each step allows the query optimizer to reevaluate the plan.

Page 29: TPC-H Column Store and MPP systems

TPC-H challenges: Improving join performance for incompatible joins

Query:

select count(*)
from PART, PARTSUPP, LINEITEM
where P_BRAND = 'NIKE'
  and PS_COMMENT like '%bla%'
  and P_PARTKEY = PS_PARTKEY
  and L_PARTKEY = PS_PARTKEY
group by P_BRAND

Schema: PART distributed on P_PARTKEY, PARTSUPP distributed on PS_PARTKEY, LINEITEM distributed on L_ORDERKEY

Create Bloom filter BF1 on PART, push the filter into PARTSUPP and create BF2, replicate the Bloom filter to all leaf nodes, apply the filter on LINEITEM and only shuffle the qualifying rows.

The optimizer should choose between semi-join reduction and replicating PART x PARTSUPP.

Multiple copies of a set of columns, distributed differently, can improve performance for such cases, but at a high cost.

Page 30: TPC-H Column Store and MPP systems

Outline

TPC-H Schema overview

TPC-H Performance measurements

Partner engagement

TPC-H where is it today

TPC-H challenges

Looking ahead

Q&A

Page 31: TPC-H Column Store and MPP systems

Looking ahead

SQL to map-reduce jobs? Crunching data in a relational database is always faster than Hadoop; bring data from Hadoop into columnar format and perform the analytics with efficient generated code.

Full integration with analytics tools such as SAS, R, Tableau, Excel, etc.

Support PL/SQL syntax (compete with Oracle).

Eliminate the aggregating node to reduce system cost for a small number of nodes; Exasol does this.

Page 32: TPC-H Column Store and MPP systems

Competitive analysis

TPC-H Q1 analysis: Sec/GB/Thread (lower is better), assuming all processors have the same speed.

System                                            Sec/GB/Thread
Exasol 1TB, 240 threads, 20 processors            1.44
Exasol 1TB, 768 threads, 64 processors            1.4592
Exasol 3TB, 960 threads, 80 processors            1.504
MemSQL 83GB, 480 threads, 40 sockets              46.66544784
MS SQL Server 10TB, 160 threads, 8 processors     8.1152
Oracle 11g 10TB, 512 threads, 4 processors        40.68864

References:
http://www.tpc.org/tpch/results/tpch_perf_results.asp
http://www.esg-global.com/lab-reports/memsqle28099s-distributed-in-memory-database/

Page 34: TPC-H Column Store and MPP systems

TPC-H column store

Avoid virtual function calls and branching; use templates.

Scan usually dominates the CPU profile.

Vector/batch processing is a must.

If done correctly the code is very sensitive to branching and data dependencies; exploit instruction-level parallelism when possible.

Use SIMD instructions, and leverage existing libraries that encapsulate the complexity of SSE instructions:

    // define and initialize integer vectors a and b
    Vec4i a(10, 11, 12, 13);
    Vec4i b(20, 21, 22, 23);
    // add the two vectors
    Vec4i c = a + b;

http://www.agner.org/optimize/vectorclass.pdf
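The same idea with raw SSE2 intrinsics, as a sketch (GCC/Clang builtins assumed; a real scan kernel would also combine this with decompression): count the rows matching an equality filter four 32-bit values at a time:

#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Count rows where codes[i] == code, processing 4 values per iteration.
size_t count_eq_sse(const int32_t* codes, size_t n, int32_t code) {
    const __m128i needle = _mm_set1_epi32(code);
    size_t count = 0, i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(codes + i));
        __m128i eq = _mm_cmpeq_epi32(v, needle);              // 0xFFFFFFFF where equal
        int mask   = _mm_movemask_ps(_mm_castsi128_ps(eq));   // one bit per 32-bit lane
        count += __builtin_popcount(mask);
    }
    for (; i < n; ++i) count += (codes[i] == code);           // scalar tail
    return count;
}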

Page 35: TPC-H Column Store and MPP systems

TPC-H Plans

Behold the power of the optimizer: if the plan is wrong you are doomed… A very good read on TPC-H Q8:

http://www.slideshare.net/GraySystemsLab/pass-summit-2010-keynote-david-dewitt

Page 36: TPC-H Column Store and MPP systems

JSON documents

The most efficient way to store JSON documents: great compression and quick retrieval. Ask me how…

Page 37: TPC-H Column Store and MPP systems

Q1

Challenges:
Used as a benchmark for computational power
Arithmetic operation performance
Aggregating into the same hash buckets
Common sub-expression pattern matching
Scan performance sensitive
String matching for aggregation (could do the matching on the compressed format)

select
    l_returnflag, l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '[DELTA]' day (3)
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

Page 38: TPC-H Column Store and MPP systems

Q2

Challenges:
Correlated subquery
Push-down of predicates into the correlated subquery
Highly selective (segment size plays a big role)
Tricky to generate the optimal plan
Depending on which tables are partitioned and which are replicated, plan performance varies a lot

select
    s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from part, supplier, partsupp, nation, region
where p_partkey = ps_partkey
  and s_suppkey = ps_suppkey
  and p_size = [SIZE]
  and p_type like '%[TYPE]'
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = '[REGION]'
  and ps_supplycost = (
      select min(ps_supplycost)
      from partsupp, supplier, nation, region
      where p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = '[REGION]'
  )
order by s_acctbal desc, n_name, s_name, p_partkey;

Page 39: TPC-H Column Store and MPP systems

Q3

Challenges:
Collocated join between ORDERS & LINEITEM
Detect the correlation between shipdate and orderdate
Bitmap filters on LINEITEM are necessary
Replicating (select c_custkey from customer where c_mktsegment = '[SEGMENT]')

select TOP 10
    l_orderkey,
    sum(l_extendedprice*(1-l_discount)) as revenue,
    o_orderdate,
    o_shippriority
from customer, orders, lineitem
where c_mktsegment = '[SEGMENT]'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < date '[DATE]'
  and l_shipdate > date '[DATE]'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate;