PostgreSQL, performance for queries with grouping

PostgreSQL: Optimisation of queries with grouping

Alexey Bashtanov, Brandwatch

28 Jan 2016

What is it all about?

This talk will cover optimisation of:
- Grouping
- Aggregation

Unfortunately it will not cover optimisation of:
- Getting the data
- Filtering
- Joins
- Window functions
- Other data transformations

Outline

1 What is a grouping?

2 How does it work?
    Aggregation functions under the hood
    Grouping algorithms

3 Optimisation
    Avoiding sorts
    Summation
    Denormalized data aggregation
    Arg-maximum

4 Still slow?

What is a grouping?

What do we call a grouping/aggregation operation?

An operation that splits the input data into several classes and then compiles each class into one row.

Examples

    SELECT department_id,
           avg(salary)
    FROM employees
    GROUP BY department_id

    SELECT DISTINCT department_id
    FROM employees

    SELECT DISTINCT ON (department_id)
           department_id,
           employee_id,
           salary
    FROM employees
    ORDER BY department_id,
             salary DESC

    SELECT max(salary)
    FROM employees

    SELECT salary
    FROM employees
    ORDER BY salary DESC
    LIMIT 1

How does it work?

Aggregation functions under the hood

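[Diagram: the state starts at INITCOND; SFUNC is called once per input row, taking the current state and the row and returning the new state; FINALFUNC converts the last state into the result.]

An aggregate function is defined by:
- State, input and output types
- Initial state (INITCOND)
- Transition function (SFUNC)
- Final function (FINALFUNC)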

    SELECT sum(column1),
           avg(column1)
    FROM (VALUES (2), (3), (7)) _

    Step             sum                      avg
    INITCOND         state = 0                cnt = 0, sum = 0
    SFUNC            state += input           cnt++, sum += input
    after input 2    state = 2                cnt = 1, sum = 2
    after input 3    state = 5                cnt = 2, sum = 5
    after input 7    state = 12               cnt = 3, sum = 12
    FINALFUNC        sum = 12                 sum / cnt = 4

SFUNC and FINALFUNC functions can be written in:
- C: fast (SFUNC may modify the input state and return it)
- SQL
- PL/pgSQL: SLOW!
- any other language

SFUNC and FINALFUNC functions can be declared STRICT (i.e. not called on null input)
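
As a rough illustration of the pieces listed above, here is a small average aggregate over integers built from plain SQL functions. The names my_avg, my_avg_sfunc and my_avg_final are invented for this sketch, and the state is assumed to be a two-element bigint array {cnt, sum}; it is not how the built-in avg is implemented.

```sql
-- Hypothetical names; a sketch of INITCOND / SFUNC / FINALFUNC only.

-- Transition function: STRICT, so NULL inputs are skipped.
CREATE FUNCTION my_avg_sfunc (p_state bigint[], p_input int)
RETURNS bigint[] IMMUTABLE STRICT LANGUAGE sql AS
$$ SELECT array[p_state[1] + 1, p_state[2] + p_input] $$;

-- Final function: sum / cnt, or NULL when no rows were aggregated.
CREATE FUNCTION my_avg_final (p_state bigint[])
RETURNS numeric IMMUTABLE LANGUAGE sql AS
$$ SELECT CASE WHEN p_state[1] = 0 THEN NULL
               ELSE p_state[2]::numeric / p_state[1] END $$;

CREATE AGGREGATE my_avg (int) (
    SFUNC = my_avg_sfunc,
    STYPE = bigint[],
    FINALFUNC = my_avg_final,
    INITCOND = '{0,0}'   -- cnt = 0, sum = 0
);

SELECT my_avg(column1)
FROM (VALUES (2), (3), (7)) _;
```

Being written in SQL, this is expected to be much slower than the built-in avg; it only shows how the state travels through the calls.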

Grouping algorithms

PostgreSQL uses 2 algorithms to feed grouped data to aggregate functions:

- GroupAggregate: get the data sorted and apply the aggregate function to the groups one by one
- HashAggregate: store a state for each key in a hash table

GroupAggregate

[Animation: the sorted input is consumed group by group with a single running state; each time the group key changes, the finished group's result is emitted and the state is reset to 0.]

HashAggregate

[Animation: the unsorted input is consumed row by row; a hash table keeps one state per key, each row updates the state of its own key, and the results are returned once the whole input has been read.]

GroupAggregate vs. HashAggregate

GroupAggregate
− Requires sorted data
+ Needs less memory
+ Returns sorted data
+ Returns data on the fly
+ Can perform count(distinct ...), array_agg(... order by ...) etc.

HashAggregate
+ Accepts unsorted data
− Needs more memory
− Returns unsorted data
− Returns data at the end
− Can perform only basic aggregation

Optimisation

Avoiding sorts

Sorts are really slow. Prefer HashAggregate if possible.

What to do if you get something like this?

    EXPLAIN
    SELECT region_id,
           avg(age)
    FROM people
    GROUP BY region_id

    GroupAggregate (cost=149244.84..156869.46 rows=9969 width=10)
      -> Sort (cost=149244.84..151744.84 rows=1000000 width=10)
           Sort Key: region_id
           -> Seq Scan on people (cost=0.00..15406.00 rows=1000000 width=10)

    1504.474 ms

set enable_sort to off? No!

    GroupAggregate (cost=10000149244.84..10000156869.46 rows=9969 width=10)
      -> Sort (cost=10000149244.84..10000151744.84 rows=1000000 width=10)
           Sort Key: region_id
           -> Seq Scan on people (cost=0.00..15406.00 rows=1000000 width=10)

    1497.167 ms

Increase work_mem: set work_mem to '100MB'

    HashAggregate (cost=20406.00..20530.61 rows=9969 width=10)
      -> Seq Scan on people (cost=0.00..15406.00 rows=1000000 width=10)

    685.689 ms

Increase sanely to avoid OOM
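
One way to keep the increase sane is to raise work_mem only inside the transaction that runs the heavy query, rather than for the whole server. A minimal sketch; the temporary people table is an empty stand-in for the one on the slides, and 100MB is simply the value used above, not a recommendation.

```sql
-- Stand-in for the people table from the slides (assumed columns).
CREATE TEMP TABLE people (region_id int, age int);

BEGIN;
-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL work_mem = '100MB';

EXPLAIN
SELECT region_id,
       avg(age)
FROM people
GROUP BY region_id;

COMMIT;

-- Outside the transaction the previous value applies again.
SHOW work_mem;
```

Whether the planner then picks HashAggregate still depends on its row and group estimates.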

Avoiding sorts

How to spend less memory to allow HashAggregation?

Don’t aggregate joined

    SELECT p.region_id,
           r.region_description,
           avg(age)
    FROM people p
    JOIN regions r using (region_id)
    GROUP BY region_id,
             region_description

Join aggregated instead

    SELECT a.region_id,
           r.region_description,
           a.avg_age
    FROM (
        SELECT region_id,
               avg(age) avg_age
        FROM people p
        GROUP BY region_id
    ) a
    JOIN regions r using (region_id)

Avoiding sorts

How to avoid sorts for count(DISTINCT ...)?

    SELECT date_trunc('month', visit_date),
           count(DISTINCT visitor_id)
    FROM visits
    GROUP BY date_trunc('month', visit_date)

    GroupAggregate (actual time=7685.972..10564.358 rows=329 loops=1)
      -> Sort (actual time=7680.426..9423.331 rows=4999067 loops=1)
           Sort Key: (date_trunc('month'::text, visit_date))
           Sort Method: external merge Disk: 107496kB
           -> Seq Scan on visits (actual time=10.941..2966.460 rows=4999067 loops=1)

Avoiding sorts

Two levels of HashAggregate could be faster!

    SELECT visit_month,
           count(*)
    FROM (
        SELECT DISTINCT
               date_trunc('month', visit_date) as visit_month,
               visitor_id
        FROM visits
    ) _
    GROUP BY visit_month

    HashAggregate (actual time=2632.322..2632.354 rows=329 loops=1)
      -> HashAggregate (actual time=2496.010..2578.779 rows=329000 loops=1)
           -> Seq Scan on visits (actual time=0.060..1569.906 rows=4999067 loops=1)

Avoiding sorts

How to avoid sorts for array_agg(...ORDER BY ...)?

    SELECT visit_date,
           array_agg(visitor_id ORDER BY visitor_id)
    FROM visits
    GROUP BY visit_date

    GroupAggregate (actual time=5433.658..8010.309 rows=10000 loops=1)
      -> Sort (actual time=5433.416..6769.872 rows=4999067 loops=1)
           Sort Key: visit_date
           Sort Method: external merge Disk: 107504kB
           -> Seq Scan on visits (actual time=0.046..581.672 rows=4999067 loops=1)

Avoiding sorts

Might be better to sort each row's array separately

    SELECT visit_date,
           (
               SELECT array_agg(i ORDER BY i)
               FROM unnest(visitors_u) i
           )
    FROM (
        SELECT visit_date,
               array_agg(visitor_id) visitors_u
        FROM visits
        GROUP BY visit_date
    ) _

    Subquery Scan on _ (actual time=2504.915..3767.300 rows=10000 loops=1)
      -> HashAggregate (actual time=2504.757..2555.038 rows=10000 loops=1)
           -> Seq Scan on visits (actual time=0.056..397.859 rows=4999067 loops=1)
      SubPlan 1
        -> Aggregate (actual time=0.120..0.121 rows=1 loops=10000)
             -> Function Scan on unnest i (actual time=0.033..0.055 rows=500 loops=10000)

Summation

PostgreSQL has several sum functions; the three most commonly used are:
- sum(int) returns bigint
- sum(bigint) returns numeric: SLOW (needs to convert every input value)
- sum(numeric) returns numeric

Do not use bigint as the datatype for a value to be summed, prefer numeric. By the way, small numeric values take fewer bytes on disk than bigint.

It might be worth writing a custom aggregate function sum(bigint) returns bigint ...
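
A minimal sketch of such an aggregate, assuming the built-in int8pl function (the function behind bigint + bigint) is acceptable as the transition function. The name sum_bigint is invented here. Unlike the built-in sum(bigint), it raises an out-of-range error if the total does not fit into bigint.

```sql
-- Hypothetical aggregate name; int8pl is strict, so with no INITCOND
-- the first non-null input becomes the initial state and NULLs are skipped.
CREATE AGGREGATE sum_bigint (bigint) (
    SFUNC = int8pl,
    STYPE = bigint
);

SELECT sum_bigint(column1)
FROM (VALUES (2::bigint), (3), (7)) _;
```

As with the built-in sum, the result is NULL when there are no non-null inputs.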

Summation

Straightforward solution, to be used if there are few zero values:

    SELECT sum(cat_cnt)
    FROM cities

Can be up to 7 times faster. Worth considering if >50% zeroes:

    SELECT coalesce(sum(tiger_cnt), 0)
    FROM cities
    WHERE tiger_cnt <> 0

Can help only if the type is numeric and we cannot filter out:

    SELECT coalesce(sum(nullif(tiger_cnt, 0)), 0),
           sum(cat_cnt)
    FROM cities

Summation

Better in any case to replace all zeroes with nulls:

    UPDATE cities
    SET cat_cnt = nullif(cat_cnt, 0),
        tiger_cnt = nullif(tiger_cnt, 0);
    VACUUM FULL cities;

Additionally, this will dramatically reduce the space occupied.

Denormalized data aggregation

Sometimes we need to aggregate denormalized data

Most common solution is

    SELECT account_id,
           account_name,
           sum(payment_amount)
    FROM payments
    GROUP BY account_id,
             account_name

The planner does not know that account_id and account_name correlate. This can lead to wrong estimates and a suboptimal plan.

A bit less-known approach is

    SELECT account_id,
           min(account_name),
           sum(payment_amount)
    FROM payments
    GROUP BY account_id

Works only if the type of the "denormalized payload" supports comparison operators.

Also we can write a custom aggregate function

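    -- STRICT: the first non-null input becomes the initial state,
    -- and the function then keeps returning that state
    CREATE FUNCTION frst (text, text)
    RETURNS text IMMUTABLE STRICT LANGUAGE sql AS
    $$ select $1; $$;

    CREATE AGGREGATE a (text) (
        SFUNC = frst,
        STYPE = text
    );

    SELECT account_id,
           a(account_name),
           sum(payment_amount)
    FROM payments
    GROUP BY account_id

Or even write it in C

    SELECT account_id,
           anyold(account_name),
           sum(payment_amount)
    FROM payments
    GROUP BY account_id

Sorry, no source code for anyold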

And what is the fastest?

It depends on the width of the "denormalized payload":

    Width   1        10       100      1000     10000
    dumb    366ms    374ms    459ms    1238ms   53236ms
    min     375ms    377ms    409ms    716ms    16747ms
    SQL     1970ms   1975ms   2031ms   2446ms   2036ms*
    C       385ms    385ms    408ms    659ms    436ms*

* The more data, the faster we proceed? It is because we do not need to extract TOASTed values.

Arg-maximum

Max
- Population of the largest city in each country
- Date of last tweet by each author
- The highest salary in each department

Arg-max
- What is the largest city in each country
- What is the last tweet by each author
- Who gets the highest salary in each department

Arg-maximum

Max is built-in. How to perform Arg-max?
- Self-joins?
- Window functions?
- Use DISTINCT ON() (PG-specific, not in SQL standard)

    SELECT DISTINCT ON (author_id)
           author_id,
           twit_id
    FROM twits
    ORDER BY author_id,
             twit_date DESC

But it can still be performed only by sorting, not by hashing :(
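
For comparison, the window-function variant of arg-max could look roughly like this. The temporary twits table and its three rows are invented stand-ins for the table on the slides. This form also needs its input sorted by author_id and twit_date, so it does not escape the sort either.

```sql
-- Stand-in data (assumed columns, invented rows).
CREATE TEMP TABLE twits (twit_id int, author_id int, twit_date date);
INSERT INTO twits VALUES
    (10, 1, '2016-01-01'),
    (11, 1, '2016-01-05'),
    (20, 2, '2016-01-03');

-- Number each author's twits from newest to oldest and keep the newest.
SELECT author_id,
       twit_id
FROM (
    SELECT author_id,
           twit_id,
           row_number() OVER (PARTITION BY author_id
                              ORDER BY twit_date DESC) AS rn
    FROM twits
) _
WHERE rn = 1;
```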

Arg-maximum

We can emulate Arg-max by ordinary max and dirty hacks

    SELECT author_id,
           (max(array[
               twit_date,
               date'epoch' + twit_id
           ]))[2] - date'epoch'
    FROM twits
    GROUP BY author_id;

But such type tweaking is not always possible.

Arg-maximum

It’s time to write more custom aggregate functions

    CREATE TYPE amax_ty AS (key_date date, payload int);

    CREATE FUNCTION amax_t (p_state amax_ty, p_key_date date, p_payload int)
    RETURNS amax_ty IMMUTABLE LANGUAGE sql AS
    $$
    SELECT CASE WHEN p_state.key_date < p_key_date
                  OR (p_key_date IS NOT NULL AND p_state.key_date IS NULL)
                THEN (p_key_date, p_payload)::amax_ty
                ELSE p_state END
    $$;

    CREATE FUNCTION amax_f (p_state amax_ty) RETURNS int IMMUTABLE LANGUAGE sql AS
    $$ SELECT p_state.payload $$;

    CREATE AGGREGATE amax (date, int) (
        SFUNC = amax_t,
        STYPE = amax_ty,
        FINALFUNC = amax_f,
        INITCOND = '(,)'
    );

    SELECT author_id,
           amax(twit_date, twit_id)
    FROM twits
    GROUP BY author_id;

Arg-maximum

argmax is similar to amax, but written in C

    SELECT author_id,
           argmax(twit_date, twit_id)
    FROM twits
    GROUP BY author_id;

Arg-maximum

Who wins now?

    Rows          100^2   333^2   1000^2   3333^2    5000^2
    DISTINCT ON   6ms     42ms    342ms    10555ms   30421ms
    Max(array)    5ms     47ms    399ms    4464ms    10025ms
    SQL amax      38ms    393ms   3541ms   39539ms   90164ms
    C argmax      5ms     37ms    288ms    3183ms    7176ms

SQL amax finally outperforms DISTINCT ON at 10^9-ish rows

Still slow?

Slow max, arg-max or distinct query?
Sometimes we can fetch the rows one by one using an index:

    CREATE INDEX ON twits(author_id, twit_date DESC);

    -- for the very first author_id fetch the row with latest date
    SELECT twit_id,
           twit_date,
           author_id
    FROM twits
    ORDER BY author_id,
             twit_date DESC
    LIMIT 1;

    -- find the next author_id and fetch the row with latest date
    SELECT twit_id,
           twit_date,
           author_id
    FROM twits
    WHERE author_id > ?
    ORDER BY author_id,
             twit_date DESC
    LIMIT 1;

    CREATE FUNCTION f1by1() RETURNS TABLE (o_twit_id int, o_twit_date date) AS $$
    DECLARE
        l_author_id int := -1; -- to make the code a bit more simple
    BEGIN
        LOOP
            SELECT twit_id,
                   twit_date,
                   author_id
            INTO o_twit_id,
                 o_twit_date,
                 l_author_id
            FROM twits
            WHERE author_id > l_author_id
            ORDER BY author_id,
                     twit_date DESC
            LIMIT 1;

            EXIT WHEN NOT FOUND;
            RETURN NEXT;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    SELECT * FROM f1by1();

Still slow?

Let us use pure SQL instead; it is a bit faster, as usual

    WITH RECURSIVE d AS (
        (
            SELECT array[author_id, twit_id] ids
            FROM twits
            ORDER BY author_id,
                     twit_date DESC
            LIMIT 1
        )
        UNION
        SELECT (
            SELECT array[t.author_id, t.twit_id]
            FROM twits t
            WHERE t.author_id > d.ids[1]
            ORDER BY t.author_id,
                     t.twit_date DESC
            LIMIT 1
        ) q
        FROM d
    )
    SELECT d.ids[1] author_id,
           d.ids[2] twit_id
    FROM d;

Still slow?

One-by-one retrieval by index
+ Incredibly fast unless it returns too many rows
− Needs an index
− SQL version needs tricks if the data types differ

    Authors × Twits-per-author   10^6 × 10^1   10^5 × 10^2   10^4 × 10^3   10^2 × 10^5
    C argmax                     3679ms        3081ms        2881ms        2859ms
    1-by-1 proc                  12750ms       1445ms        152ms         2ms
    1-by-1 SQL                   6250ms        906ms         137ms         2ms

    Rows          100^2   333^2   1000^2   3333^2   5000^2
    1-by-1 proc   2ms     6ms     12ms     42ms     63ms
    1-by-1 SQL    1ms     4ms     11ms     29ms     37ms

Still slow?

Slow HashAggregate?

Use the parallel aggregation extension:
http://www.cybertec.at/en/products/agg-parallel-aggregations-postgresql/

+ Up to 30 times faster
+ Speeds up SeqScan as well
− Mostly useful for complex row operations
− Requires PG 9.5+
− No magic: it loads up several of your cores

Still slow?

Slow count(DISTINCT ...)?

Use HyperLogLog: a reliable and efficient approximate algorithm
https://en.wikipedia.org/wiki/HyperLogLog
https://github.com/aggregateknowledge/postgresql-hll

Or fetch approximate values from pg_stats
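
A sketch of reading such an estimate, assuming the statistics are fresh (ANALYZE has run). The temporary visits table and its generated rows are stand-ins invented for this example. Note that pg_stats describes a whole column, so this approximates count(DISTINCT visitor_id) over the entire table, not per group.

```sql
-- Stand-in data (assumed columns, generated rows).
CREATE TEMP TABLE visits (visit_date date, visitor_id int);
INSERT INTO visits
SELECT date '2016-01-01' + (i % 30), i % 1000
FROM generate_series(1, 100000) i;
ANALYZE visits;

-- n_distinct >= 0: estimated number of distinct values
-- n_distinct <  0: minus the estimated fraction of rows that are distinct
SELECT CASE WHEN s.n_distinct >= 0 THEN s.n_distinct
            ELSE -s.n_distinct * c.reltuples END AS approx_distinct
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
JOIN pg_namespace n ON n.oid = c.relnamespace
                   AND n.nspname = s.schemaname
WHERE c.oid = 'visits'::regclass
  AND s.attname = 'visitor_id';
```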

Still slow?

Slow in typing? ;)

    SELECT department_id,
           avg(salary)
    FROM employees
    GROUP BY 1 -- same as GROUP BY department_id

    SELECT count(*)
    FROM employees
    GROUP BY true -- same as HAVING count(*) > 0

    -- or use MySQL
    SELECT account_id,
           account_name,
           sum(payment_amount)
    FROM payments
    GROUP BY 1

Questions?
