Really Big Elephants: PostgreSQL DW

Really Big Elephants: Data Warehousing with PostgreSQL
Josh Berkus, MySQL User Conference 2011

Transcript of Really Big Elephants: PostgreSQL DW

Page 1: Really Big Elephants: PostgreSQL DW

Really Big Elephants

Data Warehousing with PostgreSQL

Josh Berkus, MySQL User Conference 2011

Page 2: Really Big Elephants: PostgreSQL DW

Included/Excluded

I will cover:
● advantages of Postgres for DW
● configuration
● tablespaces
● ETL/ELT
● windowing
● partitioning
● materialized views

I won't cover:
● hardware selection
● EAV / blobs
● denormalization
● DW query tuning
● external DW tools
● backups & upgrades

Page 3: Really Big Elephants: PostgreSQL DW

What is a “data warehouse”?

Page 4: Really Big Elephants: PostgreSQL DW

synonyms etc.
● Business Intelligence (also BI/DW)
● Analytics database
● OnLine Analytical Processing (OLAP)
● Data Mining
● Decision Support

Page 5: Really Big Elephants: PostgreSQL DW

OLTP vs DW

OLTP:
● many single-row writes
● current data
● queries generated by user activity
● < 1s response times
● 0.5 to 5x RAM

DW:
● few large batch imports
● years of data
● queries generated by large reports
● queries can run for hours
● 5x to 2000x RAM

Page 6: Really Big Elephants: PostgreSQL DW

OLTP vs DW

OLTP:
● 100 to 1000 users
● constraints

DW:
● 1 to 10 users
● no constraints

Page 7: Really Big Elephants: PostgreSQL DW

Why use PostgreSQL for

data warehousing?

Page 8: Really Big Elephants: PostgreSQL DW

Complex Queries

SELECT
  CASE WHEN ((SUM(inventory.closed_on_hand) + SUM(changes.received)
              + SUM(changes.adjustments)
              + SUM(changes.transferred_in - changes.transferred_out)) <> 0)
       THEN ROUND((CAST(SUM(changes.sold_and_closed + changes.returned_and_closed) AS numeric) * 100)
                  / CAST(SUM(starting.closed_on_hand) + SUM(changes.received)
                         + SUM(changes.adjustments)
                         + SUM(changes.transferred_in - changes.transferred_out) AS numeric), 5)
       ELSE 0 END AS "Percent_Sold",
  CASE WHEN (SUM(changes.sold_and_closed) <> 0)
       THEN ROUND(100 * ((SUM(changes.closed_markdown_units_sold) * 1.0)
                         / SUM(changes.sold_and_closed)), 5)
       ELSE 0 END AS "Percent_of_Units_Sold_with_Markdown",
  CASE WHEN (SUM(changes.sold_and_closed * _sku.retail_price) <> 0)
       THEN ROUND(100 * (SUM(changes.closed_markdown_dollars_sold) * 1.0)
                  / SUM(changes.sold_and_closed * _sku.retail_price), 5)
       ELSE 0 END AS "Markdown_Percent",
  '0' AS "Percent_of_Total_Sales",
  CASE WHEN SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price) IS NULL
       THEN 0
       ELSE SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price)
       END AS "Net_Sales_at_Retail",
  '0' AS "Percent_of_Ending_Inventory_at_Retail",
  SUM(inventory.closed_on_hand * _sku.retail_price) AS "Ending_Inventory_at_Retail",
  "_store"."label" AS "Store",
  "_department"."label" AS "Department",
  "_vendor"."name" AS "Vendor_Name"
FROM inventory
JOIN inventory AS starting
  ON inventory.warehouse_id = starting.warehouse_id
  AND inventory.sku_id = starting.sku_id
LEFT OUTER JOIN (
  SELECT warehouse_id, sku_id,
         sum(received) AS received,
         sum(transferred_in) AS transferred_in,
         sum(transferred_out) AS transferred_out,
         sum(adjustments) AS adjustments,
         sum(sold) AS sold
  FROM movement
  WHERE movement.movement_date BETWEEN '2010-08-05' AND '2010-08-19'
  GROUP BY sku_id, warehouse_id
) AS changes
  ON inventory.warehouse_id = changes.warehouse_id
  AND inventory.sku_id = changes.sku_id
JOIN _sku ON _sku.id = inventory.sku_id
JOIN _warehouse ON _warehouse.id = inventory.warehouse_id
JOIN _location_hierarchy AS _store
  ON _store.id = _warehouse.store_id AND _store.type = 'Store'
JOIN _product ON _product.id = _sku.product_id
JOIN _merchandise_hierarchy AS _department
  ON _department.id = _product.department_id
  AND _department.type = 'Department'
JOIN _vendor AS _vendor ON _vendor.id = _sku.vendor_id

Page 9: Really Big Elephants: PostgreSQL DW

Complex Queries
● JOIN optimization
  ● 5 different JOIN types
  ● approximate planning for 20+ table joins
● subqueries in any clause
  ● plus nested subqueries
● windowing queries
● recursive queries (example below)
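For instance, a recursive query can walk a hierarchy in a single statement; a minimal sketch, using a hypothetical staff table:

WITH RECURSIVE subordinates AS (
  SELECT id, manager_id, name
  FROM staff WHERE id = 1              -- start at one manager
  UNION ALL
  SELECT s.id, s.manager_id, s.name
  FROM staff s
  JOIN subordinates sub ON s.manager_id = sub.id
)
SELECT * FROM subordinates;            -- the whole reporting tree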

Page 10: Really Big Elephants: PostgreSQL DW

Big Data Features
● big tables: partitioning
● big databases: tablespaces
● big backups: PITR
● big updates: binary replication
● big queries: resource control

Page 11: Really Big Elephants: PostgreSQL DW

Extensibility
● add data analysis functionality from external libraries inside the database
  ● financial analysis
  ● genetic sequencing
  ● approximate queries
● create your own (example below):
  ● data types, functions
  ● aggregates, operators
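As a taste of "create your own", here is a sketch of a custom aggregate; product() is a hypothetical name, built on PostgreSQL's existing numeric_mul function:

CREATE AGGREGATE product (numeric) (
  sfunc    = numeric_mul,   -- multiply the running state by each row
  stype    = numeric,
  initcond = '1'
);

-- usage: SELECT product(growth_factor) FROM monthly_growth;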

Page 12: Really Big Elephants: PostgreSQL DW

Community

● lots of experience with large databases● blogs, tools, online help

“I'm running a partitioning scheme using 256 tables with a maximum of 16 million rows (namely IPv4-addresses) and a current total of about 2.5 billion rows, there are no deletes though, but lots of updates.”

“I use PostgreSQL basically as a data warehouse to store all the genetic data that our lab generates … With this configuration I figure I'll have ~3TB for my main data tables and 1TB for indexes. ”


Page 13: Really Big Elephants: PostgreSQL DW

Sweet Spot

[bar chart: overlapping “sweet spot” ranges for MySQL, PostgreSQL, and dedicated DW databases, plotted on a 0 to 30 scale]

Page 14: Really Big Elephants: PostgreSQL DW

DW Databases
● Vertica
● Greenplum
● Aster Data
● Infobright
● Teradata
● Hadoop/HBase
● Netezza
● HadoopDB
● LucidDB
● MonetDB
● SciDB
● Paraccel


Page 16: Really Big Elephants: PostgreSQL DW

How do I configure PostgreSQL for

data warehousing?

Page 17: Really Big Elephants: PostgreSQL DW

General Setup
● latest version of PostgreSQL
● system with lots of drives
  ● 6 to 48 drives, or 2 to 12 SSDs
● high-throughput RAID
● write-ahead log (WAL) on separate disk(s)
  ● 10 to 50 GB space

Page 18: Really Big Elephants: PostgreSQL DW

separate the DW workload
onto its own server

Page 19: Really Big Elephants: PostgreSQL DW

Settings

few connections:
max_connections = 10 to 40

raise those memory limits!
shared_buffers = 1/8 to 1/4 of RAM
work_mem = 128MB to 1GB
maintenance_work_mem = 512MB to 1GB
temp_buffers = 128MB to 1GB
effective_cache_size = 3/4 of RAM
wal_buffers = 16MB
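Put together, a postgresql.conf for a hypothetical dedicated 64GB DW server might look like this; a sketch following the ranges above, not a universal recipe:

# assumes 64GB RAM, DW workload only
max_connections      = 20
shared_buffers       = 8GB     # 1/8 of RAM
work_mem             = 512MB   # per sort/hash; safe with few connections
maintenance_work_mem = 1GB
temp_buffers         = 256MB
effective_cache_size = 48GB    # 3/4 of RAM
wal_buffers          = 16MB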

Page 20: Really Big Elephants: PostgreSQL DW

No autovacuum

autovacuum = off
vacuum_cost_delay = 0

● do your VACUUMs and ANALYZEs as part of the batch load process (sketch below)
  ● usually several of them
● also maintain tables by partitioning
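A minimal sketch of that batch-load maintenance step, with hypothetical table names:

-- at the tail of each nightly load:
VACUUM ANALYZE sales_2011_06;   -- the partition just written
ANALYZE sales;                  -- refresh parent/planner statistics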

Page 21: Really Big Elephants: PostgreSQL DW

What are tablespaces?

Page 22: Really Big Elephants: PostgreSQL DW

logical data extents
● lets you put some of your data on specific devices / disks

CREATE TABLESPACE history_log
  LOCATION '/mnt/san2/history_log';

ALTER TABLE history_log
  SET TABLESPACE history_log;

Page 23: Really Big Elephants: PostgreSQL DW

tablespace reasons
● parallelize access
  ● your largest “fact table” on one tablespace
  ● its indexes on another
  ● not as useful if you have a good SAN
● temp tablespace for temp tables (sketch below)
● move key join tables to SSD
● migrate to new storage one table at a time
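A sketch of the temp-tablespace idea; the path and name are hypothetical:

CREATE TABLESPACE temp_space LOCATION '/mnt/ssd1/pgtemp';

-- send temp tables and sort spill files to the fast disk:
SET temp_tablespaces = 'temp_space';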

Page 24: Really Big Elephants: PostgreSQL DW

What is ETL and how do I do it?

Page 25: Really Big Elephants: PostgreSQL DW

Extract, Transform, Load
● how you turn external raw data into normalized database data
  ● Apache logs → web analytics DB
  ● CSV POS files → financial reporting DB
  ● OLTP server → 10-year data warehouse
● also called ELT when the transformation is done inside the database
  ● PostgreSQL is particularly good for ELT

Page 26: Really Big Elephants: PostgreSQL DW

L: INSERT
● batch INSERTs into 100's or 1000's per transaction (sketch below)
  ● row-at-a-time is very slow
● create and load import tables in one transaction
● add indexes and constraints after load
● insert several streams in parallel
  ● but not more than CPU cores
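A minimal sketch of that pattern; import_sales and its rows are hypothetical, shaped like the sales table used later in this talk:

BEGIN;
CREATE TABLE import_sales (LIKE sales);
INSERT INTO import_sales VALUES
  ('2011-06-01 09:14', 101, 40011, 19.99, 'walk-in'),
  ('2011-06-01 09:20', 102, 40012, 5.49, 'phone order');
  -- ...hundreds to thousands of rows per INSERT statement
COMMIT;

-- indexes go on after the data is in:
CREATE INDEX import_sales_date ON import_sales (sell_date);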

Page 27: Really Big Elephants: PostgreSQL DW

L: COPY
● powerful, efficient delimited file loader
  ● almost bug-free - we use it for backup
  ● 3-5x faster than inserts
  ● works with most delimited files
● not fault-tolerant
  ● also have to know structure in advance
  ● try pgloader for better COPY

Page 28: Really Big Elephants: PostgreSQL DW

L: COPY

COPY weblog_new
FROM '/mnt/transfers/weblogs/weblog-20110605.csv'
WITH csv;

COPY traffic_snapshot
FROM 'traffic_20110605192241'
DELIMITER '|' NULL AS 'N';

\copy weblog_summary_june TO 'Desktop/weblog-june2011.csv' with csv header

Page 29: Really Big Elephants: PostgreSQL DW

L: in 9.1: FDW

-- one-time setup, assuming a server named file_fdw:
-- CREATE EXTENSION file_fdw;
-- CREATE SERVER file_fdw FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE raw_hits (
  hit_time TIMESTAMP,
  page     TEXT )
SERVER file_fdw
OPTIONS (format 'csv', delimiter ';',
         filename '/var/log/hits.log');

Page 30: Really Big Elephants: PostgreSQL DW

L: in 9.1: FDW

CREATE TABLE hits_2011041617 AS
SELECT page, count(*)
FROM raw_hits
WHERE hit_time > '2011-04-16 16:00:00'
  AND hit_time <= '2011-04-16 17:00:00'
GROUP BY page;

Page 31: Really Big Elephants: PostgreSQL DW

T: temporary tables

CREATE TEMPORARY TABLE sales_records_june_rollup
ON COMMIT DROP AS
SELECT seller_id, location, sell_date,
       sum(sale_amount), array_agg(item_id)
FROM raw_sales
WHERE sell_date BETWEEN '2011-06-01'
                    AND '2011-06-30 23:59:59.999'
GROUP BY seller_id, location, sell_date;

Page 32: Really Big Elephants: PostgreSQL DW

in 9.1: unlogged tables
● like MyISAM without the risk

CREATE UNLOGGED TABLE cleaned_log_import AS
SELECT hit_time, page
FROM raw_hits, hit_watermark
WHERE hit_time > last_watermark
  AND is_valid(page);

Page 33: Really Big Elephants: PostgreSQL DW

T: stored procedures
● multiple languages
  ● SQL, PL/pgSQL
  ● PL/Perl, PL/Python, PL/PHP
  ● PL/R, PL/Java
● allows you to use external data processing libraries in the database
● custom aggregates, operators, more

Page 34: Really Big Elephants: PostgreSQL DW

CREATE OR REPLACE FUNCTION normalize_query ( queryin text )
RETURNS TEXT LANGUAGE plperl STABLE STRICT AS $f$
  # this function "normalizes" queries by stripping out constants.
  # some regexes by Guillaume Smet under The PostgreSQL License.
  local $_ = $_[0];
  # first clean up the whitespace
  s/\s+/ /g; s/\s,/,/g; s/,(\S)/, $1/g; s/^\s//g; s/\s$//g;
  # remove any double quotes and quoted text
  s/\\'//g; s/'[^']*'/''/g; s/''('')+/''/g;
  # remove TRUE and FALSE
  s/(\W)TRUE(\W)/$1BOOL$2/gi; s/(\W)FALSE(\W)/$1BOOL$2/gi;
  # remove any bare numbers or hex numbers
  s/([^a-zA-Z_\$-])-?([0-9]+)/${1}0/g;
  s/([^a-z_\$-])0x[0-9a-f]{1,10}/${1}0x/ig;
  # normalize any IN statements
  s/(IN\s*)\([\'0x,\s]*\)/${1}(...)/ig;
  # return the normalized query
  return $_;
$f$;

Page 35: Really Big Elephants: PostgreSQL DW

CREATE OR REPLACE FUNCTION f_graph2() RETURNS text AS '
  sql <- paste("SELECT id as x, hit as y FROM mytemp LIMIT 30", sep="");
  str <- c(pg.spi.exec(sql));
  mymain <- "Graph 2";
  mysub <- paste("The worst offender is: ", str[1,3], " with ",
                 str[1,2], " hits", sep="");
  myxlab <- "Top 30 IP Addresses";
  myylab <- "Number of Hits";
  pdf(''/tmp/graph2.pdf'');
  plot(str, type="b", main=mymain, sub=mysub,
       xlab=myxlab, ylab=myylab, lwd=3);
  mtext("Probes by intrusive IP Addresses", side=3);
  dev.off();
  print(''DONE'');
' LANGUAGE plr;

Page 36: Really Big Elephants: PostgreSQL DW
Page 37: Really Big Elephants: PostgreSQL DW

ELT Tips
● bulk insert into a new table instead of updating/deleting an existing table
● update all columns in one operation instead of one at a time (sketch below)
● use views and custom functions to simplify your queries
● inserting into your long-term tables should be the very last step - no updates after!
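A sketch of the one-pass update tip; staging_sales and the particular cleanups are hypothetical:

-- one scan of the table, not three:
UPDATE staging_sales
SET seller_id   = COALESCE(seller_id, 0),
    sale_amount = round(sale_amount, 2),
    narrative   = trim(narrative);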

Page 38: Really Big Elephants: PostgreSQL DW

What's a windowing query?

Page 39: Really Big Elephants: PostgreSQL DW

regular aggregate

Page 40: Really Big Elephants: PostgreSQL DW

windowing function
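The contrast between those two slides in query form, using the sales table from the partitioning example later in this talk:

-- regular aggregate: collapses rows to one per group
SELECT seller_id, sum(sale_amount)
FROM sales GROUP BY seller_id;

-- window function: keeps every row and adds a running total
SELECT seller_id, sale_amount,
       sum(sale_amount) OVER (PARTITION BY seller_id
                              ORDER BY sell_date)
FROM sales;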

Page 41: Really Big Elephants: PostgreSQL DW

CREATE TABLE events (
  event_id   INT,
  event_type TEXT,
  start      TIMESTAMPTZ,
  duration   INTERVAL,
  event_desc TEXT
);

Page 42: Really Big Elephants: PostgreSQL DW

SELECT MAX(concurrent)
FROM (
  SELECT SUM(tally) OVER (ORDER BY start) AS concurrent
  FROM (
    SELECT start, 1::INT AS tally FROM events
    UNION ALL
    SELECT (start + duration), -1 FROM events
  ) AS event_vert
) AS ec;

Page 43: Really Big Elephants: PostgreSQL DW

UPDATE partition_name SET drop_month = dropit
FROM (
  SELECT round_id,
    CASE WHEN ( ( row_number() OVER (PARTITION BY team_id
                    ORDER BY team_id, total_points) )
           <= ( drop_lowest ) )
      THEN 0 ELSE 1 END AS dropit
  FROM (
    SELECT team.team_id, round.round_id,
      month_points AS total_points,
      row_number() OVER (
        PARTITION BY team.team_id, kal.positions
        ORDER BY team.team_id, kal.positions, month_points DESC
      ) AS ordinal,
      at_least, numdrop AS drop_lowest
    FROM partition_name AS rdrop
      JOIN round USING (round_id)
      JOIN team USING (team_id)
      JOIN pick ON round.round_id = pick.round_id
        AND pick.pick_period @> this_period
      LEFT OUTER JOIN keep_at_least kal
        ON rdrop.pool_id = kal.pool_id
        AND pick.position_id = ANY ( kal.positions )
    WHERE rdrop.pool_id = this_pool
      AND team.team_id = this_team
  ) AS ranking
  WHERE ordinal > at_least OR at_least IS NULL
) AS droplow
WHERE droplow.round_id = partition_name.round_id
  AND partition_name.pool_id = this_pool
  AND dropit = 0;

Page 44: Really Big Elephants: PostgreSQL DW

SELECT round_id,
  CASE WHEN ( ( row_number() OVER (PARTITION BY team_id
                  ORDER BY team_id, total_points) )
         <= ( drop_lowest ) )
    THEN 0 ELSE 1 END AS dropit
FROM (
  SELECT team.team_id, round.round_id,
    month_points AS total_points,
    row_number() OVER (
      PARTITION BY team.team_id, kal.positions
      ORDER BY team.team_id, kal.positions,
               month_points DESC
    ) AS ordinal

Page 45: Really Big Elephants: PostgreSQL DW

stream processing SQL
● replace multiple queries with a single query
  ● avoid scanning large tables multiple times
● replace pages of application code
  ● and MB of data transmission
● SQL alternative to map/reduce
  ● (for some data mining tasks)

Page 46: Really Big Elephants: PostgreSQL DW

How do I partition my tables?

Page 47: Really Big Elephants: PostgreSQL DW

Postgres partitioning
● based on table inheritance and constraint exclusion
● partitions are also full tables
● explicit constraints define the range of the partition
● triggers or RULEs handle insert/update

Page 48: Really Big Elephants: PostgreSQL DW

CREATE TABLE sales (
  sell_date   TIMESTAMPTZ NOT NULL,
  seller_id   INT NOT NULL,
  item_id     INT NOT NULL,
  sale_amount NUMERIC NOT NULL,
  narrative   TEXT );

Page 49: Really Big Elephants: PostgreSQL DW

CREATE TABLE sales_2011_06 (
  CONSTRAINT partition_date_range
    CHECK ( sell_date >= '2011-06-01'
        AND sell_date <  '2011-07-01' )
) INHERITS ( sales );

Page 50: Really Big Elephants: PostgreSQL DW

CREATE FUNCTION sales_insert ()
RETURNS trigger LANGUAGE plpgsql AS $f$
BEGIN
  CASE
    WHEN NEW.sell_date < '2011-06-01' THEN
      INSERT INTO sales_2011_05 VALUES (NEW.*);
    WHEN NEW.sell_date < '2011-07-01' THEN
      INSERT INTO sales_2011_06 VALUES (NEW.*);
    WHEN NEW.sell_date >= '2011-07-01' THEN
      INSERT INTO sales_2011_07 VALUES (NEW.*);
    ELSE
      INSERT INTO sales_overflow VALUES (NEW.*);
  END CASE;
  RETURN NULL;
END;
$f$;

CREATE TRIGGER sales_insert BEFORE INSERT ON sales
FOR EACH ROW EXECUTE PROCEDURE sales_insert();
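With the CHECK constraints in place, constraint exclusion lets the planner skip irrelevant partitions; a quick sanity check might look like this (a sketch):

SET constraint_exclusion = partition;  -- the default since 8.4
EXPLAIN SELECT * FROM sales
WHERE sell_date >= '2011-06-01'
  AND sell_date <  '2011-06-15';
-- the plan should touch only sales and sales_2011_06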

Page 51: Really Big Elephants: PostgreSQL DW

Postgres partitioning

Good for:
● “rolling off” data
● DB maintenance
● queries which use the partition key
● under 300 partitions
● insert performance

Bad for:
● administration
● queries which do not use the partition key
● JOINs
● over 300 partitions
● update performance

Page 52: Really Big Elephants: PostgreSQL DW

you need a data expiration policy

● you can't plan your DW otherwise
  ● sets your storage requirements
  ● lets you project how queries will run when the database is “full”
● will take a lot of meetings
  ● people don't like talking about deleting data

Page 53: Really Big Elephants: PostgreSQL DW

you need a data expiration policy

● raw import data: 1 month
● detail-level transactions: 3 years
● detail-level web logs: 1 year
● rollups: 10 years
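Partitioning is what makes such a policy cheap to enforce; expiring a month becomes a metadata operation (partition name hypothetical):

-- June 2008 detail data has aged past the three-year policy:
DROP TABLE sales_2008_06;  -- instant; no DELETE scan, no VACUUM debt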

Page 54: Really Big Elephants: PostgreSQL DW

What's a materialized view?

Page 55: Really Big Elephants: PostgreSQL DW

query results as table
● calculate once, read many times
  ● complex/expensive queries
  ● frequently referenced
● not necessarily a whole query
  ● often part of a query
● manually maintained in PostgreSQL
  ● automagic support not complete yet

Page 56: Really Big Elephants: PostgreSQL DW

SELECT page, COUNT(*) AS total_hits
FROM hit_counter
WHERE date_trunc('day', hit_date)
  BETWEEN now() - INTERVAL '7 days' AND now()
GROUP BY page
ORDER BY total_hits DESC LIMIT 10;

Page 57: Really Big Elephants: PostgreSQL DW

CREATE TABLE page_hits (
  page       TEXT,
  hit_day    DATE,
  total_hits INT,
  CONSTRAINT page_hits_pk PRIMARY KEY (hit_day, page)
);

Page 58: Really Big Elephants: PostgreSQL DW

each day:

INSERT INTO page_hits
SELECT page,
       date_trunc('day', hit_date) AS hit_day,
       COUNT(*) AS total_hits
FROM hit_counter
WHERE date_trunc('day', hit_date)
    = date_trunc('day', now() - INTERVAL '1 day')
GROUP BY page, date_trunc('day', hit_date);

Page 59: Really Big Elephants: PostgreSQL DW

SELECT page, total_hits
FROM page_hits
WHERE hit_day BETWEEN now() - INTERVAL '7 days'
                  AND now();

Page 60: Really Big Elephants: PostgreSQL DW

maintaining matviews

BEST: update matviews at batch load time

GOOD: update matviews according to clock/calendar

BAD for DW: update matviews using a trigger

Page 61: Really Big Elephants: PostgreSQL DW

matview tips
● matviews should be small
  ● 1/10 to 1/4 of RAM
● each matview should support several queries
  ● or one really really important one
● truncate + insert, don't update (sketch below)
● index matviews like crazy
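The truncate + insert advice as a refresh sketch, reusing the page_hits example from earlier:

BEGIN;
TRUNCATE page_hits;                 -- cheap: no per-row deletes
INSERT INTO page_hits
SELECT page,
       date_trunc('day', hit_date)::date AS hit_day,
       COUNT(*) AS total_hits
FROM hit_counter
GROUP BY page, date_trunc('day', hit_date)::date;
COMMIT;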

Page 62: Really Big Elephants: PostgreSQL DW

Contact
● Josh Berkus: [email protected]
● blog: blogs.ittoolbox.com/database/soup
● PostgreSQL: www.postgresql.org
● pgexperts: www.pgexperts.com
● Upcoming Events
  ● pgCon: Ottawa: May 17-20
  ● Open Source Bridge: Portland: June

This talk is copyright 2010 Josh Berkus and is licensed under the Creative Commons Attribution License. Special thanks for materials to: Elein Mustain (PL/R), Hitoshi Harada and David Fetter (windowing functions), Andrew Dunstan (file_fdw)