Introduction To Postgres
Rodrigo Menezes
• I joined Moat in 2013, when we were ~20 people
• Acquired by Oracle during summer of 2017
• Currently, we’re about ~250 people
• I started off as a frontend developer
• Why Postgres?
• Brief introduction to SQL
• ORM (Object relational mapper)
• Transactions
• Performance and indices
• Views and Materialized Views
• JSON and Postgres
• Scaling Postgres @ Moat
This Talk
Why PostgreSQL?
• Free + open source
• Really good community
• Fast bug fixes and frequent release cycle
• Amazing docs
• "Add features slowly, do them well"
• Great performance
• Ton of cool features
• SQL is great and people can pry it from my cold, dead hands
Why PostgreSQL?
What is SQL?
• SQL is a language for inserting, updating, reading and deleting data
• SQL is meant for humans (pros and cons to this)
• SQL is everywhere. Has been since the 70s
• Every collection of data is a "table"
• Every new datapoint is a "row"
SQL
postgres=# CREATE TABLE users(
    id INTEGER NOT NULL,
    email TEXT NOT NULL,
    name TEXT
);
CREATE TABLE
Time: 40.498 ms
Creating a table
postgres=# INSERT INTO users(id, email) VALUES(0, '[email protected]');
INSERT 0 1
Time: 13.561 ms
Inserting a row
postgres=# SELECT id, email, name FROM users WHERE id=0;
 id |           email            | name
----+----------------------------+------
  0 | [email protected]        |
(1 row)
Selecting a row
postgres=# SELECT * FROM users WHERE id=0;
 id |           email            | name
----+----------------------------+------
  0 | [email protected]        |
(1 row)
Selecting all columns
postgres=# select * from users where id=0;
 id |           email            | name
----+----------------------------+------
  0 | [email protected]        |
(1 row)
Case insensitive
postgres=# UPDATE users SET email='[email protected]' WHERE id=0;
UPDATE 1
Time: 12.690 ms
Updating a row
postgres=# DELETE FROM users WHERE id=0;
DELETE 1
Time: 9.671 ms
Deleting a row
postgres=# DROP TABLE users;
DROP TABLE
Time: 11.198 ms
Dropping a table
• It's great for human beings
• Standards are nice
• In the majority of cases, you want your data strongly typed (see the example below)
• Fewer accidents
• More assumptions => more performance / compression
• Compile-time errors in strongly typed languages
Why SQL?
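For instance, strong typing means bad data is rejected before it ever lands in the table. An illustrative psql session (typed_demo is a made-up table; the exact error wording varies by Postgres version):

postgres=# CREATE TABLE typed_demo(id INTEGER NOT NULL);
postgres=# INSERT INTO typed_demo(id) VALUES ('ten');
ERROR:  invalid input syntax for integer: "ten"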
ORMs (Object Relational Mappers)
# python
import psycopg2  # Library to talk to Postgres

def create_cursor():
    conn = psycopg2.connect("dbname=test user=postgres")
    return conn.cursor()

def update_email(cursor, id, new_email):
    query = (
        'UPDATE users SET email={} '
        'WHERE id = {}'
    ).format(new_email, id)
    cursor.execute(query)
You can write a raw query…
...
def update_email(cursor, id, new_email):
    query = (
        'UPDATE users SET email={} '
        'WHERE id = {}'
    ).format(new_email, id)
    cursor.execute(query)
update_email(create_cursor(), 1, 'NULL; DROP TABLE users; --')
# `--` is a SQL comment, so Postgres ignores the rest of the line
SQL Injection
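The standard fix is to keep user input out of the query text entirely and pass it as a bind parameter, which is what psycopg2 does when you pass values as a separate argument to execute(). A minimal sketch of the same idea in plain SQL, using a prepared statement (set_email is a made-up name):

postgres=# PREPARE set_email(text, integer) AS
postgres-#     UPDATE users SET email = $1 WHERE id = $2;
PREPARE
postgres=# EXECUTE set_email('NULL; DROP TABLE users; --', 1);
UPDATE 1
-- The payload is stored as a harmless string value; nothing gets dropped.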
• Library that handles all your SQL for you
• Escapes user input for you, making queries safe from injection
• Neater syntax that hides SQL complexity (which is mostly a pro but sometimes a con)
ORMs
// db.js
import { Sequelize } from 'sequelize';

const sequelize = new Sequelize({
  database: POSTGRES_DB,
  dialect: 'postgres',
  host: POSTGRES_HOSTNAME,
  password: POSTGRES_PASSWORD,
  username: POSTGRES_USER,
});
Sequelize
// models/User.js
import { Sequelize } from 'sequelize';

const User = sequelize.define('users', {
  id: {
    type: Sequelize.INTEGER,
    autoIncrement: true,
    primaryKey: true
  },
  email: {
    type: Sequelize.TEXT,
    unique: true
  }
});
Sequelize Model
// services/createUser.js
import { User } from './models/User.js';

async function createUser() {
  …
  const user = await User.create({ email });
  …
}

async function getUser() {
  …
  const user = await User.findOne({ where: { email } });
  …
}
Model usage
Transactions andconstraints
postgres=# create table users(
    id serial primary key,
    email text not null,
    money bigint not null default 0
);
Let's make a Venmo clone
postgres=# CREATE TABLE users(
    id SERIAL NOT NULL,
    email TEXT NOT NULL
);

-- Now I don't need to specify ids
postgres=# INSERT INTO users(email) VALUES ('[email protected]');

postgres=# SELECT * FROM users;
 id |           email
----+----------------------------
  0 | [email protected]
(1 row)
Aside: serials
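Under the hood, serial is just shorthand for an integer column whose default comes from a sequence; roughly (a sketch of what Postgres creates for you):

postgres=# create sequence users_id_seq;
postgres=# create table users(
postgres-#     id integer not null default nextval('users_id_seq'),
postgres-#     email text not null
postgres-# );
postgres=# alter sequence users_id_seq owned by users.id;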
-- Make sure user 1 has enough money
postgres=# select money from users where id = 1;

-- Decrease user 1's money
postgres=# update users set money = money - 20 where id = 1;

-- What happens if our webserver disconnects here?

-- Increase user 2's money
postgres=# update users set money = money + 20 where id = 2;
Sending $20 from user 1 to user 2
postgres=# begin;
postgres=# select money from users where id = 1;
postgres=# update users set money = money - 20 where id = 1;
postgres=# update users set money=money + 20 where id=2;
postgres=# commit;
Transactions
postgres=# begin;
postgres=# select money from users where id = 1;
postgres=# update users set money = money - 20 where id = 1;
postgres=# abort; -- revert all changes!
Transactions
• Transactions allow the effects of your code to happen in one go
• Achieve ACID compliance
• Atomicity – everything in the transaction happens or none of it does
• Consistency – after a transaction, the database will be in a valid state. Rules always apply.
• Isolation – a newly committed transaction shouldn't affect your current transaction (there are multiple levels of this; see the sketch below)
• Durability – once a transaction is committed, it'll remain so even in case of power loss.
Transactions
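On the isolation point: the level is chosen per transaction (a sketch; Postgres defaults to READ COMMITTED):

postgres=# begin isolation level serializable;
BEGIN
postgres=# -- statements here all see one consistent snapshot of the database
postgres=# commit;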
// wrapView.js
import sequelize from './db/sequelize';

export default function wrapView(fn) {
  return (req, res, next) => {
    sequelize.transaction(t => {
      return fn(req, res);
    }).catch(next);
  };
}

// server.ts
app.post("/api/user", wrapView(users.create));
Wrap everything in a transaction
-- Assume user 1 has $20, and two clients run this at the same time:
-- both SELECTs see $20, so both clients think the transfer is safe.

postgres=# select money from users where id = 1;
postgres=# update users set money = money - 20 where id = 1;

-- Increase user 2's money
postgres=# update users set money = money + 20 where id = 2;

-- Now user 1 has -$20
Race condition
-- Meanwhile, the second client sends $20 from user 1 to user 3:
postgres=# select money from users where id = 1;
postgres=# update users set money = money - 20 where id = 1;

-- Increase user 3's money
postgres=# update users set money = money + 20 where id = 3;
postgres=# create table users(
    id serial primary key,
    email text not null,
    money bigint not null default 0 check (money >= 0)
);
Constraints
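With that check in place, an overdraft fails loudly instead of going negative (illustrative; users_money_check is the constraint name Postgres generates by default):

postgres=# update users set money = money - 40 where id = 1;
ERROR:  new row for relation "users" violates check constraint "users_money_check"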
postgres=# begin;
postgres=# update users set money = money - 20 where id = 1; -- will error out if necessary
postgres=# update users set money=money + 20 where id=2;
postgres=# commit;
Race condition
postgres=# begin;
postgres=# update users set money = money - 20 where id = 1; -- will error out if necessary
postgres=# update users set money=money + 20 where id=3;
postgres=# commit;
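Another common fix for this race, beyond the check constraint: lock the row when you read it, so concurrent transfers from the same account serialize. A sketch using SELECT ... FOR UPDATE:

postgres=# begin;
postgres=# select money from users where id = 1 for update; -- blocks other writers to this row
postgres=# update users set money = money - 20 where id = 1;
postgres=# update users set money = money + 20 where id = 2;
postgres=# commit;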
# python
import transaction
from pyramid import testing

from moatpro.web import add_routes

class BaseTest(object):
    def setup_method(self, method):
        self.config = testing.setUp()
        add_routes(self.config)
        transaction.begin()

    def teardown_method(self, method):
        testing.tearDown()
        transaction.abort()
Transactions make for safe testing
Performance and indices
postgres=# CREATE TABLE users(
postgres-#     id SERIAL,
postgres-#     email TEXT NOT NULL
postgres-# );
postgres=# INSERT INTO users(email)
postgres-# SELECT 'test_' || id::text || '@email.com'
postgres-# FROM generate_series(0, 9999999) id;
INSERT 0 10000000
Time: 27608.651 ms
postgres=# SELECT * FROM users;
 id |        email
----+----------------------
  1 | [email protected]
  2 | [email protected]
  3 | [email protected]
...
Let's make a lot of fake data
postgres=# select * from users where email='[email protected]';
 id |       email
----+-------------------
  1 | [email protected]
(1 row)

Time: 907.094 ms
-- Can we make this faster?
How slow do things get?
postgres=# explain select * from users where email='[email protected]';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=1000.00..131603.33 rows=50000 width=36)
   Workers Planned: 2
   ->  Parallel Seq Scan on users  (cost=0.00..125603.33 rows=20833 width=36)
         Filter: (email = '[email protected]'::text)
(4 rows)
Explain query
postgres=# CREATE INDEX ON users(email);
CREATE INDEX
Time: 20291.741 ms
postgres=# explain select * from users where email='[email protected]';
                                 QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using users_email_idx on users  (cost=0.56..8.58 rows=1 width=36)
   Index Cond: (email = '[email protected]'::text)
(2 rows)
Indices
postgres=# select * from users where email='[email protected]';
 id |       email
----+-------------------
  1 | [email protected]
(1 row)

Time: 1.547 ms

-- Went from ~1s to ~1ms: a ~1000x speedup!
Indices
postgres=# CREATE UNIQUE INDEX ON users(email);

postgres=# INSERT INTO users(email) VALUES ('[email protected]');
INSERT 0 1
Time: 12.001 ms

postgres=# INSERT INTO users(email) VALUES ('[email protected]');
ERROR:  duplicate key value violates unique constraint "users_email_idx"
DETAIL:  Key (email)=([email protected]) already exists.
Time: 1.410 ms
Unique index
postgres=# CREATE TABLE users2(
    id SERIAL,
    email TEXT  -- NOT NULL
);
postgres=# CREATE UNIQUE INDEX ON users2(email);

postgres=# INSERT INTO users2(email) VALUES (null);
INSERT 0 1
Time: 21.365 ms

postgres=# INSERT INTO users2(email) VALUES (null);
INSERT 0 1
Time: 11.785 ms
Unique indices... wat?
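This is standard SQL semantics: NULL never compares equal to NULL, so a unique index treats every NULL as distinct. You can see it directly; the comparison below returns NULL, which psql prints as an empty cell:

postgres=# SELECT null = null;
 ?column?
----------

(1 row)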
postgres=# CREATE TABLE players(
postgres-#     id SERIAL PRIMARY KEY,
postgres-#     team TEXT NOT NULL,
postgres-#     jersey INTEGER NOT NULL,
postgres-#     first_name TEXT NOT NULL,
postgres-#     last_name TEXT NOT NULL
postgres-# );

postgres=# CREATE UNIQUE INDEX ON players(team, jersey);
You can do indices on multiple columns
postgres=# \d players
Table "public.players" Column | Type | Modifiers ------------+---------+------------------------------------------------------ id | integer | not null default nextval('players_id_seq'::regclass) team | text | not null jersey | integer | not null first_name | text | not null last_name | text | not nullIndexes: "players_pkey" PRIMARY KEY, btree (id) "players_team_jersey_idx" UNIQUE, btree (team, jersey)
A primary key is a not-null, unique constraint with an index
ETL with Postgres
moat ad search
postgres=# create table impressions(
    brand_id integer not null,
    ad_id integer not null,
    created_at timestamp not null default current_timestamp
);

-- For a specific brand and ad, we want results like this:
-- day | brand_id | ad_id | num_impressions
Impressions table
postgres=# create table impressions(
postgres-#     brand_id int not null,
postgres-#     ad_id int not null,
postgres-#     created_at timestamp not null default current_timestamp
postgres-# );

postgres=# select created_at::date, brand_id, ad_id, count(*)
postgres-# from impressions
postgres-# where ad_id = foo and brand_id = bar
postgres-# group by created_at::date, brand_id, ad_id;

-- This query is a mouthful and it'd be a pain to type out all the time.
Simple group by
postgres=# create view ad_brand_impressions_per_day as
postgres-# select created_at::date as day, brand_id, ad_id, count(*) as num_impressions
postgres-# from impressions
postgres-# group by created_at::date, brand_id, ad_id;

postgres=# select * from ad_brand_impressions_per_day
postgres-# where brand_id=?;

-- What if this is slow?
View
-- What if you want to see the number of ads a brand had in a date range?

postgres=# select brand_id, count(distinct ad_id)
postgres-# from ad_brand_impressions_per_day
postgres-# where brand_id=<blah> and day between '2017-01-01' and '2017-01-31'
postgres-# group by brand_id;
-- What if this is slow?
Number of creatives per brand?
create materialized view impressions_by_day as
select
    created_at::date as day,
    brand_id,
    ad_id,
    count(*) as num_impressions
from impressions
group by created_at::date, brand_id, ad_id;

create unique index on impressions_by_day(day, brand_id, ad_id);

select * from impressions_by_day where brand_id=foo and ad_id=bar;
Materialized view
-- Your ETL is now:
refresh materialized view impressions_by_day;

-- But a plain refresh blocks reads, so if you have a unique index, you can do:
refresh materialized view concurrently impressions_by_day;
Materialized view
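If you try a concurrent refresh without the unique index, Postgres refuses (illustrative session):

postgres=# refresh materialized view concurrently impressions_by_day;
ERROR:  cannot refresh materialized view "impressions_by_day" concurrently
HINT:  Create a unique index with no WHERE clause on one or more columns of the materialized view.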
JSON and Postgres
latency analytics
• Our data is coming in via JSON because it's coming from a browser
• We don't know how our JSON schema will change
• We're going to have a lot of columns
• In this case, maybe it's fine to use JSON column type
Things to consider
postgres=# create table ad_analysis(
    id serial primary key,
    analysis json not null,
    created_at timestamp not null default current_timestamp,
    updated_at timestamp not null default current_timestamp
);
postgres=# insert into ad_analysis(analysis) values ('{"size": 1234}'::json);
postgres=# select analysis->>'size' from ad_analysis;
 ?column?
----------
 1234
(1 row)
JSON
postgres=# SELECT '{"c":0, "a":2,"a":1}'::json;json ------------------------ {"c":0, "a":2,"a":1}(1 row)
postgres=# SELECT '{"c":0, "a":2,"a":1}'::json->>'a'; ?column? ---------- 1(1 row)
postgres=# create index on ad_analysis(analysis);
ERROR:  data type json has no default operator class for access method "btree"
Normal JSON type isn't great
postgres=# SELECT '{"c":0, "a":2,"a":1}'::jsonb; jsonb ------------------ {"a": 1, "c": 0}(1 row)
Time: 0.833 ms
postgres=# create index on ad_analysis(analysis);CREATE INDEXTime: 30.101 ms
-- Really good performance!
Use JSONB
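The btree index above only helps with comparisons on the whole document. To query inside the JSON, jsonb also supports GIN indexes with the containment operator (a sketch, reusing the ad_analysis table from the earlier slides):

postgres=# create index on ad_analysis using gin (analysis);
postgres=# select * from ad_analysis where analysis @> '{"size": 1234}';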
Scaling Postgres @ Moat
moat analytics
• Measure attention online
• Don't track cookies/IP addresses
• We're a neutral third party that publishers and advertisers use
• We process billions of events a day
What we do
lots of data
• Elmo: Last 33 Days
• Marjory: A few years
• Frackles: All of our historical data
• Decanter: aggregates our data
Databases
decanter
create server other_db foreign data wrapper postgres_fdw options (
    host 'other_db.moat.co',
    port '5432',
    dbname 'other_db'
);
create user mapping for public server other_db options (
    user 'user',
    password 'password'
);

create schema foo;
import foreign schema foo from server other_db into foo;

-- works like a normal table
select * from foo.some_table;
Foreign Data Wrapper
• Great way to talk to other databases
• Can talk to other types of data stores too (there's a mysql_fdw, elasticsearch_fdw, etc.)
• Careful – the query planner can be dumb. You can put the results in a materialized view if you want (see the sketch below).
Foreign Data Wrappers
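A sketch of the materialized-view workaround from the bullet above: snapshot the remote query locally so the planner only has to deal with a local table (foo.some_table is the hypothetical foreign table imported earlier; the unique index assumes an id column):

create materialized view local_snapshot as
select * from foo.some_table;

create unique index on local_snapshot(id); -- enables concurrent refreshes
refresh materialized view concurrently local_snapshot;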
• PostgreSQL instance that doesn't store data
• High CPU, high memory, no storage
• Gets results from the other databases via FDW and aggregates them
• Highly available (pgbouncer)
• If the databases behind it change, the client doesn't notice
Decanter
marjory
• We have a lot of data
• We're basically bean counters ("how many times did X happen?"), so we mostly need to filter and sum
• The rows are very wide and very sparse (a lot of NULLs).
• This makes columnar solutions not work well
• This is very compressible
Our data considerations
• 8-10x compression for our data
• CPU bound instead of IO bound
• Pretty great performance
• We use it for some historical data (a few years)
• Inflexible: we're trading generality for fit to our use case
• Probably not great if there are a lot of rows
Marjory
frackles
• Access ALL our historical data
• Needs to be cheap
• Support a lot of concurrency
• Be able to spin up new instances fast for scale
• Bonus: be faster than Redshift
Frackles
• Our database was just a series of SQLite files in S3…
• We used AWS Lambda to pull those S3 files and read from them…
• We used a Python client to manage all of those lambdas and gave it a simple API…
• We wrote our own FDW to talk to the Python client…
• (I was really skeptical)
What if…
• Surprisingly good performance
• It's dirt cheap – you only pay for S3 and Lambdas
• Completely stateless (almost no ops!)
• Constant overhead
• S3 performance has high variance
• Capped on number of concurrent lambdas
• Better on queries with little aggregation
Frackles
In summary
We're hiring: [email protected]