Introduction To Postgres
Rodrigo Menezes
• I joined Moat in 2013, when we were ~20 people
• Acquired by Oracle during summer of 2017
• Currently, we’re about ~250 people
• I started off as a frontend developer
• Why Postgres?
• Brief introduction to SQL
• ORM (Object relational mapper)
• Transactions
• Performance and indices
• Views and Materialized Views
• JSON and Postgres
• Scaling Postgres @ Moat
This Talk
Why PostgreSQL?
• Free + open source
• Really good community
• Fast bug fixes and frequent release cycle
• Amazing docs
• "Add features slowly, do them well"
• Great performance
• Ton of cool features
• SQL is great and people can pry it from my cold, dead hands
Why PostgreSQL?
What is SQL?
• SQL is a language for inserting, updating, reading and deleting data
• SQL is meant for humans (pros and cons to this)
• SQL is everywhere. Has been since the 70s
• Every collection of data is a "table"
• Every new datapoint is a "row"
SQL
postgres=# CREATE TABLE users(
    id INTEGER NOT NULL,
    email TEXT NOT NULL,
    name TEXT
);
CREATE TABLE
Time: 40.498 ms
Creating a table
postgres=# INSERT INTO users(id, email) VALUES(0, '[email protected]');
INSERT 0 1
Time: 13.561 ms
Inserting a row
postgres=# SELECT id, email, name FROM users WHERE id=0;
 id |           email            | name
----+----------------------------+------
  0 | [email protected]        |
(1 row)
Selecting a row
postgres=# SELECT * FROM users WHERE id=0;
 id |           email            | name
----+----------------------------+------
  0 | [email protected]        |
(1 row)
Selecting all columns
postgres=# select * from users where id=0;
 id |           email            | name
----+----------------------------+------
  0 | [email protected]        |
(1 row)
Case insensitive
postgres=# UPDATE users SET email='[email protected]' WHERE id=0;
UPDATE 1
Time: 12.690 ms
Updating a row
postgres=# DELETE FROM users WHERE id=0;
DELETE 1
Time: 9.671 ms
Deleting a row
postgres=# DROP TABLE users;
DROP TABLE
Time: 11.198 ms
Dropping a table
• It's great for human beings
• Standards are nice
• In the majority of cases, you want your data strongly typed (see the example below)
• Fewer accidents
• More assumptions => more performance / compression
• Compile-time errors in strongly typed languages
Why SQL?
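For instance, strong typing means bad data is rejected before it ever lands in the table. An illustrative psql session (typed_demo is a made-up table; the exact error wording varies by Postgres version):

postgres=# CREATE TABLE typed_demo(id INTEGER NOT NULL);
postgres=# INSERT INTO typed_demo(id) VALUES ('ten');
ERROR:  invalid input syntax for integer: "ten"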
ORMs (Object Relational Mappers)
# python
import psycopg2  # Library to talk to Postgres

def create_cursor():
    conn = psycopg2.connect("dbname=test user=postgres")
    return conn.cursor()

def update_email(cursor, id, new_email):
    query = (
        'UPDATE users SET email={} '
        'WHERE id = {}'
    ).format(new_email, id)
    cursor.execute(query)
You can write a raw query…
...
def update_email(cursor, id, new_email):
    query = (
        'UPDATE users SET email={} '
        'WHERE id = {}'
    ).format(new_email, id)
    cursor.execute(query)
update_email(create_cursor(), 1, 'NULL; DROP TABLE users; --')
# `--` is a SQL comment, so Postgres ignores the rest of the line
SQL Injection
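The standard fix is to keep user input out of the query text entirely and pass it as a bind parameter, which is what psycopg2 does when you pass values as a separate argument to execute(). A minimal sketch of the same idea in plain SQL, using a prepared statement (set_email is a made-up name):

postgres=# PREPARE set_email(text, integer) AS
postgres-#     UPDATE users SET email = $1 WHERE id = $2;
PREPARE
postgres=# EXECUTE set_email('NULL; DROP TABLE users; --', 1);
UPDATE 1
-- The payload is stored as a harmless string value; nothing gets dropped.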
• Library that handles all your SQL for you
• Escapes user input for you, making queries safe from injection
• Neater syntax that hides SQL complexity (which is mostly a pro but sometimes a con)
ORMs
// db.js
import { Sequelize } from 'sequelize';

const sequelize = new Sequelize({
  database: POSTGRES_DB,
  dialect: 'postgres',
  host: POSTGRES_HOSTNAME,
  password: POSTGRES_PASSWORD,
  username: POSTGRES_USER,
});
Sequelize
// models/User.js
import { Sequelize } from 'sequelize';

const User = sequelize.define('users', {
  id: {
    type: Sequelize.INTEGER,
    autoIncrement: true,
    primaryKey: true
  },
  email: {
    type: Sequelize.TEXT,
    unique: true
  }
});
Sequelize Model
// services/createUser.js
import { User } from './models/User.js';

async function createUser() {
  …
  const user = await User.create({ email });
  …
}

async function getUser() {
  …
  const user = await User.findOne({ where: { email } });
  …
}
Model usage
Transactions andconstraints
postgres=# create table users(
    id serial primary key,
    email text not null,
    money bigint not null default 0
);
Let's make a Venmo clone
postgres=# CREATE TABLE users(
    id SERIAL NOT NULL,
    email TEXT NOT NULL
);

-- Now I don't need to specify ids
postgres=# INSERT INTO users(email) VALUES ('[email protected]');

postgres=# SELECT * FROM users;
 id |           email
----+----------------------------
  0 | [email protected]
(1 row)
Aside: serials
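Under the hood, serial is just shorthand for an integer column whose default comes from a sequence; roughly (a sketch of what Postgres creates for you):

postgres=# create sequence users_id_seq;
postgres=# create table users(
postgres-#     id integer not null default nextval('users_id_seq'),
postgres-#     email text not null
postgres-# );
postgres=# alter sequence users_id_seq owned by users.id;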
-- Make sure user 1 has enough money
postgres=# select money from users where id = 1;

-- Decrease user 1's money
postgres=# update users set money = money - 20 where id = 1;

-- What happens if our webserver disconnects here?

-- Increase user 2's money
postgres=# update users set money = money + 20 where id = 2;
Sending $20 from user 1 to user 2
postgres=# begin;
postgres=# select money from users where id = 1;
postgres=# update users set money = money - 20 where id = 1;
postgres=# update users set money=money + 20 where id=2;
postgres=# commit;
Transactions
postgres=# begin;
postgres=# select money from users where id = 1;
postgres=# update users set money = money - 20 where id = 1;
postgres=# abort; -- revert all changes!
Transactions
• Transactions allow the effects of your code to happen in one go
• Achieve ACID compliance
• Atomicity – everything in the transaction happens or none of it does
• Consistency – after a transaction, the database will be in a valid state. Rules always apply.
• Isolation – a newly committed transaction shouldn't affect your current transaction (there are multiple levels of this; see the sketch below)
• Durability – once a transaction is committed, it'll remain so even in case of power loss.
Transactions
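On the isolation point: the level is chosen per transaction (a sketch; Postgres defaults to READ COMMITTED):

postgres=# begin isolation level serializable;
BEGIN
postgres=# -- statements here all see one consistent snapshot of the database
postgres=# commit;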
// wrapView.js
import sequelize from './db/sequelize';

export default function wrapView(fn) {
  return (req, res, next) => {
    sequelize.transaction(t => {
      return fn(req, res);
    }).catch(next);
  };
}

// server.ts
app.post("/api/user", wrapView(users.create));
Wrap everything in a transaction
-- Assume user 1 has $20, and two clients run this at the same time:
-- both SELECTs see $20, so both clients think the transfer is safe.

postgres=# select money from users where id = 1;
postgres=# update users set money = money - 20 where id = 1;

-- Increase user 2's money
postgres=# update users set money = money + 20 where id = 2;

-- Now user 1 has -$20
Race condition
-- Meanwhile, the second client sends $20 from user 1 to user 3:
postgres=# select money from users where id = 1;
postgres=# update users set money = money - 20 where id = 1;

-- Increase user 3's money
postgres=# update users set money = money + 20 where id = 3;
postgres=# create table users(
    id serial primary key,
    email text not null,
    money bigint not null default 0 check (money >= 0)
);
Constraints
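With that check in place, an overdraft fails loudly instead of going negative (illustrative; users_money_check is the constraint name Postgres generates by default):

postgres=# update users set money = money - 40 where id = 1;
ERROR:  new row for relation "users" violates check constraint "users_money_check"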
postgres=# begin;
postgres=# update users set money = money - 20 where id = 1; -- will error out if necessary
postgres=# update users set money=money + 20 where id=2;
postgres=# commit;
Race condition
postgres=# begin;
postgres=# update users set money = money - 20 where id = 1; -- will error out if necessary
postgres=# update users set money=money + 20 where id=3;
postgres=# commit;
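Another common fix for this race, beyond the check constraint: lock the row when you read it, so concurrent transfers from the same account serialize. A sketch using SELECT ... FOR UPDATE:

postgres=# begin;
postgres=# select money from users where id = 1 for update; -- blocks other writers to this row
postgres=# update users set money = money - 20 where id = 1;
postgres=# update users set money = money + 20 where id = 2;
postgres=# commit;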
# python
import transaction
from pyramid import testing

from moatpro.web import add_routes

class BaseTest(object):
    def setup_method(self, method):
        self.config = testing.setUp()
        add_routes(self.config)
        transaction.begin()

    def teardown_method(self, method):
        testing.tearDown()
        transaction.abort()
Transactions make for safe testing
Performance and indices
postgres=# CREATE TABLE users(
postgres-#     id SERIAL,
postgres-#     email TEXT NOT NULL
postgres-# );
postgres=# INSERT INTO users(email)
postgres-# SELECT 'test_' || id::text || '@email.com'
postgres-# FROM generate_series(0, 9999999) id;
INSERT 0 10000000
Time: 27608.651 ms
postgres=# SELECT * FROM users;
 id |        email
----+----------------------
  1 | [email protected]
  2 | [email protected]
  3 | [email protected]
...
Let's make a lot of fake data
postgres=# select * from users where email='[email protected]';
 id |       email
----+-------------------
  1 | [email protected]
(1 row)

Time: 907.094 ms
-- Can we make this faster?
How slow do things get?
postgres=# explain select * from users where email='[email protected]';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=1000.00..131603.33 rows=50000 width=36)
   Workers Planned: 2
   ->  Parallel Seq Scan on users  (cost=0.00..125603.33 rows=20833 width=36)
         Filter: (email = '[email protected]'::text)
(4 rows)
Explain query
postgres=# CREATE INDEX ON users(email);
CREATE INDEX
Time: 20291.741 ms
postgres=# explain select * from users where email='[email protected]';
                                 QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using users_email_idx on users  (cost=0.56..8.58 rows=1 width=36)
   Index Cond: (email = '[email protected]'::text)
(2 rows)
Indices
postgres=# select * from users where email='[email protected]';
 id |       email
----+-------------------
  1 | [email protected]
(1 row)

Time: 1.547 ms

-- Went from ~1s to ~1ms: a ~1000x speedup!
Indices
postgres=# CREATE UNIQUE INDEX ON users(email);

postgres=# INSERT INTO users(email) VALUES ('[email protected]');
INSERT 0 1
Time: 12.001 ms

postgres=# INSERT INTO users(email) VALUES ('[email protected]');
ERROR:  duplicate key value violates unique constraint "users_email_idx"
DETAIL:  Key (email)=([email protected]) already exists.
Time: 1.410 ms
Unique index
postgres=# CREATE TABLE users2(
    id SERIAL,
    email TEXT  -- NOT NULL
);
postgres=# CREATE UNIQUE INDEX ON users2(email);

postgres=# INSERT INTO users2(email) VALUES (null);
INSERT 0 1
Time: 21.365 ms

postgres=# INSERT INTO users2(email) VALUES (null);
INSERT 0 1
Time: 11.785 ms
Unique indices... wat?
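This is standard SQL semantics: NULL never compares equal to NULL, so a unique index treats every NULL as distinct. You can see it directly; the comparison below returns NULL, which psql prints as an empty cell:

postgres=# SELECT null = null;
 ?column?
----------

(1 row)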
postgres=# CREATE TABLE players(
postgres-#     id SERIAL PRIMARY KEY,
postgres-#     team TEXT NOT NULL,
postgres-#     jersey INTEGER NOT NULL,
postgres-#     first_name TEXT NOT NULL,
postgres-#     last_name TEXT NOT NULL
postgres-# );

postgres=# CREATE UNIQUE INDEX ON players(team, jersey);
You can do indices on multiple columns
postgres=# \d players
Table "public.players" Column | Type | Modifiers ------------+---------+------------------------------------------------------ id | integer | not null default nextval('players_id_seq'::regclass) team | text | not null jersey | integer | not null first_name | text | not null last_name | text | not nullIndexes: "players_pkey" PRIMARY KEY, btree (id) "players_team_jersey_idx" UNIQUE, btree (team, jersey)
A primary key is a not-null, unique constraint with an index
ETL with Postgres
moat ad search
postgres=# create table impressions(
    brand_id integer not null,
    ad_id integer not null,
    created_at timestamp not null default current_timestamp
);

-- For a specific brand and ad, we want results like this:
-- day | brand_id | ad_id | num_impressions
Impressions table
postgres=# create table impressions(
postgres-#     brand_id int not null,
postgres-#     ad_id int not null,
postgres-#     created_at timestamp not null default current_timestamp
postgres-# );

postgres=# select created_at::date, brand_id, ad_id, count(*)
postgres-# from impressions
postgres-# where ad_id = foo and brand_id = bar
postgres-# group by created_at::date, brand_id, ad_id;

-- This query is a mouthful and it'd be a pain to type out all the time.
Simple group by
postgres=# create view ad_brand_impressions_per_day as
postgres-# select created_at::date as day, brand_id, ad_id, count(*) as num_impressions
postgres-# from impressions
postgres-# group by created_at::date, brand_id, ad_id;

postgres=# select * from ad_brand_impressions_per_day
postgres-# where brand_id=?;

-- What if this is slow?
View
-- What if you want to see the number of ads a brand had in a date range?

postgres=# select brand_id, count(distinct ad_id)
postgres-# from ad_brand_impressions_per_day
postgres-# where brand_id=<blah> and day between '2017-01-01' and '2017-01-31'
postgres-# group by brand_id;
-- What if this is slow?
Number of creatives per brand?
create materialized view impressions_by_day as
select
    created_at::date as day,
    brand_id,
    ad_id,
    count(*) as num_impressions
from impressions
group by created_at::date, brand_id, ad_id;

create unique index on impressions_by_day(day, brand_id, ad_id);

select * from impressions_by_day where brand_id=foo and ad_id=bar;
Materialized view
-- Your ETL is now:
refresh materialized view impressions_by_day;

-- But a plain refresh blocks reads, so if you have a unique index, you can do:
refresh materialized view concurrently impressions_by_day;
Materialized view
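If you try a concurrent refresh without the unique index, Postgres refuses (illustrative session):

postgres=# refresh materialized view concurrently impressions_by_day;
ERROR:  cannot refresh materialized view "impressions_by_day" concurrently
HINT:  Create a unique index with no WHERE clause on one or more columns of the materialized view.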
JSON and Postgres
latency analytics
• Our data is coming in via JSON because it's coming from a browser
• We don't know how our JSON schema will change
• We're going to have a lot of columns
• In this case, maybe it's fine to use JSON column type
Things to consider
postgres=# create table ad_analysis(
    id serial primary key,
    analysis json not null,
    created_at timestamp not null default current_timestamp,
    updated_at timestamp not null default current_timestamp
);
postgres=# insert into ad_analysis(analysis) values ('{"size": 1234}'::json);
postgres=# select analysis->>'size' from ad_analysis;
 ?column?
----------
 1234
(1 row)
JSON
postgres=# SELECT '{"c":0, "a":2,"a":1}'::json;json ------------------------ {"c":0, "a":2,"a":1}(1 row)
postgres=# SELECT '{"c":0, "a":2,"a":1}'::json->>'a'; ?column? ---------- 1(1 row)
postgres=# create index on ad_analysis(analysis);
ERROR:  data type json has no default operator class for access method "btree"
Normal JSON type isn't great
postgres=# SELECT '{"c":0, "a":2,"a":1}'::jsonb; jsonb ------------------ {"a": 1, "c": 0}(1 row)
Time: 0.833 ms
postgres=# create index on ad_analysis(analysis);CREATE INDEXTime: 30.101 ms
-- Really good performance!
Use JSONB
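The btree index above only helps with comparisons on the whole document. To query inside the JSON, jsonb also supports GIN indexes with the containment operator (a sketch, reusing the ad_analysis table from the earlier slides):

postgres=# create index on ad_analysis using gin (analysis);
postgres=# select * from ad_analysis where analysis @> '{"size": 1234}';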
Scaling Postgres @ Moat
moat analytics
• Measure attention online
• Don't track cookies/IP addresses
• We're a neutral third party that publishers and advertisers use
• We process billions of events a day
What we do
lots of data
• Elmo: Last 33 Days
• Marjory: A few years
• Frackles: All of our historical data
• Decanter: aggregates our data
Databases
decanter
create server other_db foreign data wrapper postgres_fdw options (
    host 'other_db.moat.co',
    port '5432',
    dbname 'other_db'
);
create user mapping for public server other_db options (
    user 'user',
    password 'password'
);

create schema foo;
import foreign schema foo from server other_db into foo;

-- works like a normal table
select * from foo.some_table;
Foreign Data Wrapper
• Great way to talk to other databases
• Can talk to other types of data stores too (there's a mysql_fdw, elasticsearch_fdw, etc.)
• Careful – the query planner can be dumb. You can put the results in a materialized view if you want (see the sketch below).
Foreign Data Wrappers
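A sketch of the materialized-view workaround from the bullet above: snapshot the remote query locally so the planner only has to deal with a local table (foo.some_table is the hypothetical foreign table imported earlier; the unique index assumes an id column):

create materialized view local_snapshot as
select * from foo.some_table;

create unique index on local_snapshot(id); -- enables concurrent refreshes
refresh materialized view concurrently local_snapshot;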
• PostgreSQL instance that doesn't store data
• High CPU, high memory, no storage
• Gets results from the other databases via FDW and aggregates them
• Highly available (pgbouncer)
• If the databases behind it change, the client doesn't notice
Decanter
marjory
• We have a lot of data
• We're basically bean counters ("how many times did X happen?"), so we mostly need to filter and sum
• The rows are very wide and very sparse (a lot of NULLs).
• This makes columnar solutions not work well
• This is very compressible
Our data considerations
• 8-10x compression for our data
• CPU bound instead of IO bound
• Pretty great performance
• We use it for some historical data (a few years)
• Inflexible: we're trading generality for fit to our use case
• Probably not great if there are a lot of rows
Marjory
frackles
• Access ALL our historical data
• Needs to be cheap
• Support a lot of concurrency
• Be able to spin up new instances fast for scale
• Bonus: be faster than Redshift
Frackles
• Our database was just a series of SQLite files in S3…
• We used AWS Lambda to pull those S3 files and read from them…
• We used a Python client to manage all of those lambdas and gave it a simple API…
• We wrote our own FDW to talk to the Python client…
• (I was really skeptical)
What if…
• Surprisingly good performance
• It's dirt cheap – you only pay for S3 and Lambdas
• Completely stateless (almost no ops!)
• Constant overhead
• S3 performance has high variance
• Capped on number of concurrent lambdas
• Better on queries with little aggregation
Frackles
In summary
We're hiring: [email protected]