Jason Petersen - Citus Datainfo.citusdata.com/rs/235-CNE-301/images/Sharding_and...Increasing by...

Scaling & Sharding PostgreSQL Principles and Practice Jason Petersen Software Developer, Citus Data Copyright © 2015 Citus Data, Inc. 1

Transcript

Page 1:

Scaling & Sharding PostgreSQL

Principles and Practice

Jason Petersen

Software Developer, Citus Data

Page 2:

This talk

Page 3:

What we talk about when we talk about sharding:

Page 4:

What we talk about when we talk about sharding:

Horizontal partitioning

Page 5:

Horizontal partitioning […] involves putting different rows into different tables.

— Wikipedia, “Shard (database architecture)”

Page 6:

Sharding goes beyond this: […] it does this across potentially multiple instances of the schema.

— Also Wikipedia

Page 7:

Putting our foot down

Sharding is a form of horizontal partitioning which distributes database rows across totally separate physical database servers.

Page 8:

A form of horizontal partitioning which distributes rows across totally separate physical database servers.

— Citus Data

Page 9:

What is Citus Data?

Page 10:

(Pronounced like “Midas”)

Page 11:

(We make CitusDB)

Page 12:

What is CitusDB?

— Scalable analytics DB

— Extends PostgreSQL

— Brings distributed query logic

— Supports all types, extensions

— Does it all using sharding

Page 13:

You may be thinking…

— Node #1 (PostgreSQL): click_events_2012

— Node #2: click_events_2013

— Node #3: click_events_2014

Page 14:

How does that scale?

Page 15:

Not very well…

— Node #1: click_events_2012 (4 TB)

— Node #2: click_events_2013 (4 TB)

— Node #3: click_events_2014 (4 TB)

— Node #4

Page 16:

Not very well…

— Node #1: click_events_2012 (4 TB)

— Node #2: click_events_2013 (4 TB)

— Node #3: click_events_2014 (4 TB)

— Node #4

1 TB (each)

Page 17:

What about load characteristics?

Page 18:

Not great, either…

— Node #1: click_events_2012

— Node #2: click_events_2013

— Node #3: click_events_2014

— Node #4: click_events_2012

— Node #5: click_events_2013

— Node #6: click_events_2014

Page 19:

Not great, either…

— Node #1: click_events_2012

— Node #2: click_events_2013

— Node #3: click_events_2014

— Node #4: click_events_2012

— Node #5: click_events_2013

— Node #6: click_events_2014

Page 20:

So what to do?

Page 21:

… when initially implementing sharding you’ll want to create an arbitrary number of logical shards.

— Craig Kerstiens, “Sharding Your Database”

Page 22:

“Logical”?

Page 23:

[the] system consists of several thousand ‘logical’ shards that are mapped in code to far fewer physical shards…

— “Sharding & IDs at Instagram”

Page 24:

… we can start with just a few database servers, and eventually move to many more…

— “Sharding & IDs at Instagram”

Page 25:

A better approach

— Node #1 (PostgreSQL): shards 1, 3, 4, 6, 7, 9, …

— Node #2: shards 1, 2, 4, 5, 7, 8, …

— Node #3: shards 2, 3, 5, 6, 8, 9, …

Page 26:

Easier growth…

— Node #1 (PostgreSQL): shards 1, 3, 4, 6, 7, 9, …

— Node #2: shards 1, 2, 4, 5, 7, 8, …

— Node #3: shards 2, 3, 5, 6, 8, 9, …

— Node #4

512 MB (each)

Page 27:

Graceful failure…

— Node #1 (PostgreSQL): shards 1, 6, 7, …

— Node #2: shards 1, 2, 7, …

— Node #3: shards 2, 3, 8, …

— Node #4: shards 3, 4, 8, …

— Node #5: shards 4, 5, 9, …

— Node #6: shards 5, 6, 9, …

Page 28:

Graceful failure…

— Node #1 (PostgreSQL): shards 1, 6, 7, …

— Node #2: shards 1, 2, 7, …

— Node #3: shards 2, 3, 8, …

— Node #4: shards 3, 4, 8, …

— Node #5: shards 4, 5, 9, …

— Node #6: shards 5, 6, 9, …

Page 29:

Logical!

Page 30:

Logical shard benefits

— Enables rebalancing

— Better failure modes

— More granular migrations

— Performance benefits
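The rebalancing benefit can be made concrete with a toy placement map. This is only a sketch: `place_round_robin` and `add_node` are hypothetical helpers, the node names are made up, and replication is ignored. It shows that with many small logical shards, bringing a new node online means copying a few shards rather than re-splitting whole tables.

```python
from collections import defaultdict

def place_round_robin(shard_ids, nodes):
    """Assign each logical shard to a node in round-robin order."""
    return {shard: nodes[i % len(nodes)] for i, shard in enumerate(shard_ids)}

def add_node(placement, new_node):
    """Move shards from the fullest nodes until new_node holds its fair share.

    Returns (new_placement, moved_shard_ids); only moved shards need copying.
    """
    by_node = defaultdict(list)
    for shard, node in sorted(placement.items()):
        by_node[node].append(shard)
    by_node.setdefault(new_node, [])
    target = len(placement) // len(by_node)   # fair share per node
    moved = []
    while len(by_node[new_node]) < target:
        # Take one shard from whichever existing node currently holds the most.
        donor = max((n for n in by_node if n != new_node),
                    key=lambda n: len(by_node[n]))
        shard = by_node[donor].pop()
        by_node[new_node].append(shard)
        moved.append(shard)
    new_placement = {s: n for n, shards in by_node.items() for s in shards}
    return new_placement, moved

placement = place_round_robin(range(1, 13), ["node1", "node2", "node3"])
new_placement, moved = add_node(placement, "node4")
print(f"moved {len(moved)} of {len(placement)} shards")
```

With twelve shards on three nodes, adding a fourth node moves three shards and leaves the other nine where they were.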

Page 31:

But wait!

Page 32:

Sharding concerns

— Operations burden

— Network resiliency

— ACID tradeoffs?

— No return

Page 33:

(Should be your last)

Page 34:

Pyramids!

Page 35:

Self-Actualization

Esteem

Love/Belonging

Safety

Physiological

Page 36:

Sharding!

Split Load

Hardware and Tuning

Database Schema

Application Code

Page 37:

Always use SCIENCE

Page 38:

Getting to Scale

Principles

1. Generate realistic load

2. Measure, measure, measure…

3. Change one thing

4. Determine the impact

5. GOTO the first step

Page 39:

Getting to Scale

Determining Workload

— pg_stat_statements

— pgBadger

— PoWA

— New Relic

Page 40:

Getting to Scale

Generating Load

— pgbench

— apachebench

— jmeter

— Fill up your queue!

Page 41:

Getting to Scale

Measuring

— Long runtimes

— Eliminate hidden unknowns

— pgbench-tools

— time

Page 42:

Change and Compare

Page 43:

Sharding!

Split Load

Hardware and Tuning

Database Schema

Application Code

Page 44:

Getting to sharding…

Optimize application logic

Add caching. Use connection pools. Bundle writes and issue them in batches. Use JOINs judiciously. Dig beneath your ORM.

Page 45:

Getting to sharding…

✓ Optimize application logic

Tweak schemas

Denormalize where necessary. Add indexes to all commonly used columns. Locally partition tables if it makes sense. Move hot columns to separate tables.

Page 46:

Getting to sharding…

✓ Optimize application logic

✓ Tweak schemas

Upgrade and tune

Benchmark your system. Determine resource bottlenecks. Upgrade. Tune postgresql.conf to within an inch of its life. Do the same [1] for your OS.

[1] Check out Brendan Gregg’s USE Method

Page 47:

Getting to sharding…

✓ Optimize application logic

✓ Tweak schemas

✓ Upgrade and tune

Try replication

Use a read replica. Use read replicas for every distinct workload (to avoid background jobs evicting your app’s working set from the DB cache).

Page 48:

Getting to sharding…

✓ Optimize application logic

✓ Tweak schemas

✓ Upgrade and tune

✓ Try replication

Split writes

Modularize concerns within your app to isolate write-heavy tables to their own database.

Page 49:

You’ve already…

✓ Optimized application logic

✓ Tweaked schemas

✓ Upgraded and tuned

✓ Tried replication

✓ Split writes

When you’re on the best hardware with a tuned OS, optimized queries, and servers devoted to each workload and you’re still worried about scaling?

Page 50:

You’re ready to shard.

Page 51:

Our dream extension

— Creates and manages shards

— Uses regular SQL commands

— Supports replicas/failover

— Integrated with CitusDB

Page 52:

pg_shard

Page 53:

pg_shard

Motivation

— Real-time ingest for CitusDB

— Customers building their own

— Could be NoSQL alternative

Page 54:

pg_shard

User needs

— Dynamic rebalancing/scaling

— “Automagic” failure handling

— Transactions not so important

Page 55:

Good News, Everyone!

Page 56:

Upcoming Developments

— Streamlining offerings

— CitusDB soon open-source

— Extension, not standalone

— Real-time modifications

— Contact us for early access

Page 57:

Sharding principles

Page 58:

Principles of sharding

— Need to know where to put rows

— And where to find stored ones

— Designate a dimension of data as key

— In relational databases: a column

— Logical shard covers range of values

Page 59:

Visualized

MongoDB uses logical shards, but calls them “chunks”. Weird, but they made a decent diagram [2] of the concept:

[2] From the MongoDB Manual, “Shard Keys”

Page 60:

Shard key refinements

— Pass into hash function (smooths out distribution)

— Use contiguous range

— Specify a list of columns

— Generalize to any expression
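To see why passing the key through a hash function smooths out distribution, compare where a burst of sequential ids lands under each scheme. A minimal sketch with assumptions stated up front: `zlib.crc32` is only a stand-in hash (not what any particular database uses), four shards and a one-million key space are arbitrary, and taking the hash modulo the shard count is a simplification of mapping hash values to ranges.

```python
import zlib
from collections import Counter

SHARD_COUNT = 4

def range_shard(key, max_key=1_000_000):
    """Contiguous ranges: each shard owns an equal slice of the raw key space."""
    return min(key * SHARD_COUNT // max_key, SHARD_COUNT - 1)

def hash_shard(key):
    """Hash first, then bucket; crc32 is a stand-in hash function."""
    return zlib.crc32(str(key).encode()) % SHARD_COUNT

recent_ids = range(1, 1001)   # e.g. a thousand freshly assigned sequential ids
by_range = Counter(range_shard(k) for k in recent_ids)
by_hash = Counter(hash_shard(k) for k in recent_ids)
print("range:", dict(by_range))
print("hash: ", dict(by_hash))
```

Under range partitioning every one of these ids falls into the first shard, which is the "hot spot" case; hashing spreads the same ids across all four shards, at the cost of losing cheap range scans on the raw key.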

Page 61:

Choosing a key

Page 62:

The field you choose as your hashed shard key should have a good cardinality.

— MongoDB Manual, “Shard Keys”

Page 63:

… the correct shard key can have a great impact on […] performance [and] capability…

— ib., “Considerations for Selecting Shard Keys”

Page 64:

Choosing a key

— What is most important to your application?

— Spreading incoming writes

— Targeting reads to reduce latency

— Consider key frequently in WHERE clauses

— Use a hybrid approach when it makes sense (shard on customer, partition on time)

— Mind the “hot spots”

Page 65:

Costs of poor choice

— Cross-shard scans hurt performance

— Low cardinality limits ultimate scalability

— Switching keys after distribution burdensome

Page 66:

So how does this thing work?

Page 67:

pg_shard

Installation

— Build from GitHub source

— pgxnclient install pg_shard

— sudo yum install pg_shard_94

— CloudFormation templates [3]

[3] Available on the Citus Data blog

Page 68:

pg_shard

— Master Node (PostgreSQL + pg_shard): shard and shard placement metadata

— Worker Node #1: shards 1, 3, 4, 6, 7, 9, …

— Worker Node #2: shards 1, 2, 4, 5, 7, 8, …

— Worker Node #3: shards 2, 3, 5, 6, 8, 9, …

Page 69:

pg_shard

Master node

— Holds authoritative shard state

— One metadata row per:

    — Sharded table

    — Shard

    — Placement

— Just regular tables

Page 70:

pg_shard

Master failure

In order of increasing acceptable downtime…

1. Use streaming replication and failover

2. Use EBS volume for data directory

3. Restore from pg_dump, etc.

4. Reconstruct from workers

Page 71:

pg_shard

Metadata structure

postgres=# SELECT * FROM pgs_distribution_metadata.shard;

  id   | relation_id | storage |  min_value  |  max_value
-------+-------------+---------+-------------+-------------
 10004 |      177880 | t       | -2147483648 | -1879048194
 10005 |      177880 | t       | -1879048193 | -1610612739
 10006 |      177880 | t       | -1610612738 | -1342177284
 10007 |      177880 | t       | -1342177283 | -1073741829
 10008 |      177880 | t       | -1073741828 |  -805306374
 10009 |      177880 | t       |  -805306373 |  -536870919
   ... |         ... | ...     |         ... |         ...
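The min_value and max_value pairs in this output are consistent with cutting the signed 32-bit hash space into 16 equal slices. The sketch below reproduces them under that assumption; `hash_ranges` is a hypothetical helper written for illustration, not pg_shard's own code, and the rounding rule was chosen because it matches the rows shown.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1
UINT32_MAX = 2**32 - 1

def hash_ranges(shard_count):
    """Split the signed 32-bit hash space into shard_count contiguous ranges."""
    step = UINT32_MAX // shard_count          # width of each slice
    ranges = []
    for index in range(shard_count):
        low = INT32_MIN + index * step
        high = low + step - 1
        if index == shard_count - 1:
            high = INT32_MAX                  # last shard absorbs the remainder
        ranges.append((low, high))
    return ranges

for low, high in hash_ranges(16)[:3]:
    print(f"{low:>11} | {high:>11}")
```

Every possible hash value falls in exactly one range, which is what lets a single BETWEEN lookup against the metadata table identify the shard for a key.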

Page 72:

pg_shard

Worker nodes

— Logical shards are placed on nodes

— Each placement is one PostgreSQL table

— Object names extended by shard identifier (e.g. click_events_1001 for shard 1001)

— Indexes, constraints propagated at creation

Page 73:

pg_shard

Worker failure

— Unreachable nodes marked as inactive

— Repair with master_copy_shard_placement

1. Replay DDL commands for table, objects

2. Copy data from healthy node

3. Update master metadata

Page 74:

pg_shard

First steps…

Page 75:

pg_shard

Distributing a table

-- create regular table and some indexes
CREATE TABLE users (
    id integer NOT NULL,
    name text NOT NULL,
    birthday date NOT NULL,
    CONSTRAINT name_present CHECK (btrim(name) != '')
);

CREATE INDEX id_idx ON users (id);
CREATE INDEX bday_idx ON users (birthday);
CREATE INDEX name_idx ON users (name);
CREATE INDEX pfx_idx ON users (lower(name) text_pattern_ops);

Page 76:

pg_shard

Distributing a table

CREATE EXTENSION IF NOT EXISTS pg_shard;

-- designate table as distributed; specify key
SELECT master_create_distributed_table('users', 'id');

-- create sixteen shards, each with two copies
SELECT master_create_worker_shards('users', 16, 2);

Page 77:

pg_shard

Just use SQL!

INSERT INTO users VALUES (1, 'Jason Petersen', '2015-03-23');
INSERT INTO users VALUES (2, 'Ozgun Erdogan', '2013-02-11');
INSERT INTO users VALUES (3, 'Ageless', NULL);
INSERT INTO users VALUES (4, ' ', '2010-08-17');

DELETE FROM users WHERE id = 2;

UPDATE users SET birthday = '1900-06-01' WHERE id = 1;

SELECT name FROM users WHERE id = 1;
SELECT max(birthday) FROM users;

Page 78:

Under the hood

Page 79:

pg_shard

PostgreSQL hooks

— Full control over command lifecycle

— Specific hooks for specific needs:

    — Planning

    — Execution (Start, Run, Finish, End)

    — Utility

Page 80:

pg_shard

Planning phase

— Determine whether distributed

— Fall through to PostgreSQL if not (enables regular tables on master!)

— Find involved shards based on shard key

— Deparse query to shard-specific SQL

Page 81:

pg_shard

Planning example

Starting with the input SQL…

INSERT INTO users VALUES (5, 'Tom Lane', '2005-07-08');

Page 82:

pg_shard

Planning example

… determine the partition key clauses…

(id = 5)

Page 83:

pg_shard

Planning example

… use them to find the proper shard…

SELECT id FROM pgs_distribution_metadata.shard
WHERE hashint4(5) BETWEEN min_value::integer AND max_value::integer
  AND relation_id = 'users'::regclass;

  id
-------
 10003

Page 84:

pg_shard

Planning example

… generate shard-specific SQL…

INSERT INTO users_10003 VALUES (5, 'Tom Lane', '2005-07-08');

… and send it to the shard’s placements.
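The planning steps above (extract the key, hash it, look up the covering shard, rewrite the table name) can be sketched outside the database. Everything here is illustrative: the `SHARDS` rows are made up, `stand_in_hash` is based on `zlib.crc32` and is not PostgreSQL's `hashint4`, and `find_shard` and `shard_sql` are hypothetical helper names. The shard id it picks will therefore differ from the 10003 found by the real query.

```python
import bisect
import zlib

# (shard_id, min_value, max_value) rows, shaped like the shard metadata table.
# Ids and ranges are invented for this example.
SHARDS = [
    (10000, -2**31, -2**30 - 1),
    (10001, -2**30, -1),
    (10002, 0, 2**30 - 1),
    (10003, 2**30, 2**31 - 1),
]

def stand_in_hash(value):
    """Map a key to a signed 32-bit integer (stand-in, not hashint4)."""
    return zlib.crc32(str(value).encode()) - 2**31

def find_shard(key):
    """Return the id of the shard whose [min, max] range covers hash(key)."""
    token = stand_in_hash(key)
    mins = [row[1] for row in SHARDS]
    index = bisect.bisect_right(mins, token) - 1
    shard_id, low, high = SHARDS[index]
    assert low <= token <= high
    return shard_id

def shard_sql(table, key, values_sql):
    """Rewrite an INSERT so it targets the shard-specific table name."""
    return f"INSERT INTO {table}_{find_shard(key)} VALUES {values_sql};"

print(shard_sql("users", 5, "(5, 'Tom Lane', '2005-07-08')"))
```

The same lookup serves reads: a SELECT with `id = 5` in its WHERE clause resolves to the same shard, which is why keys that appear frequently in WHERE clauses make good shard keys.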

Page 85:

pg_shard

Execution

Now we know what the SQL is and where it should be routed. Execution logic differs depending on whether the query is a SELECT or a modification.

Page 86:

pg_shard

Distributed modification

— Locks enforce safe commutation

— Replicas visited in predictable order

— Per-session libpq connection pool

— If replica errors out, mark as inactive
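A toy model of these rules, for illustration only: the `Placement` class and `insert_row` function are hypothetical, locking and connection pooling are left out, and the choice to fail the write only when no placement accepts it is an assumption of this sketch.

```python
class Placement:
    """One copy of a shard on one worker node."""

    def __init__(self, node, healthy=True):
        self.node = node
        self.healthy = healthy   # simulates whether the node is reachable
        self.active = True       # what the master's metadata records
        self.rows = []

    def execute(self, row):
        if not self.healthy:
            raise ConnectionError(f"{self.node} unreachable")
        self.rows.append(row)

def insert_row(placements, row):
    """Apply a write to every active placement, visiting them in a fixed order.

    Placements that error out are marked inactive. The write as a whole
    fails only if no placement accepted it.
    """
    succeeded = 0
    for placement in sorted(placements, key=lambda p: p.node):
        if not placement.active:
            continue
        try:
            placement.execute(row)
            succeeded += 1
        except ConnectionError:
            placement.active = False   # needs repair before it is used again
    if succeeded == 0:
        raise RuntimeError("write failed on all placements")
    return succeeded

shard6 = [Placement("node1"), Placement("node3", healthy=False)]
insert_row(shard6, {"id": 5})
print([(p.node, p.active, len(p.rows)) for p in shard6])
```

After the call, the copy on node1 holds the row and the copy on node3 is flagged inactive, so later writes skip it until it has been repaired.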

Page 87:

Single-shard INSERT

Replication factor: 2

— Master: INSERT INTO customer_reviews ...

— Worker Node #1: shards 1, 3, 4, 6, 7, 9, …

— Worker Node #2: shards 1, 2, 4, 5, 7, 8, …

— Worker Node #3: shards 2, 3, 5, 6, 8, 9, …

Page 88:

[Diagram: Single-shard INSERT, one replica fails. The master sends INSERT INTO customer_reviews ... Worker Node #1 holds shards 1, 3, 4, 6, 7, 9; Worker Node #2 holds shards 1, 2, 4, 5, 7, 8; Worker Node #3 holds shards 2, 3, 5, 6, 8, 9.]

[Diagram: Single-shard INSERT, master marks inactive. After INSERT INTO customer_reviews ... fails on one replica, the master sets shard 6, node 3 to inactive status. Worker Node #1 holds shards 1, 3, 4, 6, 7, 9; Worker Node #2 holds shards 1, 2, 4, 5, 7, 8; Worker Node #3 holds shards 2, 3, 5, 6, 8, 9.]

pg_shard

Modification semantics

— Consistent (read your own writes)

— Safety comes from commutativity rules

— Can reorder SELECTs and INSERTs

— Not so for UPDATEs and DELETEs

— Constraints require predictable order
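One way to picture the commutativity rules above is as a two-mode lock on each shard. The sketch below is an assumption about how such rules could be encoded, not pg_shard's actual lock table: SELECT and INSERT take a shared mode and may interleave, while UPDATE and DELETE (and anything unrecognized, conservatively) take an exclusive mode.

```python
SHARED, EXCLUSIVE = "shared", "exclusive"


def lock_mode(command: str) -> str:
    """Map a command type to a hypothetical per-shard lock mode."""
    return SHARED if command in ("SELECT", "INSERT") else EXCLUSIVE


def may_interleave(first: str, second: str) -> bool:
    """Two commands on one shard may be reordered only if both are shared."""
    return lock_mode(first) == SHARED and lock_mode(second) == SHARED
```

For example, `may_interleave("INSERT", "INSERT")` is true, while `may_interleave("INSERT", "UPDATE")` is false, so the UPDATE has to wait its turn on every replica.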

pg_shard

Targeted SELECT

— Fetch entire result from single shard

— Fail over to another replica on error

— Do not modify state if failure occurs

— Common key-value access pattern
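The read path above can be sketched as a simple loop over placements. `targeted_select` and `run_query` are hypothetical names. Unlike the modification path, a failure here only moves on to the next replica and leaves placement state untouched.

```python
def targeted_select(placements, run_query):
    """Return the result from the first placement that answers.

    `placements` is an ordered list of candidate replicas and `run_query`
    is a caller-supplied callable (hypothetical) that raises on failure.
    """
    errors = []
    for placement in placements:
        try:
            return run_query(placement)
        except Exception as exc:
            # Read path: remember the error but do not mark anything inactive.
            errors.append((placement, exc))
    raise RuntimeError(f"all {len(placements)} placements failed: {errors}")
```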

[Diagram: Targeted SELECT, try first placement. The master receives SELECT * FROM customer_reviews WHERE customer_id = 'HN892'; Worker Node #1 holds shards 1, 3, 4, 6, 7, 9; Worker Node #2 holds shards 1, 2, 4, 5, 7, 8; Worker Node #3 holds shards 2, 3, 5, 6, 8, 9.]

[Diagram: Targeted SELECT, encounter error. The master's SELECT * FROM customer_reviews WHERE customer_id = 'HN892'; fails on the first placement. Worker Node #1 holds shards 1, 3, 4, 6, 7, 9; Worker Node #2 holds shards 1, 2, 4, 5, 7, 8; Worker Node #3 holds shards 2, 3, 5, 6, 8, 9.]

[Diagram: Targeted SELECT, try next placement. The master retries SELECT * FROM customer_reviews WHERE customer_id = 'HN892'; on the shard's other placement. Worker Node #1 holds shards 1, 3, 4, 6, 7, 9; Worker Node #2 holds shards 1, 2, 4, 5, 7, 8; Worker Node #3 holds shards 2, 3, 5, 6, 8, 9.]

pg_shard

Limitations

— Transactions cannot…

— involve multiple shards

— span multiple statements

— Cross-shard constraints unenforced

What are people building?

pg_shard’s capabilities and limitations are similar to those of many popular NoSQL solutions.

pg_shard in Production

— Clickstream event data

— HyperLogLog [4] for scalable UNIQUEs

— 30,000 INSERTs/second ingest

— Around 200GB data already

— CitusDB SELECTs: 100x faster

[4] “HyperLogLog data structures as a native [PostgreSQL] data type”

Upcoming features?

— More SQL coverage

— Rebalancing

— Multi-master

— Auto-recovery

— INSERT streaming/pipelining

— Suggestions welcome

Scaling summary

— Explore these avenues first!

— Many little experiments

— Cross-cutting; whole-stack

— Get out every ounce

Sharding summary

— Shard once you rule out all else

— Use many small “logical shards”

— Think carefully when picking key

— pg_shard/CitusDB merging!
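To illustrate why many small logical shards help, here is a rough sketch with hypothetical helpers `assign_shards` and `add_node`. Keys map to a fixed set of logical shards, and only whole shards move when a node joins, so no key ever needs rehashing. The round-robin placement and the rebalancing rule are assumptions chosen for brevity.

```python
def assign_shards(shard_ids, nodes):
    """Place logical shards on physical nodes round-robin."""
    return {sid: nodes[i % len(nodes)] for i, sid in enumerate(shard_ids)}


def add_node(assignment, new_node):
    """Move whole shards to `new_node` until it holds a fair share.

    Mutates `assignment` in place and returns the list of moved shard ids.
    """
    nodes = sorted(set(assignment.values()) | {new_node})
    target = len(assignment) // len(nodes)
    counts = {node: 0 for node in nodes}
    for owner in assignment.values():
        counts[owner] += 1
    moved = []
    for sid in sorted(assignment):
        if counts[new_node] >= target:
            break
        owner = assignment[sid]
        if counts[owner] > target:
            # Hand this shard over; the key-to-shard mapping never changes.
            assignment[sid] = new_node
            counts[owner] -= 1
            counts[new_node] += 1
            moved.append(sid)
    return moved
```

With 12 logical shards on three nodes, adding a fourth node moves three shards and leaves the other nine where they are.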

pg_shard summary

— Open source sharding for PostgreSQL

— First-class PostgreSQL extension

— LOAD, CREATE TABLE, distribute

— https://github.com/citusdata/pg_shard

Contact

— Jason: [email protected]

— General: [email protected]

Questions