Spilo, a highly-available PostgreSQL cluster / Oleksii Kliukin (Zalando SE)

Spilo, highly-available PostgreSQL cluster

Oleksii Kliukin, Zalando SE

Zalando
• 15 EU countries
• 3 fulfilment centers
• 15+ million active customers
• 2.2 billion € revenue (2014)
• 150 000+ products

We are growing!

Zalando platform

Our databases
• >150 production PostgreSQL databases
• >13.5 TB of data
• >5 TB in the biggest DB
• 400-1000+ write TPS
• >2 DB failures per month

Zalando never sleeps

Infrastructure bottleneck

ACID team: create, alter, deploy, migrate, failover, upgrade

80+ teams

Radical Agility

Purpose

Autonomy

Mastery

Cloud
• 2013: ZCloud

• 2014: project Pequod

• 2015: Let’s just use AWS…

Amazon 3-letter words

• AWS: Amazon Web Services
• EC2: Elastic Compute Cloud
• ELB: Elastic Load Balancer
• RDS: Relational Database Service

AWS
• One account per team

• Microservices

• REST/OAuth2

• Deployment with Docker

Autonomous teams on AWS

[Diagram: autonomous teams communicating via REST over the Internet]

Autonomous teams
• Team decides which product to build
• … and which technologies to use
• REST/OAuth2 is mandatory
• Team is responsible for its infrastructure

Databases?
• Developers should take care of infrastructure
• … including production databases

• On AWS!

Isn’t it dangerous?

DBAs running with scissors, by Gavin M. Roy: https://www.flickr.com/photos/gavinmroy/4638958958


ACID team provides PostgreSQL trainings

What about failover?

Autofailover tasks

• Detect the master failure

• Elect a new master

• Redirect clients
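The three tasks above can be condensed into a decision taken on every iteration of an HA loop. The sketch below is a simplified illustration, not Patroni's actual code: the function name `ha_decision`, its parameters and the returned action strings are hypothetical, and the lock semantics (a leader key that disappears when the master stops refreshing it) follow the etcd-based design described later in this talk. Redirecting clients, e.g. repointing a load balancer after a promotion, is left out.

```python
from typing import Optional


def ha_decision(my_name: str, leader: Optional[str],
                i_am_master: bool, healthy: bool) -> str:
    """Hypothetical single step of an HA loop; returns the action to take."""
    if not healthy:
        # An unhealthy node must never keep or take the master role.
        return "demote" if i_am_master else "wait"
    if leader is None:
        # No leader key: the master is presumed failed (detection).
        # Promote only if the atomic lock acquisition succeeds (election).
        return "try_acquire_lock"
    if leader == my_name:
        # We hold the lock: keep it alive, or finish becoming master.
        return "refresh_lock" if i_am_master else "promote"
    # Another node holds the lock: a second master here would be split-brain.
    return "demote_and_follow" if i_am_master else "follow"
```

Tying the master role to holding the lock is what guards against split-brain in this sketch: a master that no longer sees its own name in the leader key steps down instead of continuing to accept writes.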

Autofailover issues

• Discarded writes

• Split-brain

• False positives

RDS?
• Support for PostgreSQL

• Automatic failover

• Most extensions

• Automatic backups

RDS?
• Vendor lock-in

• No superuser

• No untrusted languages

• No logical decoding plugins

• Rather expensive

EC2 + Linux HA

• Complex setup

• Lots of manual steps (e.g. new replica creation)

Spilo (სპილო)

Spilo does

• Rapid deployment of PostgreSQL on AWS EC2 instances

• Streaming replication with auto-failover

Spilo on AWS

[Diagram: one Spilo MASTER and two Spilo REPLICAs; the application sends DB requests over the master connection; the nodes update the cluster status in etcd.]

Failover

[Diagram sequence: the master disappears, leaving two Spilo REPLICAs; one replica becomes the Spilo MASTER and the master connection follows it; a new Spilo starts and joins as a REPLICA.]

What is Spilo?

[Diagram: three Patroni nodes (one MASTER, two REPLICAs) running in auto-scaling groups]

Patroni (პატრონი)
• Handles new replicas and failover

• Based on ideas and code of the Compose Governor

• Open-source

Compose Governor idea

Core to our PostgreSQL HA system is the Governor application which uses etcd as its repository of truth to discover which database instance is leader.

Distributed configuration systems

• Fault tolerant

• Reliably store small amounts of strongly-consistent data between distributed nodes

• Good for storing the PostgreSQL cluster state

Distributed consensus

[Diagram: a single LEADER node serving several CLIENTs]

Cluster state in etcd
$ etcdctl ls --recursive /service
/service/batman
/service/batman/optime
/service/batman/optime/leader
/service/batman/members
/service/batman/members/postgresql0
/service/batman/members/postgresql1
/service/batman/initialize
/service/batman/leader

Leader key
$ etcdctl get /service/batman/leader
postgresql0

• Points to the member key
• Has a TTL, autoexpires
• Acts as an exclusive lock
• Only the leader can become the master
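The leader-key properties above (TTL, auto-expiry, exclusive lock) can be modelled with a small in-memory class. This is a sketch for illustration only: `LeaderLock`, `try_acquire` and `refresh` are hypothetical names, and a real cluster gets these guarantees from etcd or ZooKeeper rather than from a local object. The clock is injectable so that expiry can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class LeaderLock:
    """Hypothetical in-memory stand-in for the leader key."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def current_leader(self) -> Optional[str]:
        # An expired key behaves as if it had been deleted.
        if self.holder is not None and self.clock() >= self.expires_at:
            self.holder = None
        return self.holder

    def try_acquire(self, name: str) -> bool:
        # Succeeds only if nobody holds a live lock ("create if absent").
        if self.current_leader() is not None:
            return False
        self.holder, self.expires_at = name, self.clock() + self.ttl
        return True

    def refresh(self, name: str) -> bool:
        # Only the current holder may extend the TTL ("compare and swap").
        if self.current_leader() != name:
            return False
        self.expires_at = self.clock() + self.ttl
        return True
```

If the master dies, it stops calling `refresh`, the key expires after at most `ttl` seconds, and the next `try_acquire` from a replica succeeds.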

Leader TTL
$ http http://127.0.0.1:2379/v2/keys/service/batman/leader
…
{
    "action": "get",
    "node": {
        "createdIndex": 48723,
        "expiration": "2015-10-23T14:51:49.686506977Z",
        "key": "/service/batman/leader",
        "modifiedIndex": 49037,
        "ttl": 27,
        "value": "postgresql0"
    }
}
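With etcd's v2 HTTP API, acquiring and refreshing the leader key can be expressed as conditional writes: `prevExist=false` creates the key only if it does not exist yet, and `prevValue=<name>` updates it only if it still holds that value; etcd rejects the write when the condition fails. The sketch below only builds the requests (it does not send them) and parses a reply shaped like the one above. The endpoint, the helper names `leader_request` and `parse_leader`, and the 30-second TTL used in the example are assumptions.

```python
import json
from typing import Tuple
from urllib.parse import urlencode
from urllib.request import Request

ETCD_URL = "http://127.0.0.1:2379"  # assumption: local etcd, as in the slide
LEADER_PATH = "/v2/keys/service/batman/leader"


def leader_request(name: str, ttl: int, refresh: bool) -> Request:
    """Build (but do not send) a conditional PUT on the leader key."""
    # Acquire: create only if absent. Refresh: update only if we still own it.
    condition = {"prevValue": name} if refresh else {"prevExist": "false"}
    url = f"{ETCD_URL}{LEADER_PATH}?{urlencode(condition)}"
    body = urlencode({"value": name, "ttl": ttl}).encode()
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/x-www-form-urlencoded"})


def parse_leader(response_body: str) -> Tuple[str, int]:
    """Return (leader name, remaining TTL in seconds) from an etcd v2 reply."""
    node = json.loads(response_body)["node"]
    return node["value"], node["ttl"]
```

Sending a request would be a matter of `urllib.request.urlopen(req)`; a failed precondition comes back as an HTTP error, which the caller treats as "someone else is the leader".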

Member key
$ etcdctl get /service/batman/members/postgresql0
{
    "role": "master",
    "state": "running",
    "conn_url": "postgres://replicator:<password>@<host>:5432/postgres",
    "api_url": "http://127.0.0.1:8008/patroni",
    "xlog_location": 67108960
}
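Member keys such as the one above give every node what it needs to pick a failover candidate. A minimal sketch, assuming replicas publish `"role": "replica"` and that the `nofailover` tag mentioned later in this talk is stored under a `tags` object; both field layouts are assumptions, and `pick_candidate` is a hypothetical helper. It prefers the running replica with the highest `xlog_location`, i.e. the one that has received the most WAL, which limits the amount of discarded writes.

```python
import json
from typing import Dict, Optional


def pick_candidate(members: Dict[str, str]) -> Optional[str]:
    """members maps member name -> JSON value of its member key."""
    best_name, best_pos = None, -1
    for name, raw in sorted(members.items()):
        member = json.loads(raw)
        # Only running replicas are eligible.
        if member.get("role") != "replica" or member.get("state") != "running":
            continue
        # Assumed layout: {"tags": {"nofailover": true}} excludes the member.
        if member.get("tags", {}).get("nofailover"):
            continue
        pos = member.get("xlog_location", -1)
        if pos > best_pos:  # strict ">" keeps the first name on ties
            best_name, best_pos = name, pos
    return best_name
```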

Connection and API URL

[Diagram: two Patroni nodes, MASTER and REPLICA. Clients reach each node's CONNECTION URL through a MASTER LB or a REPLICA LB; the API URL is used to check health during promotion.]
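The split between the connection URL (for clients, via the load balancers) and the API URL (for health checks) maps directly onto fields of the member key. A sketch, with the hypothetical helper `lb_targets` deciding which hosts a MASTER LB or REPLICA LB should route to; the role strings and the sample addresses used to exercise it are assumptions.

```python
import json
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def lb_targets(member_values: Iterable[str], role: str) -> List[Tuple[str, int, str]]:
    """Return (host, port, health_check_url) for every member with the given role."""
    targets = []
    for raw in member_values:
        member = json.loads(raw)
        if member.get("role") != role:
            continue
        conn = urlsplit(member["conn_url"])
        # The LB needs only host and port; credentials in conn_url are ignored.
        targets.append((conn.hostname, conn.port, member["api_url"]))
    return sorted(targets)
```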

Initialize key
$ etcdctl get /service/batman/initialize
6208852353820383446

• PostgreSQL cluster system ID
• Created by the first node that joins the cluster
• Nodes with a different system ID are not allowed to join
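The first-writer-wins rule for the initialize key is another create-if-absent operation. A sketch, assuming the system ID is read from `pg_controldata` output (its "Database system identifier" line) and using a plain dict as a stand-in for the DCS; `system_id_from_controldata` and `may_join` are hypothetical names. Against etcd, the create-if-absent step would be a conditional write such as `prevExist=false`.

```python
from typing import Dict


def system_id_from_controldata(output: str) -> str:
    """Extract the system identifier from `pg_controldata` output."""
    for line in output.splitlines():
        if line.startswith("Database system identifier:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no 'Database system identifier' line found")


def may_join(dcs: Dict[str, str], key: str, system_id: str) -> bool:
    """First caller records its system ID; later callers must match it."""
    # dict.setdefault stands in for an atomic create-if-absent write.
    recorded = dcs.setdefault(key, system_id)
    return recorded == system_id
```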

Patroni modules

• Abstract DCS, with etcd and ZooKeeper implementations
• PostgreSQL
• REST API
• High availability
• Asynchronous executor
• Callbacks

Demo time!

https://asciinema.org/a/29087


From Governor to Patroni

Location of etcd: original

[Diagram: three Governor nodes]

Replace etcd with proxy

[Diagram: three Governor nodes, each paired with an etcd proxy]

Embed etcd client in Patroni

[Diagram: three Patroni nodes]

Patroni improvements
• Robust exception handling
• Long-running tasks (e.g. base backup) run in a separate thread
• etcd + ZooKeeper
• REST API
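Running a base backup in a separate thread keeps the HA loop (and with it the leader-key refresh) going while the copy is in progress. A sketch using Python's standard `concurrent.futures`; the `AsyncExecutor` class and the `fake_base_backup` stand-in are hypothetical, and allowing only one long task at a time is an assumption made for the example.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional


class AsyncExecutor:
    """Runs at most one long task in a background thread."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._task: Optional[Future] = None

    def busy(self) -> bool:
        return self._task is not None and not self._task.done()

    def run_async(self, fn: Callable[[], Any]) -> bool:
        if self.busy():
            return False  # refuse to start a second long task
        self._task = self._pool.submit(fn)
        return True

    def last_result(self) -> Any:
        # Re-raises the task's exception, if any, so failures are not lost.
        if self._task is None or not self._task.done():
            return None
        return self._task.result()


def fake_base_backup(seconds: float = 0.2) -> str:
    """Hypothetical stand-in for a slow base backup."""
    time.sleep(seconds)
    return "base backup done"
```

In the HA loop, each iteration would check `busy()` and carry on with its other duties, such as refreshing the leader key, instead of blocking until the backup finishes.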

Patroni improvements

• Configurable replica imaging

• Support for pg_rewind

Patroni improvements
• Manual failover
• Initialize from an external cluster
• Attach to already running PostgreSQL nodes
• Tags (e.g. nofailover)

What you should monitor
• Replication lag
• Unhealthy members
• No leader
• etcd/ZooKeeper
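The checks listed above can be expressed as one function over a cluster snapshot (leader name, member keys, DCS reachability). A sketch only: `cluster_alerts` is hypothetical, replication lag is approximated as the difference between the master's and each replica's `xlog_location`, assuming that value is a byte position, and the 16 MB threshold is an arbitrary assumption.

```python
from typing import Dict, List, Optional

MAX_LAG_BYTES = 16 * 1024 * 1024  # assumption: alert above 16 MB of lag


def cluster_alerts(leader: Optional[str], members: Dict[str, dict],
                   dcs_ok: bool, max_lag: int = MAX_LAG_BYTES) -> List[str]:
    if not dcs_ok:
        # Without etcd/ZooKeeper the rest of the state cannot be trusted.
        return ["etcd/ZooKeeper unreachable"]
    alerts = []
    if leader is None or leader not in members:
        alerts.append("no leader")
    for name, member in sorted(members.items()):
        if member.get("state") != "running":
            alerts.append(f"unhealthy member: {name}")
    if leader in members:
        master_pos = members[leader].get("xlog_location", 0)
        for name, member in sorted(members.items()):
            lag = master_pos - member.get("xlog_location", 0)
            if name != leader and lag > max_lag:
                alerts.append(f"replication lag on {name}: {lag} bytes")
    return alerts
```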

Thank you!
• Spilo: github.com/zalando/spilo, spilo.readthedocs.org
• Patroni: github.com/zalando/patroni, patroni.readthedocs.org
• Feedback: @alexeyklyukin

