Percona Live 2014 - Scaling MySQL in AWS

Scaling MySQL in AWS Presented by: Laine Campbell April 3rd, 2014

description

Laine Campbell, CEO of Blackbird, will explain the options for running MySQL at high volumes at Amazon Web Services, exploring options around database as a service, hosted instances/storage and all appropriate availability, performance and provisioning considerations using real-world examples from Call of Duty, Obama for America and many more. Laine will show how to build highly available, manageable and performant MySQL environments that scale in AWS: how to maintain them, grow them and deal with failure. Some of the specific topics covered are:
* Overview of RDS and EC2 – pros, cons and usage patterns/antipatterns.
* Implementation choices in both offerings: instance sizing, ephemeral SSDs, EBS, provisioned IOPS and advanced techniques (RAID, mixed storage environments, etc.)
* Leveraging regions and availability zones for availability, business continuity and disaster recovery.
* Scaling patterns including read/write splitting, read distribution, functional dataset partitioning and horizontal dataset partitioning (aka sharding)
* Common failure modes – AZ and Region failures, EBS corruption, EBS performance inconsistencies and more.
* Managing and mitigating cost with various instance and storage options

Transcript of Percona Live 2014 - Scaling MySQL in AWS

Page 1: Percona Live 2014 - Scaling MySQL in AWS

Scaling MySQL in AWS

Presented by: Laine Campbell

April 3rd, 2014

Page 2: Percona Live 2014 - Scaling MySQL in AWS

Agenda

1. Overview of options: RDS and EC2/MySQL
2. MySQL scaling patterns
3. Performance/Availability
4. Implementation choices
5. Common failure patterns

Page 3: Percona Live 2014 - Scaling MySQL in AWS

Who the *&%^#$ am I?

Laine Campbell

Co-Founder and CEO, Blackbird (formerly PalominoDB)

9 years building the DB team/infrastructure at Travelocity.

7 years at PalominoDB/Blackbird, supporting 50+ companies, 1000s of databases and way too much coffee.

Page 4: Percona Live 2014 - Scaling MySQL in AWS

AWS Options for MySQL: RDS and EC2/MySQL

A love story...

Page 5: Percona Live 2014 - Scaling MySQL in AWS

AWS Relational Database Service (RDS)

Basic Operations Managed

Ease of Deployment

Supports Scaling via Replication

Reliable via Replication, EBS RAID, Multi-AZ

Page 6: Percona Live 2014 - Scaling MySQL in AWS

Managed Operations

Backups and Recovery

Provisioning

Patching

Auto Failover

Replication

Page 7: Percona Live 2014 - Scaling MySQL in AWS

RDS Backup and Recovery

Storage is done via EBS

Snapshot and binlog based (point in time)

A Non Multi-AZ implementation creates spikes in latency during backups

Avoided in Multi-AZ via backups on the secondary

Snapshots only

Page 8: Percona Live 2014 - Scaling MySQL in AWS

Advanced Backup and Recovery

Creating non-RDS backups done via mysqldump, mydumper, custom extraction

You can create non-RDS replicas using a logical backup in 5.6 only

non-RDS replicas will break during AZ failovers - thus not useful for production or for large datasets
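
As a rough illustration of the mysqldump route mentioned on this slide, the sketch below only assembles the command line for a logical dump of one database. The endpoint, user and database names are hypothetical placeholders, and credentials are assumed to come from an option file rather than the command line.

```python
import shlex


def mysqldump_command(host, user, database, port=3306):
    """Build an argv list for a consistent logical dump of one database."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--port={port}",
        f"--user={user}",
        "--single-transaction",  # consistent InnoDB snapshot without locking tables
        "--routines",            # include stored procedures and functions
        "--triggers",            # include triggers
        "--databases", database,
    ]


# Hypothetical endpoint and names, for illustration only.
cmd = mysqldump_command("mydb.example.rds.amazonaws.com", "backup_user", "app")
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run` as-is; for large datasets a parallel tool such as mydumper is the more likely choice, as the slide notes.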

Page 9: Percona Live 2014 - Scaling MySQL in AWS

Disaster Recovery

Cross region replication is supported in 5.6

Cross region replication incurs cross-region data transfer costs

Relay replicas recommended if you wish to minimize expenses
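
The reasoning behind the relay recommendation can be sketched with simple arithmetic: every replica that pulls directly from a master in another region carries its own copy of the binlog stream across the region boundary, while a relay replica carries one copy and fans it out locally. The binlog volume and the per-GB price below are hypothetical inputs, not AWS figures.

```python
def cross_region_transfer_gb(binlog_gb_per_day, remote_replicas, use_relay):
    """GB per day of binlog traffic that crosses the region boundary."""
    # With a relay, only the relay itself replicates across regions.
    streams = 1 if (use_relay and remote_replicas > 0) else remote_replicas
    return binlog_gb_per_day * streams


PRICE_PER_GB = 0.02  # hypothetical transfer price, for illustration only

direct = cross_region_transfer_gb(50, 4, use_relay=False)
relayed = cross_region_transfer_gb(50, 4, use_relay=True)
print(f"direct:  {direct} GB/day, about ${direct * PRICE_PER_GB:.2f}/day")
print(f"relayed: {relayed} GB/day, about ${relayed * PRICE_PER_GB:.2f}/day")
```

The trade-off is that the relay becomes a single point of failure for every replica behind it.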

Page 10: Percona Live 2014 - Scaling MySQL in AWS

Provisioning

Initial creation of single or multi-AZ masters

Single command replica creation (serialized) via snapshots; multi-AZ avoids a one minute IO suspension.

Page 11: Percona Live 2014 - Scaling MySQL in AWS

Patching

Automatically managed in maintenance windows

Alerts sent for the coming week, so you can determine impact, reschedule, etc…

Multi-AZ mitigates impact of invasive maintenance

Page 12: Percona Live 2014 - Scaling MySQL in AWS

RDS Challenges (Opportunities?)

Abstraction from kernel, OS processlist, OS commands etc...

No SUPER access, changes to management via Stored Procedure (minimal but annoying)

Log access becomes more challenging (but manageable)

The more experienced an operator you are, the grumpier you will be!

Page 13: Percona Live 2014 - Scaling MySQL in AWS

RDS Challenges (Opportunities?)

Snapshot backups not portable/accessible outside of RDS

Multi-AZ failover can strand replicas when binlog consistency is relaxed for performance (sync_binlog=0).

Without the ability to manually CHANGE MASTER, one must rebuild all replicas after a failover.

Page 14: Percona Live 2014 - Scaling MySQL in AWS

RDS Visibility Impacts

Agent based instrumentation that requires localhost installation won’t work

No access to TCPDUMP/Port listening

No access to sar, the OS processlist (to check for swapping), vmstat, iostat, etc.

Log forensics become harder but manageable (must download first)

Page 15: Percona Live 2014 - Scaling MySQL in AWS

EC2 and MySQL

All the MySQL you’ve come to love and hate

Any topology you can dream up

Access to many more types of instances and storage

Page 16: Percona Live 2014 - Scaling MySQL in AWS

Why RDS or EC2?

You can’t run 5.6, and you can’t tolerate the risk of single region? (~99.95% SLA per month) Use EC2

You don’t have operational expertise to manage backups, provisioning and replication? Use RDS

pro-tip, if you can’t manage a system, how can you troubleshoot advanced performance issues with the visibility issues in RDS?

Page 17: Percona Live 2014 - Scaling MySQL in AWS

Why RDS or EC2?

Want MariaDB, XtraDB? Use EC2

Large data-sets generally require file level backups and portability? Use EC2

pro-tip, if you can’t get a mysqldump or a parallel dump to load/export in a timely fashion, you probably don’t want RDS

Page 18: Percona Live 2014 - Scaling MySQL in AWS

Scaling Patterns for MySQL in AWS

Page 19: Percona Live 2014 - Scaling MySQL in AWS

Scaling in RDS - Vertical

RAM up to 244 GB per instance, creating excellent ability to put large datasets in RAM

Network performance up to 10 Gbps

CPU up to 32 cores

Provisioned IOPS are game changers, and mandatory for production, performance-sensitive applications.

Page 20: Percona Live 2014 - Scaling MySQL in AWS

Scaling in RDS - Provisioned IOPs

1,000 - 30,000 IOPS

100 GB to 3 TB

Stable, predictable IO

Realizing Max IOPS - 20,000

● cr1.8xlarge Instance Type
● MySQL 16 KB Page Size
● Full Duplex IO Channel
● 50% reads, 50% writes
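
To put the 20,000 figure in bandwidth terms, the sketch below converts IOPS at the 16 KB MySQL page size into throughput, and splits it for the 50/50 read/write mix listed above. It assumes every IO is exactly one 16 KiB page, which is a simplification.

```python
def iops_to_mib_per_s(iops, io_size_kib=16):
    """Throughput implied by a given IOPS rate at a fixed IO size."""
    return iops * io_size_kib / 1024


total = iops_to_mib_per_s(20_000)               # all IO, both directions
per_direction = iops_to_mib_per_s(20_000 * 0.5)  # 50% reads, 50% writes

print(f"total: {total} MiB/s, per direction: {per_direction} MiB/s")
```

Under these assumptions the realized maximum needs roughly 156 MiB/s in each direction at once, which is why the full duplex channel and the balanced mix both appear in the list.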

Page 21: Percona Live 2014 - Scaling MySQL in AWS

Scaling in RDS - Provisioned IOPs

Overprovisioning beyond realized IOPS can reduce latency

● In an unbalanced workload, for instance reads consuming channel limits
● Write channel bandwidth remains unsaturated
● By doubling IOPS, you increase concurrency, thus reducing latency. Transaction rates increase
● Consumption of IOPS can reduce as transaction rates increase, and manifest as:
  ○ Improved use of group commit
  ○ Larger log writes

Page 22: Percona Live 2014 - Scaling MySQL in AWS

Scaling in RDS - Reads

Native replication allows for scale out of reads, just as in EC2 or your own datacenter

RAM up to 244 GB per instance, creating much better ability to put large datasets in RAM

5.6 allows for the memcached plugin
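
A minimal sketch of the read scale-out pattern on this slide: plain reads go to replicas in round-robin order, and everything else goes to the master. The host names are hypothetical, and the statement check is deliberately naive (a real router would also have to account for replication lag and session state).

```python
import itertools


class ReadWriteRouter:
    """Send plain reads to replicas (round-robin) and everything else to the master."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql, in_transaction=False):
        statement = sql.strip().upper()
        # Locking reads must see current data, so they stay on the master.
        is_plain_read = statement.startswith("SELECT") and "FOR UPDATE" not in statement
        if is_plain_read and not in_transaction and self._replicas is not None:
            return next(self._replicas)
        return self.master


router = ReadWriteRouter("master-host", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM users WHERE id = 7"))
print(router.route("UPDATE users SET name = 'x' WHERE id = 7"))
```

Reads inside a transaction are kept on the master here so that a session can read its own writes; that is a design choice of this sketch, not something the slides prescribe.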

Page 23: Percona Live 2014 - Scaling MySQL in AWS

Scaling in RDS - Writes

Like any system, you must split workloads if writes consume max capacity of PIOPS.

● Functional Partitioning
● Sharding
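
The two approaches named above can be sketched side by side. Functional partitioning maps whole tables to separate clusters; sharding spreads the rows of one dataset across several clusters by key. The cluster and shard names are hypothetical, and hash-modulo is only one of several possible shard lookup schemes.

```python
import hashlib

# Functional partitioning: whole tables live on different clusters (names hypothetical).
FUNCTIONAL_MAP = {"sessions": "cluster-sessions", "orders": "cluster-orders"}


def cluster_for_table(table, default="cluster-main"):
    """Return the cluster that owns a table, falling back to the main cluster."""
    return FUNCTIONAL_MAP.get(table, default)


def shard_for_key(key, shards):
    """Horizontal partitioning: map a row key to a shard with a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]
print(cluster_for_table("orders"))
print(shard_for_key(12345, SHARDS))
```

With plain modulo hashing, changing the number of shards remaps most keys, so rolling shards up and down usually relies on a lookup table or a fixed set of logical shards instead.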

Page 24: Percona Live 2014 - Scaling MySQL in AWS

Scaling in RDS - Concerns

Sharding:
● Management of RDS instances to roll shards up and down can be a new paradigm.
● Overall, this can be done, but does require a logical shift.

Resource Constraints:
● No access to SSDs (up to 91,250 read or 78,750 write IOPS of 16 KB size)

Data Movements:
● No access to data copies outside of replica builds can dramatically increase data movement time costs

Page 25: Percona Live 2014 - Scaling MySQL in AWS

Scaling in EC2 - Vertical

Higher variety of instances. Similar top level constraints of:

● RAM
● CPU
● PIOPS
● Network

Ephemeral SSD storage creates a whole new class of IO performance (up to 91,250 read or 78,750 write IOPS of 16 KB size)

Page 26: Percona Live 2014 - Scaling MySQL in AWS

Scaling in EC2 - Reads

In addition to standard MySQL replication, you have new options

● Galera, MariaDB/Galera and XtraDB Cluster
● Tungsten Replicator and Cluster

Page 27: Percona Live 2014 - Scaling MySQL in AWS

Scaling in EC2 - Writes

Sharding still becomes necessary, but in EC2 over RDS, one has access to snapshots:

● Management of large datasets becomes much easier

● Shard management functions in more typical paradigms

Page 28: Percona Live 2014 - Scaling MySQL in AWS

Scaling in EC2 - Concerns

SSD and Ephemeral Storage

● Instances become even more volatile
● Backups via EBS snapshot are impossible, requiring LVM snapshots or similar
● One might consider keeping writes on PIOPS volumes (max 20,000) and leveraging SSD for reads

Page 29: Percona Live 2014 - Scaling MySQL in AWS

Availability for MySQL in AWS

Page 30: Percona Live 2014 - Scaling MySQL in AWS

AWS Availability: Regions and Zones

Page 31: Percona Live 2014 - Scaling MySQL in AWS

AWS Availability: Regions and Zones

Amazon Regions equate to data-centers in different geographical regions.

Availability zones are isolated from one another in the same region to minimize impact of failures.

Page 32: Percona Live 2014 - Scaling MySQL in AWS

AWS Availability: Regions and Zones

Amazon states AZs do not share:

• Cooling
• Network
• Security
• Generators
• Facilities

Page 33: Percona Live 2014 - Scaling MySQL in AWS

AWS Availability: Regions and Zones

Apr, 2011 - US East Region EBS Failed
● Incorrect network failover.
● Saturated intra-node communications.
● Cascading failures impacted EBS in all AZs.

Jul, 2012 - US East Partial Impact
● Electrical storms impacted multiple sites.
● Failover of metadata DB took too long.
● EBS I/O was frozen to minimize corruption.

Page 34: Percona Live 2014 - Scaling MySQL in AWS

AWS Availability: Regions and Zones

99.95% Monthly SLA for a region (multiple AZs)

● Implies multiple AZ is mandatory
● Implies multi-region is necessary for 99.99% or higher
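
The implication above can be checked with a short calculation. The sketch assumes a 30-day month and, for the multi-region case, that regions fail independently and failover is instantaneous; both assumptions are optimistic.

```python
def monthly_downtime_minutes(availability, days=30):
    """Downtime budget per month for a given availability fraction."""
    return (1 - availability) * days * 24 * 60


def combined_availability(availability, regions):
    """Availability of N redundant regions, assuming independent failures."""
    return 1 - (1 - availability) ** regions


single = monthly_downtime_minutes(0.9995)   # about 21.6 minutes per month
target = monthly_downtime_minutes(0.9999)   # about 4.3 minutes per month
two_regions = combined_availability(0.9995, 2)

print(f"99.95% allows {single:.1f} min/month, 99.99% allows {target:.1f} min/month")
print(f"two independent regions: {two_regions:.6%}")
```

A single region at 99.95% has a downtime budget about five times larger than a 99.99% target allows, which is why a second region is needed to get there.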

Page 35: Percona Live 2014 - Scaling MySQL in AWS

Availability in RDS - Multi-AZ

The core of an HA solution

Block level replication, active/passive

Saves you from most master crashes

Reduces impact of backups, upgrades, locks for provisioning replicas

When not in 5.6, and using sync_binlog != 1, you often lose replicas during failover

Page 36: Percona Live 2014 - Scaling MySQL in AWS

Availability in RDS - Multi-AZ

IO impact from replication

You do not get to choose the failover AZ, meaning you must be ready to move app servers

Page 37: Percona Live 2014 - Scaling MySQL in AWS

Availability in RDS - Replicas

Redundant replicas make total sense. N+1 meets most needs with the ease of provisioning

You must have replicas in every AZ you have app servers in (if using replicas for reads)

AWS states that cross-AZ latency is in the low single-digit milliseconds. Real-world experience shows occasional, much larger spikes
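
One way to act on the "replicas in every AZ" advice is to have each app server prefer a replica in its own AZ and fall back to any replica only when none is local. This is a sketch under that assumption; the host and AZ names are hypothetical.

```python
import random


def pick_replica(app_az, replicas, rng=random):
    """Pick a read replica, preferring one in the app server's own AZ.

    replicas is a list of (host, az) tuples.
    """
    local = [host for host, az in replicas if az == app_az]
    # Fall back to any replica if the local AZ has none (cross-AZ latency applies).
    candidates = local or [host for host, _ in replicas]
    if not candidates:
        raise LookupError("no replicas available")
    return rng.choice(candidates)


REPLICAS = [("replica-a", "us-east-1a"), ("replica-b", "us-east-1b")]
print(pick_replica("us-east-1b", REPLICAS))
```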

Page 38: Percona Live 2014 - Scaling MySQL in AWS
Page 39: Percona Live 2014 - Scaling MySQL in AWS


Page 40: Percona Live 2014 - Scaling MySQL in AWS

Availability in EC2 - Options

You can use Galera, XtraDB Cluster, or similar for a read/write anywhere solution

MySQL MHA can be used to do failovers

Continuent’s Tungsten product can also manage failovers

Page 41: Percona Live 2014 - Scaling MySQL in AWS

AWS Benefits: Dynamicity

Page 42: Percona Live 2014 - Scaling MySQL in AWS

AWS Availability: Regions and Zones

| Type of Change | EC2 | RDS Master (Non Multi-AZ) | RDS Master (Multi-AZ) | RDS Replica |
|---|---|---|---|---|
| Instance resize up/down | Rolling Migrations | Moderate Downtime | Minimal Downtime | Moderate Downtime (take out of service) |
| EBS <-> PIOPS | Severe Performance impact. | Severe Performance impact. | Minor Performance impact. | Severe Performance Impact (take out of service) |
| PIOPS Amount Change | Minor Performance impact. | Minor Performance impact. | Minor Performance impact. | Performance Impact (take out of service) |
| Disk Space Change (add) | Performance impact. | Performance impact. | Minor Performance impact. | Performance Impact (take out of service) |
| Disk Space Change (reduce) | Rolling Migrations | Moderate Downtime | Moderate Downtime | Moderate Downtime (take out of service) |

Page 43: Percona Live 2014 - Scaling MySQL in AWS

AWS Failure Scenarios

Page 44: Percona Live 2014 - Scaling MySQL in AWS

Predicting and Managing Failure

Operations is about managing change and mitigating risk

Page 45: Percona Live 2014 - Scaling MySQL in AWS

Predicting and Managing Failure

Local Failures

• Database crashes
• Human error
  o Misconfigure
  o Write to a replica
  o Drop a table/database/career
• Localized EBS hangs and corruption
• Unacceptable/unpredictable performance

Page 46: Percona Live 2014 - Scaling MySQL in AWS

Predicting and Managing Failure

Local Failures

● When it goes bad, don’t waste time diagnosing.
  ○ Shoot it in the head!
● Plan!
  ○ Simulate availability and region level failures
  ○ Wipe storage, reduce IOPS, shut down
  ○ Chaos monkey is your friend
● Observe!
  ○ Monitor for early failures, predict
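
As one small example of the "Observe" step, the sketch below classifies a replica from a row of SHOW SLAVE STATUS output, passed in as a dictionary. The 30-second lag threshold and the three status labels are assumptions chosen for illustration.

```python
def replica_health(status, max_lag_seconds=30):
    """Classify a replica from one SHOW SLAVE STATUS row given as a dict."""
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")  # NULL (None) when replication is stopped
    if not (io_ok and sql_ok) or lag is None:
        return "broken"
    if lag > max_lag_seconds:
        return "lagging"
    return "ok"


sample = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 4}
print(replica_health(sample))
```

In the spirit of the first bullet, a replica that reports "broken" is a candidate for replacement rather than lengthy diagnosis.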

Page 47: Percona Live 2014 - Scaling MySQL in AWS

Predicting and Managing Failure

Mitigation

In RDS:
● Use Multi-AZ
● Use replicas in multiple AZs
● Replicate to multiple regions, and out of AWS

In EC2:
● Use a failover (Galera, Tungsten, MHA/HAProxy)
● Use multiple AZs and regions
● Frequent Backups (practicing restores)