© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steve Abraham – Solutions Architect
December 14, 2016
Deep Dive on Amazon Aurora, Including New Features
Agenda
What is Aurora?
Review of Aurora performance
New performance enhancements
Review of Aurora availability
New availability enhancements
Other recent and upcoming feature enhancements
Open source compatible relational database
Performance and availability of commercial databases
Simplicity and cost-effectiveness of open source databases
What is Amazon Aurora?
Performance
[Charts: write performance and read performance vs. instance size – Aurora, MySQL 5.6, MySQL 5.7]
Aurora scales with instance size for both read and write.
Scaling with instance sizes
Aurora vs. RDS MySQL – r3.4xlarge, Multi-AZ
Before: 15 ms
After: 5.5 ms
Aurora Migration
Aurora 3X faster on r3.4xlarge
Real-life data – gaming workload
Do less work:
Do fewer IOs
Minimize network packets
Cache prior results
Offload the database engine
Be more efficient:
Process asynchronously
Reduce latency path
Use lock-free data structures
Batch operations together
Databases are all about I/O. Network-attached storage is all about packets/second. High-throughput processing is all about context switches.
How did we achieve this?
MySQL with replica
[Diagram: primary instance in AZ 1 and replica instance in AZ 2, each writing to mirrored Amazon Elastic Block Store (EBS) volumes, with backups to Amazon S3. Types of write: binlog, data, double-write, log, FRM files. Numbered arrows 1–5 trace the write path.]
1. Issue write to EBS – EBS issues to mirror, acknowledge when both done
2. Stage write to standby instance through DRBD
3. Issue write to EBS on standby instance
Io flow
Steps 1, 3, 4 are sequential and synchronous
This amplifies both latency and jitter
Many types of writes for each user operation
Have to write data blocks twice to avoid torn writes
Observations
780K transactions
7,388K I/Os per million txns (excludes mirroring, standby)
Average 7.4 I/Os per transaction
Performance
30-minute Sysbench write-only workload, 100 GB dataset, RDS Multi-AZ, 30K PIOPS
IO traffic in MySQL
[Diagram: Aurora primary instance in AZ 1 with replica instances in AZ 2 and AZ 3, issuing asynchronous 4/6 quorum distributed writes to storage in all three AZs, with continuous backup to Amazon S3. Type of write: redo log records only.]
Amazon Aurora:
Boxcar redo log records – fully ordered by LSN
Shuffle to appropriate segments – partially ordered
Boxcar to storage nodes and issue writes
IO flow
Only write redo log records; all steps asynchronous
No data block writes (checkpoint, cache replacement)
6X more log writes, but 9X less network traffic
Tolerant of network and storage outlier latency
Observations
27,378K transactions – 35X more
950K I/Os per 1M txns (6X amplification) – 7.7X less
Performance
IO traffic in Aurora
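To make the 4/6 quorum concrete, here is a minimal Python sketch of a quorum write: a record is sent to all six segment copies across three AZs and counts as durable once any four acknowledge. Node names and the send call are hypothetical placeholders, not Aurora internals.

```python
# Illustrative 4-of-6 quorum write; node names and send_to_node() are placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed

STORAGE_NODES = ["az1-a", "az1-b", "az2-a", "az2-b", "az3-a", "az3-b"]
WRITE_QUORUM = 4  # acks required before the write counts as durable

def send_to_node(node: str, log_record: bytes) -> bool:
    """Stand-in for the network call that persists one redo record on one node."""
    return True  # assume the node persisted the record and acked

def quorum_write(log_record: bytes) -> bool:
    """Issue the record to all six copies; succeed once any four acknowledge."""
    acks = 0
    with ThreadPoolExecutor(max_workers=len(STORAGE_NODES)) as pool:
        futures = [pool.submit(send_to_node, node, log_record) for node in STORAGE_NODES]
        for done in as_completed(futures):
            if done.result():
                acks += 1
            if acks >= WRITE_QUORUM:
                return True  # durable even if two copies are slow or unavailable
    return False
```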
[Diagram: a storage node receives log records from the primary instance into an incoming queue, persists them to a hot log and update queue, sorts/groups and coalesces them into data blocks, gossips peer-to-peer with other storage nodes, stages point-in-time snapshots and backups to S3, and runs garbage collection and CRC scrubbing; steps 1–8 correspond to the IO flow below.]
All steps are asynchronous
Only steps 1 and 2 are in the foreground latency path
Input queue is 46X less than MySQL (unamplified, per node)
Favor latency-sensitive operations
Use disk space to buffer against spikes in activity
Observations
IO flow:
1. Receive record and add to in-memory queue
2. Persist record and ACK
3. Organize records and identify gaps in log
4. Gossip with peers to fill in holes
5. Coalesce log records into new data block versions
6. Periodically stage log and new block versions to S3
7. Periodically garbage collect old versions
8. Periodically validate CRC codes on blocks
IO traffic in Aurora (Storage Node)
[Diagram: MySQL read scaling – master (30% read, 70% write) ships changes to a replica (30% new reads, 70% write) via single-threaded binlog apply; each instance has its own data volume.]
Logical: Ship SQL statements to Replica
Write workload similar on both instances
Independent storage
Can result in data drift between Master and Replica
Physical: Ship redo from Master to Replica
Replica shares storage. No writes performed
Cached pages have redo applied
Advance read view when all commits seen
[Diagram: Amazon Aurora read scaling – master (30% read, 70% write) and replica (100% new reads) share Multi-AZ storage; the replica applies redo as page cache updates.]
IO traffic in Aurora Replicas – MySQL read scaling vs. Amazon Aurora read scaling
“In MySQL, we saw replica lag spike to almost 12 minutes which is almost absurd from an application’s perspective. With Aurora, the maximum read replica lag across 4 replicas never exceeded 20 ms.”
Real-life data - read replica latency
[Diagram: transactions T1…Tn issue reads, writes, and commits; commit records (T1–T8) at LSNs 10–50 accumulate in a commit queue (pending commits in LSN order) while the durable LSN at the head node advances over time via group commit.]
Traditional approach:
Maintain a buffer of log records to write out to disk
Issue write when buffer full or time out waiting for writes
First writer has latency penalty when write rate is low
Amazon Aurora:
Request I/O with first write, fill buffer till write picked up
Individual write durable when 4 of 6 storage nodes ACK
Advance DB durable point up to earliest pending ACK
Asynchronous group commits
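The difference between the two columns above is easier to see as code. Below is a simplified, illustrative Python sketch of asynchronous group commit (not Aurora source): writers append redo records and the flush is requested with the first write; a commit returns only once the durable LSN has advanced past its commit LSN.

```python
import threading

class GroupCommitLog:
    def __init__(self):
        self.cond = threading.Condition()
        self.buffer = []         # pending (lsn, record) pairs awaiting one group write
        self.next_lsn = 0
        self.durable_lsn = -1    # highest LSN acknowledged by the storage quorum

    def append(self, record: bytes) -> int:
        """Add a redo record to the current group and return its LSN."""
        with self.cond:
            lsn = self.next_lsn
            self.next_lsn += 1
            self.buffer.append((lsn, record))
            self.cond.notify_all()   # I/O is requested with the first write
            return lsn

    def flush_loop(self) -> None:
        """Background flusher: ship whatever has accumulated as a single write."""
        while True:
            with self.cond:
                while not self.buffer:
                    self.cond.wait()
                group, self.buffer = self.buffer, []
            quorum_write(group)          # placeholder: durable once 4 of 6 nodes ACK
            with self.cond:
                self.durable_lsn = max(self.durable_lsn, group[-1][0])
                self.cond.notify_all()

    def wait_for_commit(self, commit_lsn: int) -> None:
        """A transaction's commit completes when its LSN is durable."""
        with self.cond:
            while self.durable_lsn < commit_lsn:
                self.cond.wait()

def quorum_write(group):
    """Placeholder for the 4-of-6 quorum write sketched earlier."""
    pass
```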
MySQL thread model:
Standard MySQL – one thread per connection
Doesn't scale with connection count
MySQL EE – connections assigned to thread group
Requires careful stall threshold tuning
Aurora thread model:
Re-entrant connections multiplexed to active threads
Kernel-space epoll() inserts into latch-free event queue
Dynamically sized thread pool
Gracefully handles 5000+ concurrent client sessions on r3.8xl
[Diagram: client connections enter through epoll() into a latch-free task queue served by the thread pool.]
Adaptive thread pool
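A rough sketch of the Aurora-style model in Python, using the standard selectors module (epoll on Linux) to multiplex many client connections onto a small worker pool via a task queue. This illustrates the pattern only, not Aurora's implementation: Aurora's queue is latch-free and its pool is sized dynamically.

```python
import queue
import selectors
import socket
import threading

sel = selectors.DefaultSelector()                      # backed by epoll() on Linux
ready: "queue.Queue[socket.socket]" = queue.Queue()    # task queue for the pool
rearm: "queue.Queue[socket.socket]" = queue.Queue()    # connections to watch again

def worker() -> None:
    while True:
        conn = ready.get()
        data = conn.recv(4096)                    # one request per dispatch
        if data:
            conn.sendall(b"ok\n")                 # handle it, then hand the conn back
            rearm.put(conn)
        else:
            conn.close()                          # client went away

def serve(port: int = 9000, pool_size: int = 8) -> None:
    listener = socket.socket()
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", port))
    listener.listen()
    sel.register(listener, selectors.EVENT_READ)
    for _ in range(pool_size):                    # small pool, not one thread per connection
        threading.Thread(target=worker, daemon=True).start()
    while True:
        for key, _ in sel.select(timeout=0.1):
            if key.fileobj is listener:
                conn, _ = listener.accept()
                sel.register(conn, selectors.EVENT_READ)
            else:
                sel.unregister(key.fileobj)       # avoid dispatching the same conn twice
                ready.put(key.fileobj)
        while not rearm.empty():                  # re-register conns the pool finished with
            sel.register(rearm.get(), selectors.EVENT_READ)
```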
[Diagram: MySQL lock manager vs. Aurora lock manager – scan, insert, and delete operations contending on lock chains.]
Same locking semantics as MySQL
Concurrent access to lock chains
Multiple scanners allowed in an individual lock chain
Lock-free deadlock detection
Needed to support many concurrent sessions, high update throughput
Aurora lock management
New Performance Enhancements
Catalog concurrency: Improved data dictionary synchronization and cache eviction.
NUMA aware scheduler: Aurora scheduler is now NUMA aware. Helps scale on multi-socket instances.
Read views: Aurora now uses a latch-free concurrent read-view algorithm to construct read views.
[Chart: MySQL 5.6, MySQL 5.7, Aurora 2015, Aurora 2016 – in thousands of read requests/sec]
* r3.8xlarge instance, <1 GB dataset using Sysbench
25% Throughput gain
Cached read performance
Smart scheduler: Aurora scheduler now dynamically assigns threads between IO heavy and CPU heavy workloads.
Smart selector: Aurora reduces read latency by selecting the copy of data on a storage node with best performance
Logical read ahead (LRA): We avoid read IO waits by prefetching pages based on their order in the btree.
[Chart: MySQL 5.6, MySQL 5.7, Aurora 2015, Aurora 2016 – in thousands of requests/sec]
* r3.8xlarge instance, 1 TB dataset using Sysbench
Non-cached read performance
10% Throughput gain
[Diagram: MySQL lock manager vs. Aurora lock manager under hot row contention – scan, insert, and delete operations contending on lock chains.]
Hot row contention
Highly contended workloads had high memory and CPU usage
1.9 (Nov) – lock compression (bitmap for hot locks)
1.9 – replace spinlocks with blocking futex – up to 12x reduction in CPU, 3x improvement in throughput
December – use dynamic programming to release locks: from O(totalLocks * waitLocks) to O(totalLocks)
Throughput on Percona TPC-C 100 improved 29x (from 1,452 txns/min to 42,181 txns/min)
Percona TPC-C – 10GB
                   MySQL 5.6   MySQL 5.7   Aurora    Improvement
500 connections    6,093       25,289      73,955    2.92x
5000 connections   1,671       2,592       42,181    16.3x
* Numbers are in tpmC, measured using release 1.10 on an R3.8xlarge, MySQL numbers using RDS and EBS with 30K PIOPS
Percona TPC-C – 100GB
                   MySQL 5.6   MySQL 5.7   Aurora    Improvement
500 connections    3,231       11,868      70,663    5.95x
5000 connections   5,575       13,005      30,221    2.32x
Hot row contention
Accelerates batch inserts sorted by primary key – works by caching the cursor position in an index traversal.
Dynamically turns itself on or off based on data pattern
Avoids contention in acquiring latches while navigating down the tree.
Bi-directional, works across all insert statements
LOAD DATA INFILE, INSERT INTO … SELECT, REPLACE INTO, and multi-value inserts.
[Diagram: B-tree with root, index, and leaf levels (rows R0–R8), shown once for MySQL and once for Aurora.]
MySQL: Traverses B-tree starting from root for all inserts
Aurora: Inserts avoid index traversal by reusing the cached cursor position
Insert performance
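Since the cached cursor pays off when consecutive rows land next to each other in the B-tree, batch loads benefit from being ordered by primary key. A small sketch of that pattern; the events table and the conn object (any MySQL-protocol connection to the cluster endpoint) are illustrative.

```python
def bulk_insert(conn, rows, batch_size=1000):
    """rows: iterable of (id, payload) tuples, where id is the primary key."""
    ordered = sorted(rows, key=lambda r: r[0])   # primary-key order keeps the cached cursor useful
    with conn.cursor() as cur:
        for i in range(0, len(ordered), batch_size):
            cur.executemany(
                "INSERT INTO events (id, payload) VALUES (%s, %s)",  # multi-value insert
                ordered[i:i + batch_size],
            )
    conn.commit()
```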
MySQL 5.6 leverages Linux read ahead – but this requires consecutive block addresses in the btree. It inserts entries top down into the new btree, causing splits and lots of logging.
Aurora’s scan pre-fetches blocks based on position in tree, not block address.
Aurora builds the leaf blocks and then the branches of the tree.
No splits during the build.
Each page touched only once.
One log record per page.
2–4X better than MySQL 5.6 or MySQL 5.7
[Chart: index build time in hours for RDS MySQL 5.6, RDS MySQL 5.7, and Aurora 2016 – r3.large on 10 GB dataset, r3.8xlarge on 10 GB dataset, r3.8xlarge on 100 GB dataset.]
Faster index build
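The chart measures a plain secondary-index build; a minimal sketch of the operation follows. The orders table and columns are made up for illustration, and conn is assumed to be an open connection.

```python
import time

def build_index(conn):
    start = time.monotonic()
    with conn.cursor() as cur:
        # Aurora builds the leaf blocks first and then the branches, touching each
        # page once, so this completes far faster than the top-down insert approach
        # used by MySQL 5.6/5.7.
        cur.execute(
            "ALTER TABLE orders ADD INDEX idx_customer_date (customer_id, order_date)"
        )
    print(f"index build took {time.monotonic() - start:.1f} s")
```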
Need to store and reason about spatial data – e.g., "Find all people within 1 mile of a hospital"
Spatial data is multi-dimensional; B-Tree indexes are one-dimensional
Aurora supports spatial data types (point/polygon)
GEOMETRY data types inherited from MySQL 5.6
This spatial data cannot be indexed
Two possible approaches:
Specialized access method for spatial data (e.g., R-Tree)
Map spatial objects to one-dimensional space and store in a B-Tree – space-filling curve using a grid approximation
[Diagram: topological relationships between two geometries A and B – A COVERS B / B COVEREDBY A, A CONTAINS B / B INSIDE A, A TOUCH B / B TOUCH A, A OVERLAPBDYINTERSECT B, A OVERLAPBDYDISJOINT B, A EQUAL B, A DISJOINT B, A COVERS B / B ON A.]
Why Spatial Index
Z-index used in Aurora
Challenges with R-Trees:
Keeping it efficient while balanced
Rectangles should not overlap or cover empty space
Degenerates over time
Re-indexing is expensive
R-Tree used in MySQL 5.7
Z-index (dimensionally ordered space-filling curve):
Uses regular B-Tree for storing and indexing
Removes sensitivity to resolution parameter
Adapts to granularity of actual data without user declaration
E.g., GeoWave (National Geospatial-Intelligence Agency)
Spatial Indexes in Aurora
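A short sketch of what this looks like from SQL, assuming the SPATIAL INDEX syntax Aurora accepts on its tables and the MySQL 5.6 ST_ functions; the places table, columns, and conn are illustrative.

```python
CREATE_PLACES = """
CREATE TABLE places (
  id       BIGINT PRIMARY KEY,
  name     VARCHAR(100),
  location POINT NOT NULL,
  SPATIAL INDEX (location)    -- implemented as a Z-order index on a regular B-tree
)
"""

FIND_WITHIN = """
SELECT id, name
FROM places
WHERE ST_Contains(ST_GeomFromText(%s), location)
"""

def create_places_table(conn):
    with conn.cursor() as cur:
        cur.execute(CREATE_PLACES)

def find_within(conn, polygon_wkt):
    """polygon_wkt: a polygon in WKT, e.g. a rough 1-mile buffer around a hospital."""
    with conn.cursor() as cur:
        cur.execute(FIND_WITHIN, (polygon_wkt,))
        return cur.fetchall()
```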
[Chart: spatial index benchmark results]
* r3.8xlarge using Sysbench on <1 GB dataset
* Write Only: 4,000 clients; Select Only: 2,000 clients; ST_EQUALS
Spatial Index Benchmarks – Sysbench, points and polygons
Availability
“Performance only matters if your database is up”
Storage volume automatically grows up to 64 TB
Quorum system for read/write; latency tolerant
Peer to peer gossip replication to fill in holes
Continuous backup to S3 (built for 11 9s durability)
Continuous monitoring of nodes and disks for repair
10GB segments as unit of repair or hotspot rebalance
Quorum membership changes do not stall writes
[Diagram: storage volume striped across AZ 1, AZ 2, and AZ 3, with continuous backup to Amazon S3.]
Storage Durability
Aurora cluster contains primary node and up to fifteen replicas
Failing database nodes are automatically detected and replaced
Failing database processes are automatically detected and recycled
Customer application may scale-out read traffic across replicas
Replica automatically promoted on persistent outage
[Diagram: Aurora cluster nodes spread across AZ 1, AZ 2, and AZ 3 – a primary node plus secondary (replica) nodes.]
More Replicas
[Diagram: per-segment snapshots and log records over time for Segments 1–3, with a recovery point.]
Continuous Backup
Take periodic snapshot of each segment in parallel; stream the redo logs to Amazon S3
Backup happens continuously without performance or availability impact
At restore, retrieve the appropriate segment snapshots and log streams to storage nodes
Apply log streams to segment snapshots in parallel and asynchronously
Traditional databases:
Have to replay logs since the last checkpoint
Typically 5 minutes between checkpoints
Single-threaded in MySQL; requires a large number of disk accesses
Crash at T0 requires re-application of the SQL in the redo log since the last checkpoint
Amazon Aurora:
Underlying storage replays redo records on demand as part of a disk read
Parallel, distributed, asynchronous
No replay for startup
Crash at T0 will result in redo logs being applied to each segment on demand, in parallel, asynchronously
[Diagram: checkpointed data and redo log timelines with a crash at T0.]
Instant Crash Recovery
We moved the cache out of the database process
Cache remains warm in the event of database restart
Lets you resume fully loaded operations much faster
Instant crash recovery + survivable cache = quick and easy recovery from DB failures
[Diagram: SQL, transactions, and caching layers – the caching layer sits outside the database process.]
Caching process is outside the DB process and remains warm across a database restart
Survivable Caches
[Diagram: fail-over timelines (app running → DB failure → failure detection → DNS propagation → recovery). MySQL: 15–20 sec. Aurora with MariaDB driver: 3–20 sec.]
Faster Fail-Over
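The 3–20 second figure relies on the MariaDB Connector/J driver's built-in Aurora handling on the client side. An analogous, illustrative pattern in Python with PyMySQL: always connect to the cluster endpoint and reconnect/retry when a fail-over drops the connection. Endpoint, credentials, and timings are placeholders.

```python
import time
import pymysql

CLUSTER_ENDPOINT = "mycluster.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com"  # placeholder

def connect():
    return pymysql.connect(host=CLUSTER_ENDPOINT, user="admin", password="...",
                           database="mydb", connect_timeout=2)

def query_with_failover_retry(sql, params=(), attempts=15):
    last_err = None
    for _ in range(attempts):
        try:
            conn = connect()                      # cluster endpoint follows the new primary
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall()
            finally:
                conn.close()
        except pymysql.err.OperationalError as err:
            last_err = err
            time.sleep(1)                         # ride out failure detection / DNS propagation
    raise last_err
```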
[Charts: distribution of database fail-over times]
0–5 s – 30% of fail-overs
5–10 s – 40% of fail-overs
10–20 s – 25% of fail-overs
20–30 s – 5% of fail-overs
Database fail-over time
New Availability Enhancements
You also incur availability disruptions when you:
1. Patch your database software
2. Modify your database schema
3. Perform large-scale database reorganizations
4. Recover from DBA errors requiring database restores
Availability is about more than HW failures
[Diagram: Before ZDP – networking and application state are torn down when the old DB engine is replaced by the new DB engine, so user sessions terminate during patching. With ZDP – application and networking state are preserved across the swap from the old DB engine to the new one while the storage service stays attached, so user sessions remain active through patching.]
Zero downtime patching
We have to go to our current patching model when we can't park connections:
Long-running queries
Open transactions
Binlog enabled
Parameter changes pending
Temporary tables open
Locked tables
SSL connections open
Read replica instances
We are working on addressing the above.
Zero downtime patching – current constraints
Create a copy of a database without duplicate storage costs
Creation of a clone is nearly instantaneous – we don't copy data
Data copy happens only on write – when original and cloned volume data differ
Typical use cases:
Clone a production DB to run tests
Reorganize a database
Save a point-in-time snapshot for analysis without impacting the production system
[Diagram: a production database with multiple clones serving dev/test applications, benchmarks, and production applications.]
Database cloning
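In the RDS API, cloning is exposed as a point-in-time restore with a copy-on-write restore type. A minimal boto3 sketch, with all identifiers as placeholders:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Create the clone: a point-in-time restore that shares pages with the source.
rds.restore_db_cluster_to_point_in_time(
    DBClusterIdentifier="prod-clone-for-tests",      # new (cloned) cluster
    SourceDBClusterIdentifier="prod-cluster",
    RestoreType="copy-on-write",                     # share pages; copy only on divergence
    UseLatestRestorableTime=True,
)

# The clone has no instances yet; add one so it can serve queries.
rds.create_db_instance(
    DBInstanceIdentifier="prod-clone-instance-1",
    DBClusterIdentifier="prod-clone-for-tests",
    DBInstanceClass="db.r3.large",
    Engine="aurora",
)
```

A regular (full-copy) point-in-time restore uses the same call without the copy-on-write restore type, which is the offline PiTR case compared later in the deck.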
[Diagram: source database and cloned database both pointing at Pages 1–4 on the shared distributed storage system.]
Both databases reference same pages on the shared distributed storage system
How does it work?
[Diagram: as the source and cloned databases diverge, each adds its own new pages (Pages 5 and 6) while both continue to reference the pages they have in common on the shared distributed storage system.]
As databases diverge, new pages are added appropriately to each database while still referencing pages common to both databases
How does it work? (contd.)
MySQL:
Full table copy; rebuilds all indexes – can take hours or days to complete
Needs temporary space for DML operations
DDL operation impacts DML throughput
Table lock applied to apply DML changes
[Diagram: B-tree with root, index, and leaf blocks]
Amazon Aurora:
We add an entry to the metadata table and use schema versioning to decode the block
Added a modify-on-write primitive to upgrade the block to the latest schema when it is modified
Currently support add NULLable column at end of table
Priority is to support other add column, drop/reorder column, and modify datatypes
Metadata table (table name, operation, column name, timestamp):
Table 1 – add-col – column-abc – t1
Table 2 – add-col – column-qpr – t2
Table 3 – add-col – column-xyz – t3
Online DDL: Aurora vs. MySQL
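The fast path described above is the "add a NULLable column at the end of the table" case; a tiny sketch follows (table and column names are illustrative, conn is an open connection).

```python
def add_column(conn):
    with conn.cursor() as cur:
        # Aurora records the change in its DDL metadata table and upgrades
        # individual blocks to the new schema lazily (modify-on-write), so this
        # returns quickly even on very large tables (see the timings below).
        cur.execute("ALTER TABLE orders ADD COLUMN notes VARCHAR(255) NULL")
```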
On r3.large
               Aurora      MySQL 5.6    MySQL 5.7
10GB table     0.27 sec    3,960 sec    1,600 sec
50GB table     0.25 sec    23,400 sec   5,040 sec
100GB table    0.26 sec    53,460 sec   9,720 sec

On r3.8xlarge
               Aurora      MySQL 5.6    MySQL 5.7
10GB table     0.06 sec    900 sec      1,080 sec
50GB table     0.08 sec    4,680 sec    5,040 sec
100GB table    0.15 sec    14,400 sec   9,720 sec
Online DDL performance
[Diagram: timeline t0–t4 with a rewind to t1 performed at t2 and a rewind to t3 performed at t4; log branches bypassed by each rewind are marked invisible.]
What is online point in time restore?
Online point in time restore is a quick way to bring the database to a particular point in time without having to restore from backups
Rewind the database to quickly recover from unintentional DML/DDL operations
Rewind multiple times to determine the desired point in time in the database state – for example, quickly iterate over schema changes without having to restore multiple times
Online PiTR
Online PiTR operation changes the state of the current DB
Current DB is available within seconds, even for multi terabyte DBs
No additional storage cost as current DB is restored to prior point in time
Multiple iterative online PiTRs are practical
Rewind has to be within the allowed rewind period based on purchased rewind storage
Cross-region online PiTR is not supported
Offline PiTR
PiTR creates a new DB at desired point in time from the backup of the current DB
New DB instance takes hours to restore for multi terabyte DBs
Each restored DB is billed for its own storage
Multiple iterative offline PiTRs are time consuming
Offline PiTR has to be within the configured backup window or from snapshots
Aurora supports cross-region PiTR
Online vs Offline Point in Time Restore (PiTR)
[Diagram: per-segment snapshots and log records over time for storage Segments 1–3, with a rewind point.]
How does it work?
Take periodic snapshots within each segment in parallel and store them locally
At rewind time, each segment picks the previous local snapshot and applies the log streams to the snapshot to produce the desired state of the DB
[Diagram: the same t0–t4 timeline – rewinds to t1 and t3 leave the bypassed log branches invisible.]
How does it work? (contd.)
Logs within the log stream are made visible or invisible based on the branch within the LSN tree to provide a consistent view for the DB
The first rewind, performed at t2 to rewind the DB to t1, makes the logs in purple invisible
The second rewind, performed at t4 to rewind the DB to t3, makes the logs in red and purple invisible
Removing Blockers
Amazon Aurora PostgreSQL compatibility now in preview
Same underlying scale out, 3 AZ, 6 copy, fault tolerant, self healing, expanding database optimized storage tier
Integrated with a PostgreSQL 9.6 compatible database
[Diagram: SQL, transactions, and caching layers on top of the logging + storage tier, with backup to Amazon S3.]
My applications require PostgreSQL
T2 RI discounts
Up to 34% with a 1-year RI
Up to 57% with a 3-year RI
                 vCPU   Mem (GiB)   Hourly Price
db.t2.medium     2      4           $0.082
db.r3.large      2      15.25       $0.29
db.r3.xlarge     4      30.5        $0.58
db.r3.2xlarge    8      61          $1.16
db.r3.4xlarge    16     122         $2.32
db.r3.8xlarge    32     244         $4.64
T2.Small coming in Q1 2017
*Prices are for Virginia
An R3.Large is too expensive for my use case
Amazon Aurora gives each database instance IP firewall protection
Aurora offers transparent encryption at rest and SSL protection for data in transit
Amazon VPC lets you isolate and control network configuration and connect securely to your IT infrastructure
AWS Identity and Access Management provides resource-level permission controls
My databases need to meet certifications
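A provisioning sketch with boto3 that exercises the controls listed above: an encrypted cluster inside a VPC with a security group acting as the IP firewall. Identifiers, the KMS key alias, and subnet/security-group names are placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_cluster(
    DBClusterIdentifier="secure-aurora-cluster",
    Engine="aurora",
    MasterUsername="admin",
    MasterUserPassword="REPLACE_ME",                 # placeholder credential
    StorageEncrypted=True,                           # transparent encryption at rest
    KmsKeyId="alias/aurora-key",                     # hypothetical KMS key alias
    DBSubnetGroupName="private-db-subnets",          # keeps instances inside the VPC
    VpcSecurityGroupIds=["sg-0123456789abcdef0"],    # per-instance IP firewall protection
)
```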
MariaDB server audit plugin vs. Aurora native audit support
We can sustain over 500K events/sec
[Diagram: MariaDB server audit plugin – DDL, DML, query, DCL, and connect events each create an event string and are written to a file serially. Aurora native audit – the same event types create event strings that pass through a latch-free queue and are written to files by multiple writers.]
            MySQL 5.7   Aurora   Improvement
Audit Off   95K         615K     6.47x
Audit On    33K         525K     15.9x
Sysbench select-only workload on 8xlarge instance
Aurora Auditing
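Native auditing is switched on through DB cluster parameters. A boto3 sketch, assuming the server_audit_* parameter names used for Aurora auditing; the parameter group name is a placeholder.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-aurora-cluster-params",   # placeholder group
    Parameters=[
        {"ParameterName": "server_audit_logging",
         "ParameterValue": "1",
         "ApplyMethod": "immediate"},
        {"ParameterName": "server_audit_events",
         "ParameterValue": "CONNECT,QUERY,QUERY_DCL,QUERY_DDL,QUERY_DML",
         "ApplyMethod": "immediate"},
    ],
)
```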
Lambda: Generate Lambda event from Aurora stored procedures.
S3: Load data from S3, store snapshots and backups in S3.
IAM: Use IAM roles to manage database access control.
CloudWatch: Upload systems metrics and audit logs to CloudWatch.
AWS ecosystem
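The S3 and Lambda integrations are also reachable directly from SQL. A sketch using the LOAD DATA FROM S3 statement and the mysql.lambda_async procedure; the bucket, function ARN, table, and conn are placeholders, and the cluster needs an IAM role that permits both calls.

```python
LOAD_FROM_S3 = """
LOAD DATA FROM S3 's3://my-bucket/orders.csv'
INTO TABLE orders
FIELDS TERMINATED BY ','
"""

INVOKE_LAMBDA = """
CALL mysql.lambda_async(
  'arn:aws:lambda:us-east-1:123456789012:function:order-events',
  '{"order_id": 42, "status": "shipped"}'
)
"""

def run_integrations(conn):
    with conn.cursor() as cur:
        cur.execute(LOAD_FROM_S3)     # bulk load straight from S3
        cur.execute(INVOKE_LAMBDA)    # fire a Lambda event from inside the database
    conn.commit()
```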
[Partner categories: Business Intelligence, Data Integration, Query and Monitoring]
“We ran our compatibility test suites against Amazon Aurora and everything just worked." - Dan Jewett, VP, Product Management at Tableau
MySQL 5.6 / InnoDB compatible
No application compatibility issues reported in last 18 months
MySQL ISV applications run pretty much as is
Working on 5.7 compatibility – running a bit slower than expected; hope to make it available soon
Back-ported 81 fixes from different MySQL releases
MySQL compatibility
Available now Available in Dec Available in Q1
Performance
Availability
Security
Ecosystem
PCI/DSS, HIPAA/BAA
Fast online schema change, managed MySQL to Aurora replication
Cross-region snapshot copy, online point-in-time restore, database cloning, zero-downtime patching
Spatial indexing, lock compression, replace spinlocks with blocking futex, faster index build
Aurora auditing
IAM integration
Copy-on-write volume
T2.Medium, T2.Small, CloudWatch for metrics and audit
Timeline