AWS re:Invent 2016: Introduction to Managed Database Services on AWS (DAT307)
Transcript of AWS re:Invent 2016: Introduction to Managed Database Services on AWS (DAT307)
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steve Hunt - Director of Infrastructure, FanDuel
Alan Murray - Director of Architecture, FanDuel
Robin Spira - CTO, FanDuel
Darin Briskman, AWS Database Services
DAT307
Introduction to Managed
Database Services on AWS
AWS Database Landscape
Relational
Amazon Aurora
Amazon RDS
AWS Data Migration
Big Data & Integration
Amazon EMR
Amazon Redshift
AWS Data Pipeline
NoSQL & In-Memory
Amazon DynamoDB
Amazon ElastiCache
Analytics & Search
Amazon QuickSight
Amazon Elasticsearch Service
Amazon CloudSearch
Amazon Machine Learning
Amazon Kinesis, AWS Lambda, Amazon S3
Hosting your databases on-premises
Power, HVAC, net
Rack & stack
Server maintenance
OS patches
DB s/w patches
Database backups
Scaling
High availability
DB s/w installs
OS installation
App optimization
you
Hosting your databases in Amazon EC2
Power, HVAC, net
Rack & stack
Server maintenance
OS installation
OS patches
DB s/w patches
Database backups
Scaling
High availability
DB s/w installs
App optimization
you
Using an AWS Managed Database Service
App optimization
Power, HVAC, net
Rack & stack
Server maintenance
OS patches
DB s/w patches
Database backups
High availability
DB s/w installs
OS installation
Scaling
you
No infrastructure management
Scale up/down
Cost-effective
Instant provisioning
Application compatibility
Amazon Relational Database Service (Amazon RDS)
Amazon RDS for Aurora
• MySQL compatible with 5x better performance on the same
hardware: 100,000 writes/sec & 500,000 reads/sec
• Scalable with up to 64TB in single database, up to 15 read
replicas
• Highly available, durable, and fault-tolerant custom SSD
storage layer: 6 way replicated across 3 AZs
• Transparent encryption for data at rest using AWS KMS
• Stored procedures in Amazon Aurora can invoke AWS
Lambda functions
High availability with Aurora
• Aurora cluster contains primary
node and up to 15 secondary
nodes
• Failing database nodes are
automatically detected and
replaced
• Failing database processes are
automatically detected and recycled
• Secondary nodes automatically
promoted on persistent outage, no
single point of failure
• Customer application can scale out
read traffic across secondary nodes
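The read scale-out in the last bullet can be sketched as a round-robin over replica endpoints. This is a minimal illustration only: the hostnames are hypothetical, and in practice Aurora exposes a single cluster reader endpoint that balances connections across replicas for you.

```python
from itertools import cycle

# Hypothetical reader endpoints for an Aurora cluster with two secondaries.
# Real applications would normally use the single cluster reader endpoint
# that Aurora provides, rather than tracking replicas individually.
READERS = [
    "replica-1.cluster-ro.example.us-east-1.rds.amazonaws.com",
    "replica-2.cluster-ro.example.us-east-1.rds.amazonaws.com",
]

_reader_pool = cycle(READERS)

def next_reader() -> str:
    """Rotate read traffic across the secondary nodes; writes go to the primary."""
    return next(_reader_pool)

assert next_reader() == READERS[0]
assert next_reader() == READERS[1]
assert next_reader() == READERS[0]  # wraps around
```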
[Diagram: Aurora cluster with one primary node and secondary nodes replicated across AZ 1, AZ 2, and AZ 3]
PG&E: Aurora in Action
Servicing high-traffic surge during power events
had always been a problem.
Availability is critical when databases are down; it
adversely affects service to gas and electrical
customers.
Aurora benefits:
Ability to create multiple database replicas with
millisecond latency allows handling large surges in
traffic while giving customers timely, up-to-date
information during a power event.
Amazon Aurora’s 6-way replication, self-healing
storage and automatic instance repair provide the
availability and reliability needed for mission critical
applications.
One of the largest combination natural gas and electric utilities in the United States, with approximately 16 million customers in a 70,000-square-mile service area in northern and central California.
Database Conversion Capabilities with
AWS Database Migration Service
Source Database Target Database
Microsoft SQL Server Amazon Aurora, MySQL, PostgreSQL
MySQL PostgreSQL
Oracle Amazon Aurora, MySQL, PostgreSQL
Oracle Data Warehouse Amazon Redshift
PostgreSQL Amazon Aurora, MySQL
Teradata Amazon Redshift
Relational and NoSQL
Relational: optimized for storage, normalized/relational schema, ad hoc queries, scales vertically, mature solution
NoSQL: optimized for compute, denormalized/hierarchical data, instantiated views, scales horizontally, emerging technology
DynamoDB: Non-Relational Managed NoSQL
Database Service
• Schema-less data model
• Consistent low latency performance (single digit ms)
• Predictable provisioned throughput
• Seamless Scalability
• No storage limits
• High durability and availability: replication between 3 facilities
• Easy Administration – We scale for you!
• Low Cost
• Cost modelling on throughput and size
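The last bullet, cost modeled on throughput and size, can be sketched as a small estimator. The unit prices below are placeholders for illustration only, not real DynamoDB rates; check the current pricing page before using numbers like these.

```python
def monthly_cost(read_units, write_units, storage_gb,
                 read_price=0.00013, write_price=0.00065, gb_price=0.25):
    """Estimate a monthly DynamoDB bill from provisioned throughput and size.

    read_price / write_price are hypothetical per-unit-hour rates and
    gb_price a hypothetical per-GB-month rate, used purely to show the
    shape of the model: you pay for what you provision, plus storage.
    """
    hours = 730  # approximate hours in a month
    throughput = hours * (read_units * read_price + write_units * write_price)
    storage = storage_gb * gb_price
    return round(throughput + storage, 2)

estimate = monthly_cost(read_units=1000, write_units=200, storage_gb=50)
```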
DynamoDB Scalability
• No throughput limit
• No storage limit
• DynamoDB automatically partitions
data when:
• Data set grows
• Provisioned capacity increases
[Diagram: table split across partitions 1 .. N]
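The 2016-era DynamoDB documentation gave a rough rule of thumb for this automatic partitioning: a partition sustains about 3,000 read capacity units or 1,000 write capacity units and holds about 10 GB. A sketch of that estimate (illustrative only; the service manages partitioning internally and current behavior differs):

```python
import math

def estimated_partitions(rcu, wcu, table_size_gb):
    """Rough partition count per the 2016-era guidance:
    max(ceil(RCU/3000 + WCU/1000), ceil(size/10 GB))."""
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    by_size = math.ceil(table_size_gb / 10)
    return max(by_throughput, by_size)

# Throughput-bound table: 7,500 RCU + 500 WCU needs 3 partitions.
assert estimated_partitions(rcu=7500, wcu=500, table_size_gb=8) == 3
# Size-bound table: 55 GB needs 6 partitions regardless of low throughput.
assert estimated_partitions(rcu=1000, wcu=100, table_size_gb=55) == 6
```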
DynamoDB Durability
Writes: 3-way replication, quorum acknowledgment, persisted to disk (custom SSD)
Reads: strongly or eventually consistent, with no trade-off in latency
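The quorum-acknowledged writes above follow the classic quorum rule: with N replicas, W write acks and R read acks, a read is guaranteed to see the latest write whenever R + W > N. This toy model illustrates the rule only; DynamoDB's internal replication protocol is not documented in this detail.

```python
# With N=3 replicas, a write is acknowledged once a quorum of W=2 has it.
# A strongly consistent read consults R=2 replicas (R + W > N, so it must
# overlap the write quorum); an eventually consistent read uses R=1 and
# may briefly return stale data.
N, W = 3, 2
R_STRONG, R_EVENTUAL = 2, 1

def guarantees_latest(r, w, n=N):
    """True when a read quorum must overlap the most recent write quorum."""
    return r + w > n

assert guarantees_latest(R_STRONG, W) is True
assert guarantees_latest(R_EVENTUAL, W) is False
```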
Nexon Scales Mobile Gaming with Amazon DynamoDB
Nexon is a leading South Korean video game developer
and a pioneer in the world of interactive entertainment.
“By using AWS, we decreased our initial investment costs, and only pay for what we use.”
Chunghoon Ryu
Department Manager, Nexon
• Nexon uses Amazon DynamoDB as its
primary game database for the blockbuster
mobile game, Heroes of Incredible Tales
• HIT became the #1 Mobile Game in Korea
within the first day of launch and has > 2M
registered users
• Nexon’s HIT leverages DynamoDB to
deliver steady latency of less than 10ms to
deliver a fantastic mobile gaming
experience for 170,000 concurrent players
In-Memory Key-Value Store
High-performance
Redis and Memcached
Fully managed; Zero admin
Highly Available and Reliable
Hardened by Amazon
Redis – the fast in-memory database
Powerful: ~200 commands + Lua scripting
In-memory database
Utility data structures: strings, lists, hashes, sets, sorted sets, bitmaps & HyperLogLogs
Simple
Atomic operations: supports transactions, has ACID properties
Ridiculously fast! <500 microsecond latency for most commands
Highly available: replication
Persistent: snapshots
Open Source
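Of the data structures listed above, sorted sets are the one most often shown off, typically for leaderboards. The sketch below is a minimal pure-Python model of the sorted-set semantics (real code would issue ZADD / ZREVRANGE through a client such as redis-py; the member names here are made up):

```python
# Toy model of a Redis sorted set ("zset"): each member has a score,
# and range queries return members ordered by score.
scores = {}  # member -> score

def zadd(member, score):
    """Like ZADD: insert or update a member's score."""
    scores[member] = score

def zrevrange(start, stop):
    """Like ZREVRANGE: members by descending score, inclusive indices."""
    ranked = sorted(scores, key=lambda m: scores[m], reverse=True)
    return ranked[start:stop + 1]

zadd("alice", 120)
zadd("bob", 300)
zadd("carol", 210)
assert zrevrange(0, 1) == ["bob", "carol"]  # top two of the leaderboard
```
In Redis itself these operations are atomic and O(log N) per update, which is why a zset holds up under leaderboard-style write bursts.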
Expedia’s Real-time Analytics Application Uses Amazon ElastiCache
Expedia is a leader in the $1 trillion travel industry, with an extensive portfolio
that includes some of the world’s most trusted travel brands.
“With ElastiCache Redis as caching layer, the write throughput on DynamoDB has been set to 3500, down from 35000, reducing the cost by 6x.”
Kuldeep Chowhan
Engineering Manager, Expedia
• Expedia’s real-time analytics application
collects data for its “test & learn”
experiments on Expedia sites.
• The analytics application processes ~200
million messages daily.
• Bursts of write data were causing throttling,
requiring high write provisioning
• Implementing ElastiCache eliminated
redundant writes, allowing a 90% reduction
in provisioned write capacity
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper
The Amazon Redshift view of data warehousing
Enterprise: 10x cheaper, easy to provision, higher DBA productivity
Big data: 10x faster, no programming, easily leverage BI tools, Hadoop, machine learning, streaming
SaaS: analysis inline with process flows, pay as you go and grow as you need, managed availability and disaster recovery
Distributed search and analytics engine
Managed service using Elasticsearch and Kibana
Fully managed; Zero admin
Highly Available and Reliable
Tightly integrated with other AWS services
Case Study: McGraw Hill Education
Over 100 million learning events each month
• Tests, quizzes, learning modules begun / completed / abandoned
Supporting a wide catalog across multiple services in multiple jurisdictions
Analyzing student test results, student/teacher interaction, teacher effectiveness, student progress
Integrating analytics of applications and infrastructure to understand operations in real time
Amazon EMR
• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Storage with S3, HDFS, or MapR
• Leverage the elasticity of the cloud
• Baked in security features
• Pay by the hour and save with Spot
• Flexibility to customize
Amgen Using EMR and Cloud Capacity
#SPORTSRICH
Managing Data with AWS
AWS re:Invent 2016
Robin Spira
CTO
Alan Murray
Director, Architecture
Steve Hunt
Director, Infrastructure
● Founders launched HubDub in 2007
● Pivoted to FanDuel at SXSW in 2009
● FanDuel defines and refines Daily Fantasy Sports
● Mobile apps launched in 2014
● 6 million registered users by end of 2015
● 4 million app downloads by end of 2015
● Mobile apps become primary interface in 2016
● Friends Mode launched in 2016
A BRIEF HISTORY OF [FANDUEL] TIME
DAILY FANTASY SPORTS ROUND UP
● One-night-stand Fantasy Sports
● Variety of Sports
• NFL, NBA, NHL, MLB
• English and European Soccer
● Salary cap format
• Every athlete has an assigned value
• Skill is picking the best team within budget
● Range of contests:
• Head-to-Heads, 50/50, Multipliers, Beginner contests
• Million dollar prize pool Tournaments
● Friends Mode
• Play against friends all season for cash or bragging rights
THE EVOLUTIONARY BLUEPRINT
● Year-on-year growth in terms of:
• Active users
• Deposit value & frequency
• Revenue
• Brand awareness
● Over 6m users, 4m app downloads
BUSINESS GROWTH 1x
PLATFORM INCEPTION - 2009 1x
● Founder code running on terrestrial, co-lo LAMP stack
○ Monolithic PHP codebase
○ Single, monolithic MySQL instance
○ Schema not designed for data at scale
● Organic, time-pressured growth
○ In-the-moment decision making
○ Limited code re-factoring
○ Reliance on database layer to provide scaling
■ Server running at 90%+ of compute capacity on any given NFL Sunday
○ Co-location not easily scalable
● Summary: Scaling was a real problem
PLATFORM EVOLUTION - 2010 5x
● Migration to cloud provider - non-AWS
● Scalability limitations
○ Infrastructure
○ Application
○ Network
● Application layer changes made to enable horizontal scaling
● Monolithic database still there!
PLATFORM EVOLUTION - 2011 25x
● Migration to Amazon Web Services
○ Scalability limitations still present
○ Brown-outs common on NFL Sundays
● Monolithic database still there!
○ Self-managed MySQL installation on EC2
○ Single master
○ Multiple read replicas, statement-based replication
■ Relatively clean separation of read and write concerns
● Drawbacks:
○ Extra master compute load
○ Replication lag
○ Addition of read replicas time consuming and error prone
PLATFORM EVOLUTION - 2012 125x
● Move to Service-Oriented Architecture
○ Creation of Java service components with their own data stores
○ Message queues used for decoupling
● Started to use Puppet
● Elastic-all-the-things!
○ ELBs examined in earnest for stateless horizontal scalability
○ Adoption of ASGs to fulfil scaling requirements
○ Immediate benefits:
■ Fault tolerance & Service availability
■ Cost management
● But: Monolithic database still there!
○ Provisioned IOPs - peaking at 8,000 at one point
PLATFORM EVOLUTION - 2013 625x
● Business volumes now 625x
● REST-based API layer created to underpin iOS mobile app
○ Time pressure, little consideration given to data provisioning
○ Continued work on back-end services
● The great NFL blackout: Service down on the opening day!
○ Cascade failure: High contention = table lock catastrophe
○ Significant business impact
● Immediate work undertaken to improve scalability and DR
position
○ Migration to active / passive multi-master configuration
■ Higher availability but still below what was possible
■ Better data backup option
● All because: Monolithic database still there!
PLATFORM EVOLUTION - 2014 3,125x
● Business volumes now 3,125x
● Database made it through the season!
● Core internal API service launched
• Web re-platform to use new API
● iOS mobile app released
● Service degradation during NFL week 2
• redis backing store failure
● Monolithic database still there!
PLATFORM EVOLUTION - 2015 15,625x
● Massive acquisition campaign
○ Business volumes now very high
○ Paying out over $2 billion in winnings
● Formation of a dedicated architecture function
● Wide adoption of Infrastructure as code
○ CloudFormation provisioning via SparkleFormation
● More service-specific data stores
● Migration from multi-master, self-hosted MySQL on EC2 to RDS for
MySQL
○ No more manual management of replication and backups
○ Far easier to create and provision read replicas
○ No more hand-crafted EC2s or complex cluster management
2016: WHERE IS OUR PLATFORM NOW? 16,000x
● Very reliable
● Scalable with significant headroom
● Far better recovery position
○ Fewer service degradations
● Focus on broad-endpoint performance testing
○ Significant operational headroom
● Focus on tech debt reduction
● Focus on architectural innovation
● Still… that legacy monolithic database
○ Why not use DynamoDB?
THE DATA MANAGEMENT PROBLEM
[Chart: weekly traffic pattern from Thursday through Sunday]
● Operational model - service must be available!
● Seasonal - mostly predictable
○ Traffic spikes follow real sporting fixtures
○ Traffic goes from near zero to EVERYTHING NOW!
● Platform must be scaled in advance of game start
○ Users can’t try again later
○ Legal obligations
○ Any service degradation == greatly degraded business
ENTRY & EDIT
● Hard stop at game start
○ Entry value - substantial sums of money involved
○ We cannot overfill contests
● ACID compliance is very important
○ We perform atomic, distributed transactions
○ CAP constraints - we compromise Availability to maintain data consistency
● Remember that monolithic database?
○ During our peak in 2015:
■ ~ 1,000 INSERTs per second
■ ~ 200 UPDATEs per second
● Replication lag could still be a problem
○ Grace period between game lock and game start
[Chart: modifications per second — entries and lineups]
LIVE SCORING
● Requires a consistent view of the data - immediately!
● Starts immediately after the massive entry / edit peak
● Not strictly a critical path activity
○ Our users disagreed!
● Live Scoring has two scalability concerns
○ Back end load - maintain scoring state
○ Front end load - delivery to users
[Charts: contests scored per second; scoring views per second]
SETTLEMENT
● Tally up scores and pay out prizes!
● Involves aggregation of huge amounts of data
○ For just one slate of fixtures in NFL 2015:
■ ~ 10 million SELECTs
■ ~ 55 million INSERTs / UPDATEs
○ Constrained execution; limited time window
● On self-managed EC2 running MySQL, after tuning: ~ 5 hour execution time
● After the move to RDS for MySQL: ~ 1 hour 20 minutes execution time
WHAT DID WE LEARN ON THIS JOURNEY?
● Number one data concern: Consistency, Consistency, Consistency!
● Replication lag became more of an issue as business volume grew
● Maintenance of the multi-master cluster and other database services
○ Failover & Recovery
■ Would love RPO = 0; RTO = ASAP
○ Manual and prone to human error
○ Time consuming
THAT FIRST STEP: SELF-HOSTED TO RDS FOR MySQL
● Move from self-hosted to RDS for MySQL
○ Couldn’t adopt Amazon Aurora; still in Beta
● Greatly reduced levels of undifferentiated heavy lifting!
○ Automated failover!
○ Better RPO and RTO
○ Common interface / interaction
○ Replication lag greatly reduced
● My colleague Steve will share his migration experiences shortly
THE SECOND STEP: TESTING AMAZON AURORA
● FanDuel tested Aurora soon after Beta release
● Found a handful of minor teething issues
○ Data import was initially slow; migration tools have since been improved
○ Odd edge cases - disparate MTU from EC2 instances
● After minor teething issues fixed - everything was great!
OUR FIRST AURORA PRODUCTION EXPERIENCE
● Our first Aurora production service - ACTION LOG
○ Append only log of EVERY user action
○ Fairly insane INSERT throughput required; rarely read from
○ Used by customer service; to verify user activity
○ Needs to be consistent - usually at time critical points!
● Benefits
○ READ-AFTER-WRITE consistency - Goodbye replication lag!
○ Worst case, we have seen 3x - 5x performance uplift
○ Given us significant headroom
○ All Java core service persistence layers now Aurora
SELF-HOSTED REDIS TO ELASTICACHE
● Our Live Scoring service is backed by redis
● We have a number of redis-backed service components
○ WeatherCache, PlayerCache, User tokens, Global Session...
○ Up to 2015, these were self-managed on EC2
● We moved our redis-backed service components to ElastiCache
● Benefits - no manual intervention
○ RTO & RPO improvements
○ Managed master failover; quick promotion time
○ Operational simplicity, uniform interfaces and tooling
● Using AWS tooling
• Step 1. Configure DMS
• Step 2. ??
• Step 3. Profit!
HOW DID WE RUN THE MIGRATION EVENTS?
[Diagram: applications write to the active source, which replicates to the passive target]
● Demote the old, promote the new
master
● Setup replication
• Optionally set up active/passive replication
● Copy the data
• mysqldump --master-data=(1|2)
● Check consistency
● Build and warmup the target architecture
● Switch the application
● With zero data loss and minimal downtime
● Using familiar steps and tooling
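The "copy the data" step above hinges on one mysqldump flag: --master-data records the source's binary log coordinates in the dump, so the target can start replicating from exactly that point. A sketch of how that command might be assembled (the hostname and database name are placeholders):

```python
def dump_command(host, databases, master_data=2):
    """Build the mysqldump invocation for the copy step of the cutover.

    master_data=1 writes an active CHANGE MASTER statement into the dump;
    master_data=2 writes it commented out, for manual replication setup.
    """
    return [
        "mysqldump",
        f"--host={host}",
        f"--master-data={master_data}",
        "--single-transaction",  # consistent InnoDB snapshot without locking
        "--databases", *databases,
    ]

cmd = dump_command("source-db.example.internal", ["fanduel"])
assert "--master-data=2" in cmd
```
After the dump is loaded on the target, the recorded coordinates feed the "setup replication" step, which is what makes the final application switch possible with zero data loss.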
[Diagram: after cutover, applications point to the now-active target; the old source becomes passive]
SUMMARY
● Benefits of moving to AWS Managed Database Services
○ One more time: Automated failover!
○ One more time: Managed backups!
○ And no more replication lag!
○ Common interfaces and tooling - Simplicity is GOOD!
○ Operational cost reduction
● Acknowledgements
○ FanDuel Engineering - this team is awesome!
○ #ENGINEERINGRICH
● And in all of this, let’s not forget Amazon Redshift!
So … How Do I Choose?
It depends! The Four Key Attributes
1. Volume
2. Velocity
3. Variety
4. Vicinity
Decision Factor: Volume
How much data do you need online?
• ElastiCache for Redis < 3.5TB
• RDS SQL Server < 4TB
• RDS MySQL, MariaDB, PostgreSQL < 6TB
• RDS Aurora < 64TB
• DynamoDB (no limit)
• Elasticsearch < 150TB
• Redshift < 2PB
• EMR (no limit)
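The volume thresholds above can be turned into a simple first-pass filter. These are the 2016 limits as stated on the slide (current service limits differ), and the function is only a sketch of the elimination step, not a full decision tool:

```python
# Size ceilings from the slide, in TB; None means no fixed limit.
LIMITS_TB = {
    "ElastiCache for Redis": 3.5,
    "RDS SQL Server": 4,
    "RDS MySQL/MariaDB/PostgreSQL": 6,
    "RDS Aurora": 64,
    "DynamoDB": None,
    "Elasticsearch": 150,
    "Redshift": 2048,  # 2 PB
    "EMR": None,
}

def candidates(data_tb):
    """Services whose size ceiling covers the data set you need online."""
    return [svc for svc, cap in LIMITS_TB.items()
            if cap is None or data_tb <= cap]

assert "RDS Aurora" in candidates(10)
assert "RDS SQL Server" not in candidates(10)
```
Volume only narrows the field; the remaining factors (velocity, variety, vicinity) pick among the survivors.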
Decision Factor: Velocity
How fast does your data need to move?
• ElastiCache for Redis = µs to ms
• RDS = ms to seconds
• DynamoDB = ms (at any scale)
• Redshift = seconds to minutes
• Elasticsearch = seconds to minutes
• EMR = minutes to hours
Decision Factor: Variety
How big is the range of data types?
• RDS and Redshift are relational, so schema is enforced by
the database
• DynamoDB and ElastiCache are NoSQL, so schema is fluid
• Elasticsearch is a search engine, so it can autodetect
schemas
Decision Factor: Vicinity
What other applications and tools need to be near the
data?
• Do applications set certain database requirements?
• Are there compliance requirements to meet?
The Elephant Hiding in the Room
Source: Vikram Gupchup, CC by-sa, Dec 2011
Another Example: Asynchronous Online Gaming
• Highly available
• Elastic scalability
• Millions of players
Thank you!
Remember to complete
your evaluations!
Related Sessions
ARC311 - Evolving a Responsive and Resilient Architecture to Analyze Billions of Metrics
ARC403 - Building a Massive Microservices Gaming Platform Worthy of the Game of Thrones
ARC404 - Migrating a Highly Available and Scalable Database from Oracle to Amazon DynamoDB (Just Added!)
BDA304 - What’s New with Amazon Redshift
BDM302 - Real-Time Data Exploration and Analytics with
Amazon Elasticsearch Service and Kibana
BDM402 - Best Practices for Data Warehousing with
Amazon Redshift
DAT203 - Getting Started with Amazon Aurora
DAT301 - Amazon Aurora Best Practices: Getting the
Best Out of Your Databases
DAT302 - Best Practices for Migrating from Commercial
Database Engines to Amazon Aurora or
PostgreSQL
DAT303 - Deep Dive on Amazon Aurora
DAT304 - Deep Dive on Amazon DynamoDB
DAT305 - Deep Dive on Amazon Relational Database
Service
DAT306 - ElastiCache Deep Dive: Best Practices and
Usage Patterns
DAT314 - How Woot Migrated from MongoDB to Fully
Managed, Secure, and Scalable AWS Services
DAT318 - Migrating from RDBMS to NoSQL: How Sony
Moved from MySQL to Amazon DynamoDB
DAT320 - AWS Database State of the Union
DAT322 - Workshop: Stretching Scalability: Doing more
with Amazon Aurora
WIN203 - How Pitney Bowes is transforming their
business in the cloud
…. And many more!!