AWS re:Invent 2016: Introduction to Managed Database Services on AWS (DAT307)
Transcript of AWS re:Invent 2016: Introduction to Managed Database Services on AWS (DAT307)
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steve Hunt - Director of Infrastructure, FanDuel
Alan Murray - Director of Architecture, FanDuel
Robin Spira - CTO, FanDuel
Darin Briskman, AWS Database Services
DAT307
Introduction to Managed
Database Services on AWS
AWS Database Landscape
Relational
Amazon Aurora
Amazon RDS
AWS Data Migration
Big Data & Integration
Amazon EMR
Amazon Redshift
AWS Data Pipeline
NoSQL & In-Memory
Amazon DynamoDB
Amazon ElastiCache
Analytics & Search
Amazon QuickSight
Amazon Elasticsearch Service
Amazon CloudSearch
Amazon Machine Learning
Amazon Kinesis, AWS Lambda, Amazon S3
Hosting your databases on-premises
Power, HVAC, net
Rack & stack
Server maintenance
OS patches
DB s/w patches
Database backups
Scaling
High availability
DB s/w installs
OS installation
App optimization
you
Hosting your databases in Amazon EC2
Power, HVAC, net
Rack & stack
Server maintenance
OS installation
OS patches
DB s/w patches
Database backups
Scaling
High availability
DB s/w installs
App optimization
you
Using an AWS Managed Database Service
App optimization
Power, HVAC, net
Rack & stack
Server maintenance
OS patches
DB s/w patches
Database backups
High availability
DB s/w installs
OS installation
Scaling
you
No infrastructure management
Scale up/down
Cost-effective
Instant provisioning
Application compatibility
Amazon Relational Database Service (Amazon RDS)
Amazon RDS for Aurora
• MySQL compatible with 5x better performance on the same
hardware: 100,000 writes/sec & 500,000 reads/sec
• Scalable with up to 64TB in single database, up to 15 read
replicas
• Highly available, durable, and fault-tolerant custom SSD
storage layer: 6 way replicated across 3 AZs
• Transparent encryption for data at rest using AWS KMS
• Stored procedures in Amazon Aurora can invoke AWS
Lambda functions
High availability with Aurora
• Aurora cluster contains primary
node and up to 15 secondary
nodes
• Failing database nodes are
automatically detected and
replaced
• Failing database processes are
automatically detected and recycled
• Secondary nodes automatically
promoted on persistent outage, no
single point of failure
• Customer application can scale out
read traffic across secondary nodes
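The read scale-out in the last bullet can be sketched as a round-robin over replica endpoints. This is a minimal illustration only: the hostnames are hypothetical, and in practice Aurora exposes a single cluster reader endpoint that balances connections across replicas for you.

```python
from itertools import cycle

# Hypothetical reader endpoints for an Aurora cluster with two secondaries.
# Real applications would normally use the single cluster reader endpoint
# that Aurora provides, rather than tracking replicas individually.
READERS = [
    "replica-1.cluster-ro.example.us-east-1.rds.amazonaws.com",
    "replica-2.cluster-ro.example.us-east-1.rds.amazonaws.com",
]

_reader_pool = cycle(READERS)

def next_reader() -> str:
    """Rotate read traffic across the secondary nodes; writes go to the primary."""
    return next(_reader_pool)

assert next_reader() == READERS[0]
assert next_reader() == READERS[1]
assert next_reader() == READERS[0]  # wraps around
```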
[Diagram: Aurora cluster with one primary node and secondary nodes replicated across AZ 1, AZ 2, and AZ 3]
PG&E: Aurora in Action
Servicing high-traffic surge during power events
had always been a problem.
Availability is critical when databases are down; it
adversely affects service to gas and electrical
customers.
Aurora benefits:
Ability to create multiple database replicas with
millisecond latency allows handling large surges in
traffic while giving customers timely, up-to-date
information during a power event.
Amazon Aurora’s 6-way replication, self-healing
storage and automatic instance repair provide the
availability and reliability needed for mission critical
applications.
One of the largest combination natural gas and electric utilities in the United States, with approximately 16 million customers in a 70,000-square-mile service area in northern and central California.
Database Conversion Capabilities with
AWS Database Migration Service
Source Database Target Database
Microsoft SQL Server Amazon Aurora, MySQL, PostgreSQL
MySQL PostgreSQL
Oracle Amazon Aurora, MySQL, PostgreSQL
Oracle Data Warehouse Amazon Redshift
PostgreSQL Amazon Aurora, MySQL
Teradata Amazon Redshift
Relational and NoSQL
Relational: optimized for storage, normalized/relational schema, ad hoc queries, scales vertically, mature solution
NoSQL: optimized for compute, denormalized/hierarchical data, instantiated views, scales horizontally, emerging technology
DynamoDB: Non-Relational Managed NoSQL
Database Service
• Schema-less data model
• Consistent low latency performance (single digit ms)
• Predictable provisioned throughput
• Seamless Scalability
• No storage limits
• High durability and availability: replication between 3 facilities
• Easy Administration – We scale for you!
• Low Cost
• Cost modelling on throughput and size
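The last bullet, cost modeled on throughput and size, can be sketched as a small estimator. The unit prices below are placeholders for illustration only, not real DynamoDB rates; check the current pricing page before using numbers like these.

```python
def monthly_cost(read_units, write_units, storage_gb,
                 read_price=0.00013, write_price=0.00065, gb_price=0.25):
    """Estimate a monthly DynamoDB bill from provisioned throughput and size.

    read_price / write_price are hypothetical per-unit-hour rates and
    gb_price a hypothetical per-GB-month rate, used purely to show the
    shape of the model: you pay for what you provision, plus storage.
    """
    hours = 730  # approximate hours in a month
    throughput = hours * (read_units * read_price + write_units * write_price)
    storage = storage_gb * gb_price
    return round(throughput + storage, 2)

estimate = monthly_cost(read_units=1000, write_units=200, storage_gb=50)
```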
DynamoDB Scalability
• No throughput limit
• No storage limit
• DynamoDB automatically partitions
data when:
• Data set grows
• Provisioned capacity increases
[Diagram: table split across partitions 1 .. N]
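The 2016-era DynamoDB documentation gave a rough rule of thumb for this automatic partitioning: a partition sustains about 3,000 read capacity units or 1,000 write capacity units and holds about 10 GB. A sketch of that estimate (illustrative only; the service manages partitioning internally and current behavior differs):

```python
import math

def estimated_partitions(rcu, wcu, table_size_gb):
    """Rough partition count per the 2016-era guidance:
    max(ceil(RCU/3000 + WCU/1000), ceil(size/10 GB))."""
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    by_size = math.ceil(table_size_gb / 10)
    return max(by_throughput, by_size)

# Throughput-bound table: 7,500 RCU + 500 WCU needs 3 partitions.
assert estimated_partitions(rcu=7500, wcu=500, table_size_gb=8) == 3
# Size-bound table: 55 GB needs 6 partitions regardless of low throughput.
assert estimated_partitions(rcu=1000, wcu=100, table_size_gb=55) == 6
```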
DynamoDB Durability
Writes: 3-way replication, quorum acknowledgment, persisted to disk (custom SSD)
Reads: strongly or eventually consistent, with no trade-off in latency
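The quorum-acknowledged writes above follow the classic quorum rule: with N replicas, W write acks and R read acks, a read is guaranteed to see the latest write whenever R + W > N. This toy model illustrates the rule only; DynamoDB's internal replication protocol is not documented in this detail.

```python
# With N=3 replicas, a write is acknowledged once a quorum of W=2 has it.
# A strongly consistent read consults R=2 replicas (R + W > N, so it must
# overlap the write quorum); an eventually consistent read uses R=1 and
# may briefly return stale data.
N, W = 3, 2
R_STRONG, R_EVENTUAL = 2, 1

def guarantees_latest(r, w, n=N):
    """True when a read quorum must overlap the most recent write quorum."""
    return r + w > n

assert guarantees_latest(R_STRONG, W) is True
assert guarantees_latest(R_EVENTUAL, W) is False
```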
Nexon Scales Mobile Gaming with Amazon DynamoDB
Nexon is a leading South Korean video game developer
and a pioneer in the world of interactive entertainment.
“By using AWS, we decreased our initial investment costs, and only pay for what we use.”
Chunghoon Ryu
Department Manager, Nexon
• Nexon uses Amazon DynamoDB as its
primary game database for the blockbuster
mobile game, Heroes of Incredible Tales
• HIT became the #1 Mobile Game in Korea
within the first day of launch and has > 2M
registered users
• Nexon’s HIT leverages DynamoDB to
deliver steady latency of less than 10ms to
deliver a fantastic mobile gaming
experience for 170,000 concurrent players
In-Memory Key-Value Store
High-performance
Redis and Memcached
Fully managed; Zero admin
Highly Available and Reliable
Hardened by Amazon
Redis – the fast in-memory database
Powerful: ~200 commands + Lua scripting
In-memory database
Utility data structures: strings, lists, hashes, sets, sorted sets, bitmaps & HyperLogLogs
Simple
Atomic operations: supports transactions, has ACID properties
Ridiculously fast! <500 microsecond latency for most commands
Highly available: replication
Persistent: snapshots
Open Source
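Of the data structures listed above, sorted sets are the one most often shown off, typically for leaderboards. The sketch below is a minimal pure-Python model of the sorted-set semantics (real code would issue ZADD / ZREVRANGE through a client such as redis-py; the member names here are made up):

```python
# Toy model of a Redis sorted set ("zset"): each member has a score,
# and range queries return members ordered by score.
scores = {}  # member -> score

def zadd(member, score):
    """Like ZADD: insert or update a member's score."""
    scores[member] = score

def zrevrange(start, stop):
    """Like ZREVRANGE: members by descending score, inclusive indices."""
    ranked = sorted(scores, key=lambda m: scores[m], reverse=True)
    return ranked[start:stop + 1]

zadd("alice", 120)
zadd("bob", 300)
zadd("carol", 210)
assert zrevrange(0, 1) == ["bob", "carol"]  # top two of the leaderboard
```
In Redis itself these operations are atomic and O(log N) per update, which is why a zset holds up under leaderboard-style write bursts.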
Expedia’s Real-time Analytics Application Uses Amazon ElastiCache
Expedia is a leader in the $1 trillion travel industry, with an extensive portfolio
that includes some of the world’s most trusted travel brands.
“With ElastiCache Redis as caching layer, the write throughput on DynamoDB has been set to 3500, down from 35000, reducing the cost by 6x.”
Kuldeep Chowhan
Engineering Manager, Expedia
• Expedia’s real-time analytics application
collects data for its “test & learn”
experiments on Expedia sites.
• The analytics application processes ~200
million messages daily.
• Bursts of write data were causing throttling,
requiring high write provisioning
• Implementing ElastiCache eliminated
redundant writes, allowing a 90% reduction
in provisioned write capacity
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper
The Amazon Redshift view of data warehousing
Enterprise: 10x cheaper, easy to provision, higher DBA productivity
Big data: 10x faster, no programming, easily leverage BI tools, Hadoop, machine learning, streaming
SaaS: analysis inline with process flows, pay as you go and grow as you need, managed availability and disaster recovery
Distributed search and analytics engine
Managed service using Elasticsearch and Kibana
Fully managed; Zero admin
Highly Available and Reliable
Tightly integrated with other AWS services
Case Study: McGraw Hill Education
Over 100 million learning events each month
• Tests, quizzes, learning modules begun / completed / abandoned
Supporting a wide catalog across multiple services in multiple jurisdictions
Analyzing student test results, student/teacher interaction, teacher effectiveness, student progress
Integrating analytics of applications and infrastructure to understand operations in real time
Amazon EMR
• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Storage with S3, HDFS, or MapR
• Leverage the elasticity of the cloud
• Baked in security features
• Pay by the hour and save with Spot
• Flexibility to customize
Amgen Using EMR and Cloud Capacity
#SPORTSRICH
Managing Data with AWS
AWS re:Invent 2016
Robin Spira
CTO
Alan Murray
Director, Architecture
Steve Hunt
Director, Infrastructure
● Founders launched HubDub in 2007
● Pivoted to FanDuel at SXSW in 2009
● FanDuel defines and refines Daily Fantasy Sports
● Mobile apps launched in 2014
● 6 million registered users by end of 2015
● 4 million app downloads by end of 2015
● Mobile apps become primary interface in 2016
● Friends Mode launched in 2016
A BRIEF HISTORY OF [FANDUEL] TIME
DAILY FANTASY SPORTS ROUND UP
● One-night-stand Fantasy Sports
● Variety of Sports
• NFL, NBA, NHL, MLB
• English and European Soccer
● Salary cap format
• Every athlete has an assigned value
• Skill is picking the best team within budget
● Range of contests:
• Head-to-Heads, 50/50, Multipliers, Beginner contests
• Million dollar prize pool Tournaments
● Friends Mode
• Play against friends all season for cash or bragging rights
THE EVOLUTIONARY BLUEPRINT
● Year-on-year growth in terms of:
• Active users
• Deposit value & frequency
• Revenue
• Brand awareness
● Over 6m users, 4m app downloads
BUSINESS GROWTH 1x
PLATFORM INCEPTION - 2009 1x
● Founder code running on terrestrial, co-lo LAMP stack
○ Monolithic PHP codebase
○ Single, monolithic MySQL instance
○ Schema not designed for data at scale
● Organic, time-pressured growth
○ In-the-moment decision making
○ Limited code re-factoring
○ Reliance on database layer to provide scaling
■ Server running at 90%+ of compute capacity on any given NFL Sunday
○ Co-location not easily scalable
● Summary: Scaling was a real problem
PLATFORM EVOLUTION - 2010 5x
● Migration to cloud provider - non-AWS
● Scalability limitations
○ Infrastructure
○ Application
○ Network
● Application layer changes made to enable horizontal scaling
● Monolithic database still there!
PLATFORM EVOLUTION - 2011 25x
● Migration to Amazon Web Services
○ Scalability limitations still present
○ Brown-outs common on NFL Sundays
● Monolithic database still there!
○ Self-managed MySQL installation on EC2
○ Single master
○ Multiple read replicas, statement-based replication
■ Relatively clean separation of read and write concerns
● Drawbacks:
○ Extra master compute load
○ Replication lag
○ Addition of read replicas time consuming and error prone
PLATFORM EVOLUTION - 2012 125x
● Move to Service-Oriented Architecture
○ Creation of Java service components with their own data stores
○ Message queues used for decoupling
● Started to use Puppet
● Elastic-all-the-things!
○ ELBs examined in earnest for stateless horizontal scalability
○ Adoption of ASGs to fulfil scaling requirements
○ Immediate benefits:
■ Fault tolerance & Service availability
■ Cost management
● But: Monolithic database still there!
○ Provisioned IOPs - peaking at 8,000 at one point
PLATFORM EVOLUTION - 2013 625x
● Business volumes now 625x
● REST-based API layer created to underpin iOS mobile app
○ Time pressure, little consideration given to data provisioning
○ Continued work on back-end services
● The great NFL blackout: Service down on the opening day!
○ Cascade failure: High contention = table lock catastrophe
○ Significant business impact
● Immediate work undertaken to improve scalability and DR
position
○ Migration to active / passive multi-master configuration
■ Higher availability but still below what was possible
■ Better data backup option
● All because: Monolithic database still there!
PLATFORM EVOLUTION - 2014 3,125x
● Business volumes now 3,125x
● Database made it through the season!
● Core internal API service launched
• Web re-platform to use new API
● iOS mobile app released
● Service degradation during NFL week 2
• redis backing store failure
● Monolithic database still there!
PLATFORM EVOLUTION - 2015 15,625x
● Massive acquisition campaign
○ Business volumes now very high
○ Paying out over $2 billion in winnings
● Formation of a dedicated architecture function
● Wide adoption of Infrastructure as code
○ CloudFormation provisioning via SparkleFormation
● More service-specific data stores
● Migration from multi-master, self-hosted MySQL on EC2 to RDS for
MySQL
○ No more manual management of replication and backups
○ Far easier to create and provision read replicas
○ No more hand-crafted EC2s or complex cluster management
2016: WHERE IS OUR PLATFORM NOW? 16,000x
● Very reliable
● Scalable with significant headroom
● Far better recovery position
○ Fewer service degradations
● Focus on broad-endpoint performance testing
○ Significant operational headroom
● Focus on tech debt reduction
● Focus on architectural innovation
● Still… that legacy monolithic database
○ Why not use DynamoDB?
THE DATA MANAGEMENT PROBLEM
[Chart: weekly traffic pattern from Thursday through Sunday]
● Operational model - service must be available!
● Seasonal - mostly predictable
○ Traffic spikes follow real sporting fixtures
○ Traffic goes from near zero to EVERYTHING NOW!
● Platform must be scaled in advance of game start
○ Users can’t try again later
○ Legal obligations
○ Any service degradation == greatly degraded business
ENTRY & EDIT
● Hard stop at game start
○ Entry value - substantial sums of money involved
○ We cannot overfill contests
● ACID compliance is very important
○ We perform atomic, distributed transactions
○ CAP constraints - we compromise Availability to maintain data consistency
● Remember that monolithic database?
○ During our peak in 2015:
■ ~ 1,000 INSERTs per second
■ ~ 200 UPDATEs per second
● Replication lag could still be a problem
○ Grace period between game lock and game start
[Chart: modifications per second — entries and lineups]
LIVE SCORING
● Requires a consistent view of the data - immediately!
● Starts immediately after the massive entry / edit peak
● Not strictly a critical path activity
○ Our users disagreed!
● Live Scoring has two scalability concerns
○ Back end load - maintain scoring state
○ Front end load - delivery to users
[Charts: contests scored per second; scoring views per second]
SETTLEMENT
● Tally up scores and pay out prizes!
● Involves aggregation of huge amounts of data
○ For just one slate of fixtures in NFL 2015:
■ ~ 10 million SELECTs
■ ~ 55 million INSERTs / UPDATEs
○ Constrained execution; limited time window
● On self-managed EC2 running MySQL, after tuning: ~ 5 hour execution time
● After the move to RDS for MySQL: ~ 1 hour 20 minutes execution time
WHAT DID WE LEARN ON THIS JOURNEY?
● Number one data concern: Consistency, Consistency, Consistency!
● Replication lag became more of an issue as business volume grew
● Maintenance of the multi-master cluster and other database services
○ Failover & Recovery
■ Would love RPO = 0; RTO = ASAP
○ Manual and prone to human error
○ Time consuming
THAT FIRST STEP: SELF-HOSTED TO RDS FOR MySQL
● Move from self-hosted to RDS for MySQL
○ Couldn’t adopt Amazon Aurora; still in Beta
● Greatly reduced levels of undifferentiated heavy lifting!
○ Automated failover!
○ Better RPO and RTO
○ Common interface / interaction
○ Replication lag greatly reduced
● My colleague Steve will share his migration experiences shortly
THE SECOND STEP: TESTING AMAZON AURORA
● FanDuel tested Aurora soon after Beta release
● Found a handful of minor teething issues
○ Data import was initially slow; migration tools have since been improved
○ Odd edge cases - disparate MTU from EC2 instances
● After minor teething issues fixed - everything was great!
OUR FIRST AURORA PRODUCTION EXPERIENCE
● Our first Aurora production service - ACTION LOG
○ Append only log of EVERY user action
○ Fairly insane INSERT throughput required; rarely read from
○ Used by customer service; to verify user activity
○ Needs to be consistent - usually at time critical points!
● Benefits
○ READ-AFTER-WRITE consistency - Goodbye replication lag!
○ Worst case, we have seen 3x - 5x performance uplift
○ Given us significant headroom
○ All Java core service persistence layers now Aurora
SELF-HOSTED REDIS TO ELASTICACHE
● Our Live Scoring service is backed by redis
● We have a number of redis-backed service components
○ WeatherCache, PlayerCache, User tokens, Global Session...
○ Up to 2015, these were self-managed on EC2
● We moved our redis-backed service components to ElastiCache
● Benefits - no manual intervention
○ RTO & RPO improvements
○ Managed master failover; quick promotion time
○ Operational simplicity, uniform interfaces and tooling
● Using AWS tooling
• Step 1. Configure DMS
• Step 2. ??
• Step 3. Profit!
HOW DID WE RUN THE MIGRATION EVENTS?
[Diagram: applications write to the active source, which replicates to the passive target]
● Demote the old, promote the new
master
● Setup replication
• Optionally set up active/passive replication
● Copy the data
• mysqldump --master-data=(1|2)
● Check consistency
● Build and warmup the target architecture
● Switch the application
● With zero data loss and minimal downtime
● Using familiar steps and tooling
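The "copy the data" step above hinges on one mysqldump flag: --master-data records the source's binary log coordinates in the dump, so the target can start replicating from exactly that point. A sketch of how that command might be assembled (the hostname and database name are placeholders):

```python
def dump_command(host, databases, master_data=2):
    """Build the mysqldump invocation for the copy step of the cutover.

    master_data=1 writes an active CHANGE MASTER statement into the dump;
    master_data=2 writes it commented out, for manual replication setup.
    """
    return [
        "mysqldump",
        f"--host={host}",
        f"--master-data={master_data}",
        "--single-transaction",  # consistent InnoDB snapshot without locking
        "--databases", *databases,
    ]

cmd = dump_command("source-db.example.internal", ["fanduel"])
assert "--master-data=2" in cmd
```
After the dump is loaded on the target, the recorded coordinates feed the "setup replication" step, which is what makes the final application switch possible with zero data loss.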
[Diagram: after cutover, applications point to the now-active target; the old source becomes passive]
SUMMARY
● Benefits of moving to AWS Managed Database Services
○ One more time: Automated failover!
○ One more time: Managed backups!
○ And no more replication lag!
○ Common interfaces and tooling - Simplicity is GOOD!
○ Operational cost reduction
● Acknowledgements
○ FanDuel Engineering - this team is awesome!
○ #ENGINEERINGRICH
● And in all of this, let’s not forget Amazon Redshift!
So … How Do I Choose?
It depends! The Four Key Attributes
1. Volume
2. Velocity
3. Variety
4. Vicinity
Decision Factor: Volume
How much data do you need online?
• ElastiCache for Redis < 3.5TB
• RDS SQL Server < 4TB
• RDS MySQL, MariaDB, PostgreSQL < 6TB
• RDS Aurora < 64TB
• DynamoDB (no limit)
• Elasticsearch < 150TB
• Redshift < 2PB
• EMR (no limit)
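The volume thresholds above can be turned into a simple first-pass filter. These are the 2016 limits as stated on the slide (current service limits differ), and the function is only a sketch of the elimination step, not a full decision tool:

```python
# Size ceilings from the slide, in TB; None means no fixed limit.
LIMITS_TB = {
    "ElastiCache for Redis": 3.5,
    "RDS SQL Server": 4,
    "RDS MySQL/MariaDB/PostgreSQL": 6,
    "RDS Aurora": 64,
    "DynamoDB": None,
    "Elasticsearch": 150,
    "Redshift": 2048,  # 2 PB
    "EMR": None,
}

def candidates(data_tb):
    """Services whose size ceiling covers the data set you need online."""
    return [svc for svc, cap in LIMITS_TB.items()
            if cap is None or data_tb <= cap]

assert "RDS Aurora" in candidates(10)
assert "RDS SQL Server" not in candidates(10)
```
Volume only narrows the field; the remaining factors (velocity, variety, vicinity) pick among the survivors.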
Decision Factor: Velocity
How fast does your data need to move?
• ElastiCache for Redis = µs to ms
• RDS = ms to seconds
• DynamoDB = ms (at any scale)
• Redshift = seconds to minutes
• Elasticsearch = seconds to minutes
• EMR = minutes to hours
Decision Factor: Variety
How big is the range of data types?
• RDS and Redshift are relational, so schema is enforced by
the database
• DynamoDB and ElastiCache are NoSQL, so schema is fluid
• Elasticsearch is a search engine, so it can autodetect
schemas
Decision Factor: Vicinity
What other applications and tools need to be near the
data?
• Do applications set certain database requirements?
• Are there compliance requirements to meet?
The Elephant Hiding in the Room
Source: Vikram Gupchup, CC by-sa, Dec 2011
Another Example: Asynchronous Online Gaming
• Highly available
• Elastic scalability
• Millions of players
Thank you!
Remember to complete
your evaluations!
Related Sessions
ARC311 - Evolving a Responsive and Resilient Architecture to Analyze Billions of Metrics
ARC403 - Building a Massive Microservices Gaming Platform Worthy of the Game of Thrones
ARC404 - Migrating a Highly Available and Scalable Database from Oracle to Amazon DynamoDB (Just Added!)
BDA304 - What’s New with Amazon Redshift
BDM302 - Real-Time Data Exploration and Analytics with
Amazon Elasticsearch Service and Kibana
BDM402 - Best Practices for Data Warehousing with
Amazon Redshift
DAT203 - Getting Started with Amazon Aurora
DAT301 - Amazon Aurora Best Practices: Getting the
Best Out of Your Databases
DAT302 - Best Practices for Migrating from Commercial
Database Engines to Amazon Aurora or
PostgreSQL
DAT303 - Deep Dive on Amazon Aurora
DAT304 - Deep Dive on Amazon DynamoDB
DAT305 - Deep Dive on Amazon Relational Database
Service
DAT306 - ElastiCache Deep Dive: Best Practices and
Usage Patterns
DAT314 - How Woot Migrated from MongoDB to Fully
Managed, Secure, and Scalable AWS Services
DAT318 - Migrating from RDBMS to NoSQL: How Sony
Moved from MySQL to Amazon DynamoDB
DAT320 - AWS Database State of the Union
DAT322 - Workshop: Stretching Scalability: Doing more
with Amazon Aurora
WIN203 - How Pitney Bowes is transforming their
business in the cloud
…. And many more!!