AWS to Bare Metal: Motivation, Pitfalls, and Results
AWS CLOUD TO BARE METAL
Wish saved 35% on MongoDB costs
Improved latency by 20%
And reduced latency variance
HI, I’M ADAM. (I’m a software engineer; I also run production…)
I WORK AT WISH. (we’re a mobile eCommerce platform)
I WORK AT WISH. (we also grow really fast…)
AWS TO BARE METAL
• The Why
• The Scope
• The Servers
• The Network
• The Operations
• The Results
THE THEME
The Why
In the beginning there was spinning disk EBS
EBS LATENCY SPIKE
• DB slows to a crawl
• App slows down
• Replica set detects failure
• Election kills the app for 30s
Provisioned IOPS EBS launches
Summer 2012
But - super expensive!
Maybe time for bare metal?
So we modeled the costs…
The Scope
?
The Servers
Server Specs?
GOAL
Find lowest cost per query
for your workload
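The goal above can be written as simple arithmetic. The sketch below is one possible way to compare hardware configs, assuming each config has a known monthly cost and a measured sustained throughput. The function names, the 30-day month, and the numbers are placeholders for illustration, not figures from the talk.

```javascript
// Seconds in a 30-day month; an assumption made only to keep the arithmetic simple.
const SECONDS_PER_MONTH = 30 * 24 * 60 * 60;

// Cost of one query, given a config's monthly cost and the highest
// throughput (queries/second) it sustains within the latency budget.
function costPerQuery(monthlyCost, sustainedQps) {
  if (sustainedQps <= 0) throw new Error("sustainedQps must be positive");
  return monthlyCost / (sustainedQps * SECONDS_PER_MONTH);
}

// Return the config with the lowest cost per query.
function cheapestConfig(configs) {
  if (configs.length === 0) throw new Error("no configs to compare");
  return configs.reduce((best, c) =>
    costPerQuery(c.monthlyCost, c.sustainedQps) <
    costPerQuery(best.monthlyCost, best.sustainedQps) ? c : best
  );
}

// Placeholder numbers, not measurements from the talk.
const candidates = [
  { name: "config-a", monthlyCost: 1000, sustainedQps: 2000 },
  { name: "config-b", monthlyCost: 1500, sustainedQps: 4000 },
];
console.log(cheapestConfig(candidates).name); // config-b
```

The more expensive config wins here because it sustains twice the throughput for 1.5 times the cost.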
THROUGHPUT & LATENCY
• Typically: more throughput → more latency
• Application dictates max latency (p95?)
• For each hardware config…
• Find highest throughput under max latency
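The last two bullets describe a search, which might look like the sketch below. It assumes a hypothetical `measure(qps)` callback that runs the workload at a given throughput and returns latency samples in milliseconds. It sweeps throughput levels upward and stops at the first breach rather than using binary search, since latency only "typically" rises with throughput.

```javascript
// Nearest-rank percentile over latency samples (p between 1 and 100).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Sweep throughput levels in ascending order and return the highest one
// whose p95 latency stays within maxLatencyMs, stopping at the first breach.
// `measure` is a hypothetical callback: qps -> array of latencies in ms.
function highestThroughputUnder(levels, maxLatencyMs, measure) {
  let best = null;
  for (const qps of [...levels].sort((a, b) => a - b)) {
    const p95 = percentile(measure(qps), 95);
    if (p95 > maxLatencyMs) break;
    best = qps;
  }
  return best; // null means even the lowest level was too slow
}
```

Running this once per hardware config gives the sustained throughput figure that a cost-per-query comparison needs.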
THE WORKLOAD
• db.setProfilingLevel(2)
• Snapshot the DB volume
• Dump system.profile after 1 hour
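A rough sketch of the capture step, written as functions that take the MongoDB shell's `db` handle. `setProfilingLevel`, `getCollection`, `find`, `sort`, and `toArray` are standard shell calls; the function names and the choice to read the ops back sorted by their `ts` timestamp are assumptions for illustration, and the talk does not say how the dump was actually taken.

```javascript
// `db` is the database handle provided by the MongoDB shell.

// Start recording: profiling level 2 logs every operation to system.profile.
function startCapture(db) {
  return db.setProfilingLevel(2);
}

// Stop recording and return the captured operations in the order they ran.
function stopCapture(db) {
  db.setProfilingLevel(0); // turn the profiler off again
  return db.getCollection("system.profile").find().sort({ ts: 1 }).toArray();
}
```

Note that `system.profile` is a capped collection, so an hour of level-2 profiling on a busy database may need a larger profile collection than the default to avoid losing early operations.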
OUR TOOL
• Restore the snapshot
• Clear filesystem caches
• Replay ops at configured throughput
• Report on latency / MongoDB stats
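The replay and report steps could be sketched as follows. This is not Wish's tool, only an illustration of the idea: ops are sent on a fixed schedule (open loop, so a slow op does not delay the ones after it) and the latencies are summarized. The `execute` callback, which would run one op against the restored database, is hypothetical.

```javascript
// Offsets (ms from the start of the run) at which each op is sent,
// so that ops go out at a steady opsPerSecond.
function sendOffsetsMs(opCount, opsPerSecond) {
  if (opsPerSecond <= 0) throw new Error("opsPerSecond must be positive");
  return Array.from({ length: opCount }, (_, i) => (i * 1000) / opsPerSecond);
}

// Nearest-rank latency summary for the report step.
function summarize(latenciesMs) {
  if (latenciesMs.length === 0) throw new Error("no latencies recorded");
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const at = (p) => sorted[Math.ceil((p * sorted.length) / 100) - 1];
  return { count: sorted.length, p50: at(50), p95: at(95), max: at(100) };
}

// Send every op at its scheduled time and time each one individually.
// `execute` is a hypothetical async callback that runs a single op.
async function replay(ops, opsPerSecond, execute) {
  const offsets = sendOffsetsMs(ops.length, opsPerSecond);
  const start = Date.now();
  const pending = ops.map(async (op, i) => {
    const wait = start + offsets[i] - Date.now();
    if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
    const sentAt = Date.now();
    await execute(op);
    return Date.now() - sentAt; // latency of this op in ms
  });
  return summarize(await Promise.all(pending));
}
```

A real tool would also restore the snapshot and clear filesystem caches before each run, as the bullets above note, so that every hardware config starts from the same cold state.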
LATEST SPECS
• 2x Ivy Bridge 3.3 GHz (32 hyperthreads)
• 256 GB RAM
• 3.2 TB LSI WarpDrive PCI-e
YOUR MILEAGE MAY VARY!
The Network
NETWORKS ARE WEIRD
• Network engineering is weird for software people
• Need to master a few, big pieces
• We wasted a lot of time improvising…
PLAN TO FAIL
• Every component and connection fails
• Switch dies?
• NIC dies?
• Switch ⟷ switch connection dies?
• DirectConnect dies?
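One way to ask these questions systematically is to model the network as a list of links and test every single failure. The sketch below is illustrative only: the topology and device names are invented, and it checks just one failure at a time between one server and one destination.

```javascript
// True if `to` can be reached from `from` over `links`, skipping `deadNode`.
function reachable(links, from, to, deadNode) {
  const seen = new Set([from]);
  const queue = [from];
  while (queue.length > 0) {
    const node = queue.shift();
    if (node === to) return true;
    for (const [a, b] of links) {
      if (a === deadNode || b === deadNode) continue;
      const next = a === node ? b : b === node ? a : null;
      if (next !== null && !seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return false;
}

// List every single device or connection whose loss cuts `from` off from `to`.
function singlePointsOfFailure(links, from, to) {
  const failures = [];
  for (const node of new Set(links.flat())) {
    if (node !== from && node !== to && !reachable(links, from, to, node)) {
      failures.push(`node ${node}`);
    }
  }
  links.forEach((link, i) => {
    const remaining = links.filter((_, j) => j !== i);
    if (!reachable(remaining, from, to, null)) {
      failures.push(`link ${link[0]} <-> ${link[1]}`);
    }
  });
  return failures;
}

// Invented topology: two NICs and two switches, but one DirectConnect.
const topology = [
  ["db1", "nic0"], ["db1", "nic1"],
  ["nic0", "switch-a"], ["nic1", "switch-b"],
  ["switch-a", "switch-b"],
  ["switch-a", "directconnect"], ["switch-b", "directconnect"],
  ["directconnect", "aws"],
];
console.log(singlePointsOfFailure(topology, "db1", "aws"));
```

In this invented layout the NICs and switches are redundant, so the check reports only the DirectConnect device and its link to AWS.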
The Operations
THE OPERATIONS
• Migration / Rollback
• Backups
• Processes
• Documentation
MIGRATION (PREP)
• Add new nodes to replica set
• hidden: true, priority: 0
• Wait for them to sync
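The prep step amounts to a replica set reconfiguration. A minimal sketch, assuming the config object comes from the shell's `rs.conf()` and the result is passed to `rs.reconfig()`; the helper name and host names are hypothetical.

```javascript
// Return a copy of a replica set config with the given hosts appended
// as hidden, priority-0 members. The input config is not modified.
function addHiddenMembers(config, hosts) {
  const nextId = Math.max(-1, ...config.members.map((m) => m._id)) + 1;
  const added = hosts.map((host, i) => ({
    _id: nextId + i,
    host,
    hidden: true, // invisible to application reads
    priority: 0,  // can never be elected primary
  }));
  return { ...config, members: [...config.members, ...added] };
}

// In the MongoDB shell this might be used as (host name is hypothetical):
//   rs.reconfig(addHiddenMembers(rs.conf(), ["colo-db1.example:27017"]));
```

Depending on the MongoDB version, a single reconfig may be limited to adding one voting member, in which case the hosts would be added one call at a time.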
MIGRATION (READ-ONLY)
• Unhide nodes:
• hidden: false, priority: 0
MIGRATION (READ-WRITE)
• Force primary into colo:
• hidden: false, priority: 2
MIGRATION (DONE)
• Hide old AWS nodes:
• hidden: true, priority: 0
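Each migration step above changes only `hidden` and `priority` on a set of members. A sketch of a helper that applies one step, again assuming the config comes from `rs.conf()` and goes back through `rs.reconfig()`; the helper, the step table, and the host lists are illustrative names, while the values are the ones on the slides.

```javascript
// hidden/priority values from the slides, keyed by migration step.
const MIGRATION_STEPS = {
  readOnly:  { hidden: false, priority: 0 }, // applied to the new colo nodes
  readWrite: { hidden: false, priority: 2 }, // applied to the new colo nodes
  done:      { hidden: true,  priority: 0 }, // applied to the old AWS nodes
};

// Return a copy of the config with `options` merged into the listed hosts.
function setMemberOptions(config, hosts, options) {
  const missing = hosts.filter((h) => !config.members.some((m) => m.host === h));
  if (missing.length > 0) throw new Error(`unknown hosts: ${missing.join(", ")}`);
  const targets = new Set(hosts);
  return {
    ...config,
    members: config.members.map((m) => (targets.has(m.host) ? { ...m, ...options } : m)),
  };
}

// In the MongoDB shell (coloHosts is a hypothetical array of host strings):
//   rs.reconfig(setMemberOptions(rs.conf(), coloHosts, MIGRATION_STEPS.readWrite));
```

Failing on unknown hosts is a deliberate choice: a typo in a host name should stop the step rather than silently leave a node unchanged.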
ROLLBACK
• No big deal
• Adjust hidden/priority to move traffic back
BACKUPS
• EBS snapshots rock!
• Hidden member in EC2 for backup
• Nice for DR too…
PROCESSES
• No RackServer() API
• Ensure consistency:
• Checklists
• Verification tools
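A verification tool can be as small as a comparison between what a checklist expects and what a server reports. The sketch below assumes the observed facts have already been collected into a plain object; the field names are invented, and the expected values echo the spec listed earlier in the deck.

```javascript
// Compare expected facts against observed ones and describe each mismatch.
// An empty result means the server passes the checklist.
function verifyServer(expected, observed) {
  return Object.keys(expected)
    .filter((key) => observed[key] !== expected[key])
    .map((key) => `${key}: expected ${expected[key]}, found ${observed[key]}`);
}

// Field names are hypothetical; values follow the "Latest Specs" slide.
const expectedSpec = { hyperthreads: 32, ramGb: 256 };
console.log(verifyServer(expectedSpec, { hyperthreads: 32, ramGb: 128 }));
```

Running the same check on every newly racked server is one way to get the consistency that an API would otherwise provide.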
DOCUMENTATION
• No DescribeInstances either…
• Consider life without AWS Management Console
• Worse: consider it being occasionally wrong
DOCUMENTATION
• Wiremaps
• Network maps (IPs, VLANs, etc)
• Equipment specs
• Serial numbers
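Documentation kept as data can be checked by a script, which helps with the "occasionally wrong" problem. As an illustration, the sketch below treats a wiremap as a list of cables, each with two `device:port` ends, and reports any port that appears on more than one cable. The record shape and names are assumptions.

```javascript
// Return every "device:port" end that appears on more than one cable,
// which usually means the wiremap has a documentation error.
function duplicatePorts(wiremap) {
  const counts = new Map();
  for (const { from, to } of wiremap) {
    for (const end of [from, to]) {
      counts.set(end, (counts.get(end) || 0) + 1);
    }
  }
  return [...counts].filter(([, n]) => n > 1).map(([end]) => end);
}

// Invented wiremap: switch-a port 1 is listed on two cables.
const wiremap = [
  { from: "db1:nic0", to: "switch-a:1" },
  { from: "db2:nic0", to: "switch-a:1" },
  { from: "db1:nic1", to: "switch-b:1" },
];
console.log(duplicatePorts(wiremap)); // [ 'switch-a:1' ]
```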
The Results
Big project - took about 6 months
Savings made it worthwhile
Bonus: it got faster!
Budget a lot of time for learning
Benchmark & validate your assumptions
Obsess over the details
Thanks!