Created Date: 4/24/2020 10:53:40 AM
Getting 100B Metrics to Disk
![Page 1: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/1.jpg)
GETTING 100B METRICS TO DISK
Jonathan Thurman, Site Reliability Engineer, @jthurman42
194B
http://www.flickr.com/photos/meteopassione/9157134653/
![Page 2: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/2.jpg)
NEW RELIC
• Performance Monitoring
• Web Apps
• Mobile Apps
• Servers
• Databases, Caches & More…
• Software Analytics
![Page 3: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/3.jpg)
OKAY, YOU COLLECT DATA
• 194 Billion Metrics
• 100,000 req/sec
• 2 Gbps Inbound
• 216 Terabytes
• All backed by MySQL
http://www.flickr.com/photos/bobsfever/6658919861/
![Page 4: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/4.jpg)
HOW WE GOT HERE
http://www.flickr.com/photos/auvet/853157494/
![Page 5: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/5.jpg)
BUILDING BLOCKS
• Hosted Environment
• Xen Virtual Machines
• Data storage
• ATA over Ethernet
• SATA drives
• MySQL 5.0
• Single Ruby on Rails Application
http://www.flickr.com/photos/riekhavoc/4648423297/
![Page 6: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/6.jpg)
SHARDING FROM INCEPTION
• Account Information
  • Read heavy
  • Single HA Instance
• Agent Data
  • Write heavy
  • 8 shards based on AccountId
http://www.flickr.com/photos/erikb/48221952/
![Page 7: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/7.jpg)
TALE OF TWO MODELS
• Ruby on Rails
• class ShardData < ActiveRecord::Base
• Look up shard for Account
• Override ConnectionHandler
http://www.flickr.com/photos/jungle_boy/140279885/
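The "look up shard for Account" step can be illustrated outside Rails. The following is a minimal Python sketch, not the ActiveRecord `ConnectionHandler` override the slide refers to; `ShardRouter`, the shard names, and the host names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ShardRouter:
    """Maps an account to the database shard that holds its agent data."""

    # Hypothetical lookup table: account_id -> shard name.
    account_to_shard: Dict[int, str] = field(default_factory=dict)
    # Hypothetical connection targets per shard (host names are made up).
    shard_hosts: Dict[str, str] = field(default_factory=dict)

    def shard_for(self, account_id: int) -> str:
        try:
            return self.account_to_shard[account_id]
        except KeyError:
            raise LookupError(f"no shard assigned to account {account_id}") from None

    def host_for(self, account_id: int) -> str:
        """Resolve the host a ShardData-style model would connect to."""
        return self.shard_hosts[self.shard_for(account_id)]


router = ShardRouter(
    account_to_shard={72: "shard3", 1001: "shard9"},
    shard_hosts={"shard3": "db-shard3.example", "shard9": "db-shard9.example"},
)
host = router.host_for(72)
print(host)
```

Account-level models would keep using the single HA instance; only shard-backed models would go through a router like this.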
![Page 8: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/8.jpg)
![Page 9: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/9.jpg)
TRIBBLES TABLES
• Metric table name contains
  • AccountID
  • Year and Julian Day
  • Resolution
• ts_72_13221_1h
• Currently ~200k tables per DB
http://www.flickr.com/photos/15942690@N00/4571141076/
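One plausible reading of the example name `ts_72_13221_1h` is account 72, two-digit year 13 followed by day-of-year 221, and a one-hour resolution. The field order is inferred from the bullet list, not confirmed by the slides, and the function name below is hypothetical. Under that assumption a name can be built like this:

```python
from datetime import date


def timeslice_table(account_id: int, day: date, resolution: str) -> str:
    """Build a per-account, per-day table name (assumed layout, see above)."""
    # %y is the two-digit year, %j the zero-padded day of the year.
    return f"ts_{account_id}_{day:%y%j}_{resolution}"


name = timeslice_table(72, date(2013, 8, 9), "1h")
print(name)
```

9 August 2013 is day 221 of the year, so this reproduces the name shown on the slide.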
![Page 10: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/10.jpg)
BINGE AND PURGE
• Purging data
• DELETE FROM …
• DROP TABLE …
• innodb_file_per_table
• innodb_lazy_drop_table (pre 5.5.30-30.2)
http://www.flickr.com/photos/exalthim/2261294871/
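With one table per account per day, expiring old data can be a `DROP TABLE` per table rather than a row-by-row `DELETE`. A minimal sketch of choosing which tables to drop, assuming names follow a `ts_<account>_<yy><day-of-year>_<resolution>` layout (an inference from the example table name in the slides); the helper names and the cutoff are hypothetical:

```python
from datetime import date, datetime
from typing import Iterable, List


def table_day(table: str) -> date:
    """Extract the day encoded in a name such as ts_72_13221_1h."""
    _prefix, _account, yyddd, _resolution = table.split("_")
    return datetime.strptime(yyddd, "%y%j").date()


def drop_statements(tables: Iterable[str], cutoff: date) -> List[str]:
    """Return DROP TABLE statements for every table older than the cutoff."""
    return [f"DROP TABLE `{t}`;" for t in sorted(tables) if table_day(t) < cutoff]


stmts = drop_statements(["ts_72_13221_1h", "ts_72_13100_1m"], cutoff=date(2013, 8, 1))
print(stmts)
```

The sketch only generates statements; how they are scheduled and executed is left out.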
![Page 11: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/11.jpg)
http://www.flickr.com/photos/davidmonro/8331755849/
http://www.flickr.com/photos/heliocentric/1571127347/
http://www.flickr.com/photos/aigle_dore/6225535459/
![Page 12: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/12.jpg)
GROWING PAINS
http://www.flickr.com/photos/aigle_dore/5626285743/
![Page 13: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/13.jpg)
MULTIPLE POINTS OF FAILURE
• Single shard slows down
• App servers wait for response
• DB connection pool becomes full
• Site goes down
http://www.flickr.com/photos/boston_public_library/8204384670/
![Page 14: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/14.jpg)
SHARD GUARD
• Monitor all databases
• Identify shard status:
  • Bad? Mark as “wedged”
  • Good? Clear “wedged” flag
• ShardData checks status!
http://www.flickr.com/photos/mac_filko/5486980804/
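The wedged-flag idea can be sketched as a small circuit breaker: a monitor probes each shard, and callers check the flag before touching the database so a slow shard cannot fill the connection pool. This is an illustrative Python sketch, not New Relic's implementation; the class names and the probe callables are hypothetical.

```python
from typing import Callable, Dict, Set


class ShardWedged(RuntimeError):
    """Raised instead of waiting on a shard that is known to be unhealthy."""


class ShardGuard:
    def __init__(self, probes: Dict[str, Callable[[], bool]]) -> None:
        # probes: shard name -> health check returning True when the shard is good.
        self.probes = probes
        self.wedged: Set[str] = set()

    def check_all(self) -> None:
        """Run every probe once and update the wedged flags."""
        for shard, probe in self.probes.items():
            try:
                healthy = probe()
            except Exception:
                healthy = False  # a probe that errors counts as a bad shard
            if healthy:
                self.wedged.discard(shard)
            else:
                self.wedged.add(shard)

    def require_healthy(self, shard: str) -> None:
        """Fail fast so callers do not tie up a pooled connection."""
        if shard in self.wedged:
            raise ShardWedged(f"{shard} is marked wedged")


status = {"shard1": True, "shard2": False}
guard = ShardGuard({name: (lambda n=name: status[n]) for name in status})
guard.check_all()
wedged_first = set(guard.wedged)   # shard2 is flagged
status["shard2"] = True            # the shard recovers
guard.check_all()
wedged_second = set(guard.wedged)  # flag cleared again
```

A data-access layer would call `require_healthy` before each query and degrade the affected accounts rather than the whole site.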
![Page 15: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/15.jpg)
STABILITY AND PERFORMANCE
• Degraded performance
• New Accounts => Shard 9!
• Old accounts remain as-is
http://www.flickr.com/photos/ejpphoto/7823027272/
![Page 16: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/16.jpg)
DATA COLLECTION
• Rails isn’t great for data collection
• Ruby isn’t great either…
• Rewritten in Java using Jetty
http://www.flickr.com/photos/autograt/224540606/
![Page 17: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/17.jpg)
CACHE IS KING
• Buffered, not queued
• RAM is cheaper than I/O
• Get creative with batch processing
http://www.flickr.com/photos/epsos/8474532085/
![Page 18: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/18.jpg)
INSERT INTO (SELECT …
• Select rows and re-process
• Cache last hour in Java’s Heap
• Write a journal and post-process it
http://www.flickr.com/photos/esoteric_13/4741001804/
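Of the three options listed, caching the last hour in memory avoids reading one-minute rows back from disk to build the hourly rollup. The sketch below shows the idea in Python rather than on the JVM heap; it tracks only a count and a sum per metric and hour, which is an assumption, and `HourlyBuffer` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # (metric_id, hour start as epoch seconds)


class HourlyBuffer:
    """Accumulates one-minute samples in RAM and emits hourly rollups."""

    def __init__(self) -> None:
        # Each bucket holds [sample count, running sum].
        self._totals: Dict[Key, List[float]] = defaultdict(lambda: [0, 0.0])

    def add(self, metric_id: int, epoch_seconds: int, value: float) -> None:
        hour = epoch_seconds - epoch_seconds % 3600
        bucket = self._totals[(metric_id, hour)]
        bucket[0] += 1
        bucket[1] += value

    def flush_before(self, hour_start: int) -> List[tuple]:
        """Pop completed hours as (metric_id, hour, count, total), in key order."""
        done = sorted(k for k in self._totals if k[1] < hour_start)
        return [(m, h, *self._totals.pop((m, h))) for m, h in done]


buf = HourlyBuffer()
buf.add(1, 3600, 2.0)
buf.add(1, 3660, 4.0)
buf.add(1, 7200, 1.0)          # belongs to the next hour, stays buffered
rows = buf.flush_before(7200)  # ready to be written as one batch
print(rows)
```

Flushing in sorted key order also fits the sequential-insert pattern described on the following slides.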
![Page 19: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/19.jpg)
READ / WRITE PROBLEM
• Sequential Inserts
• Batched in 5k chunks
• Optimize for Throughput
• Must complete < 1 minute
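A minimal sketch of sequential inserts batched in chunks of 5,000, using Python's built-in `sqlite3` as a stand-in for MySQL so the example runs anywhere. The table layout and function names are hypothetical; only the batch size comes from the slide.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, int, float]  # (metric_id, ts, value)


def chunks(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_batched(conn: sqlite3.Connection, rows: Iterable[Row], batch_size: int = 5000) -> int:
    """Insert rows one batch per transaction; return the number of batches."""
    batches = 0
    for batch in chunks(rows, batch_size):
        with conn:  # commits on success, rolls back on error
            conn.executemany("INSERT INTO ts (metric_id, ts, value) VALUES (?, ?, ?)", batch)
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ts (metric_id INTEGER, ts INTEGER, value REAL)")
# Sort by key first so rows arrive in index order (sequential inserts).
rows = sorted((m, t, 1.0) for m in range(3) for t in range(4000))
batches = insert_batched(conn, rows)
total = conn.execute("SELECT COUNT(*) FROM ts").fetchone()[0]
print(batches, total)
```

Batching trades per-row latency for throughput, which matches the stated goal of finishing each cycle in under a minute.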
![Page 20: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/20.jpg)
READ / WRITE PROBLEM
• Scattered Reads
• Optimized for Latency
• Unique Covering Indexes
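A covering index contains every column a query needs, so the read can be answered from the index alone. The sketch below demonstrates this with Python's built-in `sqlite3` as a stand-in; the schema and index name are hypothetical, and MySQL/InnoDB reports the same situation differently (as "Using index" in `EXPLAIN`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ts (metric_id INTEGER, ts INTEGER, value REAL)")
# Unique index that also carries the value column, so reads never touch the table.
conn.execute("CREATE UNIQUE INDEX ts_cover ON ts (metric_id, ts, value)")
conn.executemany(
    "INSERT INTO ts VALUES (?, ?, ?)",
    [(m, t, float(t)) for m in range(3) for t in range(100)],
)

query = "SELECT ts, value FROM ts WHERE metric_id = ? AND ts BETWEEN ? AND ?"
# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (1, 10, 20)))
result = conn.execute(query, (1, 10, 20)).fetchall()
print(plan)
```

The printed plan should mention a covering index, confirming that the scattered read is served without a second lookup into the table.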
![Page 21: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/21.jpg)
MOVE TO HARDWARE
• Instant performance!
• Just add…
  • Datacenter - Chicago, US
  • Servers - Dell
  • Storage - Direct Attached
  • Time - About 6 months
http://www.flickr.com/photos/zebble/9621007/
![Page 22: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/22.jpg)
SPINNING RUST
• Dell MD1200 shelves
• 8 Disks per shelf
• RAID 5 virtual disk
• Dedicated Hot-spare
http://www.flickr.com/photos/walkn/5472536812/
![Page 23: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/23.jpg)
THE GREAT EXPANSE
• MD1200s support 12 disks
• Add four more!
• Online RAID expansion
http://www.flickr.com/photos/aigle_dore/5853807037/
![Page 24: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/24.jpg)
#FAIL
• “On-line” expansion, not so much
• Added second 4 disk RAID 5
• LVM Concatenation for space
http://www.flickr.com/photos/fireflythegreat/2845637227/
![Page 25: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/25.jpg)
NEED MORE CAPACITY
• Tight on disk space
• Performance not an issue
• New Accounts => Shard 10!
• Old Accounts as-is
http://www.flickr.com/photos/seandreilinger/6289721616/
![Page 26: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/26.jpg)
![Page 27: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/27.jpg)
SHARD PITFALLS
http://www.flickr.com/photos/21206761@N00/469110140/
![Page 28: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/28.jpg)
MIGRATION PROBLEM
• Accounts cannot move
• Not all tables have the shard key
• Rails defaults to auto-increment IDs
• Massive primary key collisions
• Punt and move the metrics
http://www.flickr.com/photos/tzafrir/125380911/
![Page 29: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/29.jpg)
BREAKING UP IS HARD TO DO
• Agent Databases
  • Metadata / Notes / Errors
• Timeslice Databases
  • Time-series metric data
  • 1 Minute and 1 Hour resolution
http://www.flickr.com/photos/rsepulveda/4275236049/
![Page 30: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/30.jpg)
![Page 31: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/31.jpg)
RESOURCE POOLS
• Distributed by Shard Key
• Distribution can CHANGE
• Lookup table, not hash
• Data can be MOVED
http://www.flickr.com/photos/dclark3996/4971906528/
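The difference between a lookup table and a hash shows up when capacity changes. In the sketch below, a plain modulo hash is used only as the contrast case, and `PoolDirectory` with its pool names is a hypothetical stand-in for whatever lookup store is actually used.

```python
from typing import Dict


def hashed_pool(account_id: int, pool_count: int) -> int:
    """Contrast case: placement is fixed by arithmetic on the shard key."""
    return account_id % pool_count


class PoolDirectory:
    """Lookup-table placement: each account's pool is stored, so it can change."""

    def __init__(self, default_pool: str) -> None:
        self.default_pool = default_pool  # where new accounts land
        self._pools: Dict[int, str] = {}

    def assign_new(self, account_id: int) -> str:
        return self._pools.setdefault(account_id, self.default_pool)

    def pool_for(self, account_id: int) -> str:
        return self._pools[account_id]

    def move(self, account_id: int, new_pool: str) -> None:
        """Repoint one account after its data has been copied to new_pool."""
        if account_id not in self._pools:
            raise KeyError(account_id)
        self._pools[account_id] = new_pool


directory = PoolDirectory(default_pool="pool10")
directory.assign_new(72)
directory.move(72, "pool3")  # only this account is affected

# With modulo hashing, growing from 8 to 9 pools relocates most accounts.
moved = sum(1 for a in range(1000) if hashed_pool(a, 8) != hashed_pool(a, 9))
print(directory.pool_for(72), moved)
```

With the directory, adding a pool changes nothing for existing accounts until someone explicitly moves them.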
![Page 32: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/32.jpg)
BACKUPS
• Custom mysqldump wrapper
• Based on business need
• Backup per table
• Ignore tables to be purged
http://www.flickr.com/photos/usdagov/6896218334/
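A per-table backup wrapper might build one `mysqldump` invocation per table and skip anything already scheduled for purging. This is a sketch under assumptions: the database name, table names, and the choice of `--single-transaction` are illustrative, and the slides do not show the real wrapper.

```python
from typing import Iterable, List, Set


def backup_commands(database: str, tables: Iterable[str], purge_pending: Set[str]) -> List[List[str]]:
    """Return one mysqldump argv per table that is worth backing up."""
    return [
        # mysqldump accepts a database name followed by table names.
        ["mysqldump", "--single-transaction", database, table]
        for table in sorted(tables)
        if table not in purge_pending  # no point dumping data about to be dropped
    ]


commands = backup_commands(
    "timeslices",                                   # hypothetical database name
    ["ts_72_13221_1h", "ts_72_13100_1m"],
    purge_pending={"ts_72_13100_1m"},
)
print(commands)
```

Each argv could be passed to `subprocess.run` with stdout redirected to a per-table file; that part is omitted here.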
![Page 33: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/33.jpg)
EVOLUTION
http://www.flickr.com/photos/pfsullivan_1056/3485953405/
![Page 34: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/34.jpg)
SSD REVOLUTION
• 600GB Intel 320 SSDs
• Dell MD1220 Direct Attached shelf
• Disks are no longer the bottleneck
• Inserts in Read-optimized order are “fast enough”
![Page 35: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/35.jpg)
YOU CAN USE SSD WITH DATABASES
• 6 of 420 drives RMA’d
• March 2012 to Aug 2013
• Average 180TB lifetime writes
• 91% wear remaining
http://www.flickr.com/photos/joeshlabotnik/3584172834/
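The figures on this slide can be turned into rough rates. The arithmetic below is a back-of-the-envelope sketch that assumes all 420 drives were in service for the whole window, counts March 2012 to August 2013 as 17 months, and extrapolates wear linearly; none of those assumptions come from the slides.

```python
rma, fleet = 6, 420
months = 17  # March 2012 to Aug 2013, counted as whole months (assumption)

failure_fraction = rma / fleet                   # share of drives returned
annualized = failure_fraction * 12 / months      # simple linear annualization

wear_used = 1 - 0.91                             # 91% wear remaining
projected_tb = 180 / wear_used                   # writes at 0% remaining, if linear

print(f"returned: {failure_fraction:.2%}")
print(f"annualized: {annualized:.2%}")
print(f"projected lifetime writes: {projected_tb:.0f} TB")
```

Under these assumptions roughly 1.4% of the drives were returned over the period, and the wear counters suggest the drives were far from their write limit.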
![Page 36: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/36.jpg)
REDUNDANT ARRAY OF EXPENSIVE DISKS
• Rebuilds under load > 4 hours
• Migrated to RAID 60
• 2 x 12 disk span
• Ditch the Hot-spares
http://www.flickr.com/photos/mbk/27640225/
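RAID 60 stripes across RAID 6 spans, and each span gives up two disks' worth of capacity to parity. The capacity cost of the "2 x 12 disk span" layout can be worked out as below; the 600 GB drive size is taken from the earlier SSD slide and is an assumption about what fills this shelf.

```python
def raid60_usable_disks(spans: int, disks_per_span: int) -> int:
    """Data-bearing disks in a RAID 60 set (two parity disks per RAID 6 span)."""
    if disks_per_span < 4:
        raise ValueError("a RAID 6 span needs at least 4 disks")
    return spans * (disks_per_span - 2)


usable = raid60_usable_disks(spans=2, disks_per_span=12)
usable_gb = usable * 600  # assumed 600 GB drives

print(usable, usable_gb)
```

Each span survives two simultaneous drive failures, which is presumably what made it acceptable to drop the dedicated hot-spares.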
![Page 37: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/37.jpg)
XFS TUNING
• mkfs.xfs -s size=4096
• Mount options
  • noatime
  • nobarrier
  • inode64
  • logbsize=256k
http://www.flickr.com/photos/rocketlass/5169004165/
![Page 38: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/38.jpg)
SHARD GUARD PART DEUX
• Protect all the things!
• Kill UI queries over 75 seconds
• Kill background queries over 1 hour
• Yes, all of them
• No really, kill them, now
http://www.flickr.com/photos/chiky/7194089194/
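The kill policy can be sketched as a filter over `SHOW PROCESSLIST`-style rows. How UI and background queries are told apart is not stated in the slides; the sketch below assumes they connect as different MySQL users, and the user names are hypothetical. The 75-second and 1-hour limits are from the slide.

```python
from typing import Dict, Iterable, List, NamedTuple


class Process(NamedTuple):
    id: int
    user: str
    command: str  # e.g. "Query" or "Sleep"
    time: int     # seconds spent in the current state


# Limits in seconds per (hypothetical) database user.
LIMITS: Dict[str, int] = {"ui": 75, "background": 3600}


def kill_statements(processes: Iterable[Process], limits: Dict[str, int] = LIMITS) -> List[str]:
    """Return a KILL statement for every running query over its user's limit."""
    return [
        f"KILL {p.id};"
        for p in processes
        if p.command == "Query" and p.user in limits and p.time > limits[p.user]
    ]


processlist = [
    Process(11, "ui", "Query", 80),          # over 75 s: killed
    Process(12, "ui", "Query", 10),          # fine
    Process(13, "background", "Query", 4000),  # over 1 hour: killed
    Process(14, "ui", "Sleep", 900),         # idle connection, not a query
]
kills = kill_statements(processlist)
print(kills)
```

Running such a check on a short interval keeps a single runaway query from holding old row versions open for hours.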
![Page 39: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/39.jpg)
IF YOU DON’T BELIEVE ME…
• Delayed Job
• Long running background query
• InnoDB History List Traversal
![Page 40: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/40.jpg)
TO INFINITY AND BEYOND
http://www.flickr.com/photos/temma2/1149223191/
![Page 41: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/41.jpg)
HARDWARE V2
• Dell R620
• 2 x Intel E5-2690 @ 2.90GHz
• 96GB RAM
• MD1220 Storage Shelf
• 800GB Intel SSD S3500
http://www.flickr.com/photos/tnarik/2590037637/
![Page 42: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/42.jpg)
CONTINUOUS IMPROVEMENT
• EXT4 / ZFS / XFS
• RAID Card vs HBA
• Percona Server 5.6
• Multiple MySQL Instances
• Databases per Service
http://www.flickr.com/photos/shawnclover/8555834230/
![Page 43: Getting 100B Metrics to Disk](https://reader035.fdocuments.net/reader035/viewer/2022062404/5537eedf550346b82d8b46ec/html5/thumbnails/43.jpg)
JOIN THE TEAM NewRelic.com/jobs