Couchbase Server Scalability and Performance at LinkedIn: Couchbase Connect 2015
Transcript of Couchbase Server Scalability and Performance at LinkedIn: Couchbase Connect 2015
Benjamin (Jerry) Franz, Sr. Site Reliability Engineer
Scalability and Performance
Lessons learned taking the second-highest-QPS Couchbase server at LinkedIn from zero to awesome
Couchbase
We have a Couchbase cluster?
Day 1
Meet the Couchbase Cluster
• Three parallel clusters of 16 machines
• 64 Gbytes of RAM per machine
• 1 TB of RAID1 (spinning drives) per machine
• 6 buckets in each cluster
• Massively under-resourced
• Memory completely full
• 1 node failed out in each parallel cluster
• Disk I/O utilization: 100%, all the time
• No alerting
The immediate problems
• Unable to store new data, because the memory was full and there wasn't enough I/O capacity available to flush it to disk.
• Aggravated by nodes failing out of the cluster, reducing both available memory and disk IOPS even further.
• There was no visibility into cluster health because:
  1. We didn't know what healthy metrics should look like for Couchbase. We didn't even know which metrics were most important.
  2. Alerts were not being sent even when the cluster was in deep trouble.
The First Aid
• Configured alerting.
• Started a temporary program of semi-manual monitoring and intervention to keep the cluster from falling over as it got too far behind. When it did get too far behind, we deleted all the data in the affected buckets and restarted.
• Doubled the number of nodes (from 48 to 96) in the clusters to improve available memory and disk IOPS.
• Increased the disk fragmentation threshold for compaction from 30% to 65% to reduce disk I/O.
• Reduced metadata expiration time from 3 days to 1 day to free memory.
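The last two changes map to `couchbase-cli setting-compaction` options. A sketch of what the commands might look like (host, credentials, and exact option support depend on your Couchbase Server version; the values below are the ones from this talk):

```shell
# Raise the database fragmentation threshold that triggers compaction
# from the 30% default to 65%, and purge tombstone metadata after 1 day
# instead of 3. Host and credentials are placeholders.
couchbase-cli setting-compaction -c localhost:8091 \
  -u Administrator -p password \
  --compaction-db-percentage 65 \
  --metadata-purge-interval 1
```

Raising the fragmentation threshold trades disk space for fewer compaction passes, which is exactly the trade you want when disk I/O, not capacity, is the bottleneck.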
Node failouts - Solved
The node failouts had two interacting causes:
1. Linux Transparent HugePages were active on many nodes, causing semi-random slowdowns lasting up to several minutes when memory was defragmented, which made those nodes fail out of the cluster. Fixed by correcting the kernel settings and restarting the nodes.
2. 'Pre-failure' drives were going into data-recovery mode and causing failouts on the affected nodes during the nightly access log scan at 10:00 UTC (02:00 PST/03:00 PDT).
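The Transparent HugePages fix is the standard one for database workloads. The sysfs path below is the usual location on mainline kernels; some distributions (e.g. older RHEL) use a `redhat_transparent_hugepage` path instead:

```shell
# Disable Transparent HugePages at runtime (lost on reboot; persist it
# via your init system or a tuned profile).
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Verify: the active setting is shown in brackets, e.g. "always madvise [never]"
cat /sys/kernel/mm/transparent_hugepage/enabled
```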
Disk Persistence – Not solved
• Despite more than doubling the available system resources, tuning filesystem options for performance, and slashing the amount of data being fed to it by the application, disk utilization remained stubbornly close to 100% and the disk queues were still growing.
• The cluster had a huge amount of 'hidden I/O demand'. Because a large fraction of the data had very short TTLs but was taking up to a day to persist to disk, it was expiring in the queue before it could be persisted. This was actually doing quite a lot to keep the cluster from falling over completely, as it throttled the disk demand as the cluster became overloaded. We were now persisting twice as much data as before.
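The 'hidden I/O demand' effect can be illustrated with a toy model (all numbers here are hypothetical, not from the talk): items whose TTL is shorter than the current persistence backlog expire in the queue and never reach disk, so shrinking the backlog increases the write demand.

```python
# Toy model of "hidden I/O demand": items whose TTL is shorter than the
# time they wait in the disk-write queue expire before they can be
# persisted. All numbers are hypothetical, for illustration only.

def persisted_share(ttls_seconds, persistence_delay_seconds):
    """Fraction of queued items that survive long enough to reach disk."""
    survivors = [t for t in ttls_seconds if t > persistence_delay_seconds]
    return len(survivors) / len(ttls_seconds)

# A hypothetical workload: 70% 1-minute TTLs, 20% 1-hour, 10% 3-day.
ttls = [60] * 70 + [3600] * 20 + [86400 * 3] * 10

# With a day-long persistence backlog, only the 3-day items reach disk...
print(persisted_share(ttls, 86400))  # 0.1
# ...but once the backlog drops to minutes, far more data must be written.
print(persisted_share(ttls, 120))    # 0.3
```

In this made-up mix, fixing the backlog triples the volume actually hitting disk, which is why the cluster ended up "persisting twice as much data as before" once it got healthier.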
Cluster Health Visibility – Solved
• Alerts were being sent to the appropriate people – the cluster was no longer suffering outages without notice.
• Critical cluster metrics were identified and were being used for health monitoring and to measure performance-tuning improvements.
Cluster Health Visibility – Solved
The most important performance metrics:
• ep_diskqueue_items – The number of items waiting to be persisted. This should be a somewhat stable number day to day. If it has a persistently upward trend, the cluster is unable to keep up with its disk I/O requirements.
• ep_storage_age – The age of the most recently persisted item. This has been a critical metric for quantifying the effects of configuration changes. A healthy cluster should keep this number close to or below 1 second on average. We started with values approaching days.
• vb_active_perc_mem_resident – The percentage of items in the RAM cache. For most clusters at LinkedIn it should be 100%. If it falls below that, the cluster is probably underprovisioned and taking a big performance hit.
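These stats can be pulled with the `cbstats` tool that ships with Couchbase Server (e.g. `cbstats localhost:11210 -b <bucket> all`). A minimal health check might parse that `key : value` output and apply the thresholds above. The sample output string and the unit assumption for `ep_storage_age` below are illustrative, not authoritative:

```python
# Minimal health check over cbstats-style "key : value" output.
# SAMPLE_STATS is fabricated for illustration; in practice the text would
# come from running:  cbstats localhost:11210 -b <bucket> all

SAMPLE_STATS = """
 ep_diskqueue_items:            1842
 ep_storage_age:                732000000
 vb_active_perc_mem_resident:   97
"""

def parse_stats(text):
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        stats[key.strip()] = int(value.strip())
    return stats

def health_warnings(stats):
    warnings = []
    # Assuming ep_storage_age is reported in microseconds here; check the
    # units for your server version. Healthy is ~1 second or less.
    if stats["ep_storage_age"] > 1_000_000:
        warnings.append("storage age above 1s - disk persistence falling behind")
    # Anything below fully memory-resident usually means underprovisioning.
    if stats["vb_active_perc_mem_resident"] < 100:
        warnings.append("working set no longer fully RAM-resident")
    return warnings

print(health_warnings(parse_stats(SAMPLE_STATS)))
```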
Overall Status Update
Servers. Lots of servers.
My best estimate was that to meet our I/O requirements we would have to at least double our total node count again, to 192 servers total (3 x 64).
This was getting expensive.
It was time to change up my strategy
SSDs
SSDs
Initial Integration Testing
• 2 x 550 GB Virident SSDs were integrated into one of the sub-clusters
• Reduced the cluster to 16 nodes to test under heavier load
• Write I/O on the SSDs shot up to several times the rate of the HDDs
• Performance scaling indicated that in the final configuration we would burn through the SSDs' lifetime write capacity in less than one year
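The endurance projection is simple arithmetic: rated lifetime writes divided by the sustained write rate. The talk doesn't give the actual numbers, so everything in this sketch is hypothetical:

```python
# Back-of-the-envelope SSD endurance estimate. All numbers are
# hypothetical; the talk does not give actual write rates or the drives'
# rated endurance.

def drive_lifetime_years(rated_endurance_tb, write_rate_mb_per_s):
    """Years until a drive's rated lifetime write capacity is exhausted."""
    seconds_per_year = 365 * 24 * 3600
    tb_written_per_year = write_rate_mb_per_s * seconds_per_year / 1_000_000
    return rated_endurance_tb / tb_written_per_year

# e.g. a drive rated for 3,000 TB written, sustaining 150 MB/s of writes:
print(f"{drive_lifetime_years(3000, 150):.2f} years")  # 0.63 years
```

At numbers like these a drive wears out in well under a year, which is why the team moved to larger drives: more capacity spreads the same write volume over more flash, extending service life.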
SSDs
SSD Strategy Tuning
• Switched to 2200 GB Virident SSDs to extend service life
• Reduced cluster size to 8 nodes per sub-cluster (24 nodes total)
Full Scale SSD Impact
For the first time since the cluster was turned on nearly a year ago, almost all of our data was getting persisted to disk.
So we converted the other two clusters as well.
Done?
Not Yet
Turning it up to 11
While we were no longer completely on fire, we weren't yet awesome.
We were still taking up to 40 minutes to persist new data.
It wasn't the drives at this point – it was the application.
It wasn't keeping up with the drives.
Preparing Couchbase for Ludicrous Speed
• Increased the number of reader/writer threads to 8
• Consolidated the buckets (4 high-QPS buckets -> 2 high-QPS buckets)
• Increased the frequency of disk cleanup (exp_pager_stime) to every 10 minutes
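The expiry-pager interval can be changed on a running node with `cbepctl`, which is bundled with Couchbase Server. Host, port, and bucket name below are placeholders:

```shell
# Run the expiry pager (cleanup of expired items) every 600 seconds
# (10 minutes). Host/port and bucket are placeholders.
cbepctl localhost:11210 -b my_bucket set flush_param exp_pager_stime 600
```

A more frequent expiry pager matters here because of the short-TTL workload: reclaiming expired items sooner keeps them out of the disk-write queues entirely.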
And buckle your seatbelt
• 75% writes (sets + incr) / 25% reads – 13-byte values, 25-byte keys on average
• 2.5 billion items (+ 1 replica)
• 600 Gbytes of RAM / 3 Tbytes of disk in use on average
• Average store latency ~ 0.4 milliseconds
• 99th percentile store latency ~ 2.5 milliseconds
• Average get latency ~ 0.8 milliseconds
• 99th percentile get latency ~ 8 milliseconds
The End