Couchbase Server Scalability and Performance at LinkedIn: Couchbase Connect 2015

Benjamin (Jerry) Franz, Sr. Site Reliability Engineer

Transcript of Couchbase Server Scalability and Performance at LinkedIn: Couchbase Connect 2015



Scalability and Performance

Lessons learned taking the second highest QPS Couchbase server at LinkedIn from zero to awesome.


We have a Couchbase cluster?

Day 1

Meet the Couchbase Cluster
• Three parallel clusters of 16 machines
• 64 Gbytes of RAM per machine
• 1 TB of RAID1 (spinning drives) per machine
• 6 buckets in each cluster
• Massively under-resourced
• Memory completely full
• 1 node failed out in each parallel cluster
• Disk I/O utilization: 100%, all the time
• No alerting

The immediate problems
• Unable to store new data because the memory was full and there wasn’t enough I/O capacity available to flush it to disk.
• Aggravated by nodes failing out of the cluster, reducing both available memory and disk IOPS even further.
• There was no visibility into cluster health because:
  1. We didn’t know what healthy metrics should look like for Couchbase. We didn’t even know which metrics were most important.
  2. Alerts were not being sent even when the cluster was in deep trouble.

The First Aid
• Configured alerting.
• Started a temporary program of semi-manual monitoring and intervention to keep the cluster from falling over as it got too far behind. When it did get too far behind, we deleted all the data in the affected buckets and restarted.
• Doubled the number of nodes (from 48 to 96) in the clusters to improve available memory and disk IOPS.
• Increased the disk fragmentation threshold for compaction from 30% to 65% to reduce disk I/O.
• Reduced metadata expiration time from 3 days to 1 day to free memory. (A sketch of these last two settings changes follows this list.)
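Both of those settings live in the cluster-wide auto-compaction configuration. Below is a minimal sketch of making equivalent changes through the Couchbase admin REST API; the host and credentials are placeholders, and mapping “metadata expiration time” to the metadata purge interval (purgeInterval, default 3 days) is an assumption rather than something stated in the talk.

```python
import requests

# Hedged sketch: adjust cluster-wide auto-compaction via the Couchbase admin REST API.
# Host and credentials are illustrative placeholders.
CLUSTER = "http://couchbase-node-01:8091"
AUTH = ("Administrator", "password")

resp = requests.post(
    f"{CLUSTER}/controller/setAutoCompaction",
    auth=AUTH,
    data={
        # Raise the database fragmentation trigger from the 30% default to 65%
        # so compaction runs less often and competes less for disk I/O.
        "databaseFragmentationThreshold[percentage]": 65,
        # Metadata (tombstone) purge interval in days: 3 -> 1 to free memory sooner.
        # This is one reading of "metadata expiration time" in the slides.
        "purgeInterval": 1,
        # Required flag: run database and view compaction sequentially.
        "parallelDBAndViewCompaction": "false",
    },
)
resp.raise_for_status()
```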

Node failouts – Solved
The node failouts had two interacting causes:
1. Linux Transparent HugePages were active on many nodes, causing semi-random slowdowns lasting up to several minutes when memory was defragmented, which made those nodes fail out of the cluster. Fixed by correcting the kernel settings and restarting the nodes (a sketch of the check follows below).
2. ‘Pre-failure’ drives were going into data recovery mode and causing failouts on the affected nodes during the nightly access log scan at 10:00 UTC (02:00 PST / 03:00 PDT).
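The slides don’t spell out which kernel settings were changed; disabling Transparent HugePages via sysfs is the standard remediation for Couchbase nodes of that era, so here is a minimal sketch under that assumption.

```python
# Hedged sketch: report and disable Transparent HugePages on a node.
# The sysfs paths are the usual Linux locations; persisting the change across
# reboots (kernel boot parameters, init scripts, etc.) is not shown.
from pathlib import Path

THP_FILES = [
    Path("/sys/kernel/mm/transparent_hugepage/enabled"),
    Path("/sys/kernel/mm/transparent_hugepage/defrag"),
]

for f in THP_FILES:
    if not f.exists():
        continue
    current = f.read_text().strip()      # e.g. "[always] madvise never"
    print(f"{f}: {current}")
    if "[never]" not in current:
        f.write_text("never")            # requires root
```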

Disk Persistence – Not solved
• Despite more than doubling the available system resources, tuning filesystem options for performance, and slashing the amount of data being fed to it by the application, disk utilization remained stubbornly close to 100% and the disk queues were still growing.
• The cluster had a huge amount of ‘hidden I/O demand’. Because a large fraction of the data had very short TTLs but was taking up to a day to persist to disk, it was expiring in the queue before it could ever be persisted. This was actually doing quite a lot to keep the cluster from falling over completely, since it throttled the disk demand as the cluster became overloaded. Once the first aid freed up capacity, some of that hidden demand surfaced: we were now persisting twice as much data as before.

Cluster Health Visibility – Solved
• Alerts were being sent to the appropriate people; the cluster was no longer suffering outages without notice.
• Critical cluster metrics were identified and were being used for health monitoring and to measure performance-tuning improvements.

Cluster Health Visibility – Solved
The most important performance metrics (a polling sketch follows this list):
• ep_diskqueue_items – The number of items waiting to be persisted. This should be a somewhat stable number day to day. If it has a persistently upward trend, the cluster is unable to keep up with its disk I/O requirements.
• ep_storage_age – The age of the most recently persisted item. This has been a critical metric for quantifying the effects of configuration changes. A healthy cluster should keep this number close to or below 1 second on average. We started with values approaching days.
• vb_active_perc_mem_resident – The percentage of items in the RAM cache. For most clusters at LinkedIn it should be 100%. If it falls below that, the cluster is probably underprovisioned and taking a big performance hit.
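All three metrics are exposed by cbstats. Below is a minimal polling sketch, assuming the default install path and memcached port (11210); the host, bucket name, and alert thresholds are illustrative and follow the rules of thumb above.

```python
import subprocess

# Hedged sketch: pull the three key stats from one node with cbstats and flag
# values that look unhealthy by the rules of thumb from the talk.
CBSTATS = "/opt/couchbase/bin/cbstats"        # default install path (assumption)
HOST = "couchbase-node-01:11210"              # illustrative node
BUCKET = "example_bucket"                     # illustrative bucket name
WATCH = {"ep_diskqueue_items", "ep_storage_age", "vb_active_perc_mem_resident"}

out = subprocess.run(
    [CBSTATS, HOST, "all", "-b", BUCKET],
    capture_output=True, text=True, check=True,
).stdout

stats = {}
for line in out.splitlines():
    key, _, value = line.partition(":")
    if key.strip() in WATCH:
        stats[key.strip()] = float(value)

if stats.get("ep_storage_age", 0.0) > 1.0:           # want ~1 second or less on average
    print("WARN: new items are taking too long to persist:", stats["ep_storage_age"])
if stats.get("vb_active_perc_mem_resident", 100.0) < 100.0:
    print("WARN: active data is no longer 100% memory-resident")
print(stats)
```

In practice you would trend ep_diskqueue_items over time rather than alert on a single sample, since it is the persistent upward slope, not any one value, that signals trouble.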


Overall Status Update

Servers. Lots of servers.
My best estimate was that to meet our I/O requirements we would have to at least double our total node count again, to 192 servers total (3 x 64).

This was getting expensive.

It was time to change up my strategy.

SSDs

Initial Integration Testing
• 2 x 550GB Virident SSDs were integrated into one of the sub-clusters
• Reduced the cluster to 16 nodes to test under heavier load
• Write I/O on the SSDs shot up to multiple times the rate of the HDDs
• Performance scaling indicated that in the final configuration we would burn through the SSD lifetime write capacity in less than one year (a back-of-the-envelope sketch of that math follows this list)
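For context, here is a hedged illustration of how that kind of endurance estimate falls out of a drive’s rated lifetime write capacity and an observed sustained write rate. None of the numbers below are from the talk; they are placeholders that show the arithmetic only.

```python
# Hypothetical figures purely for illustration; substitute measured values.
TB = 1e12

rated_lifetime_writes_tb = 900        # drive endurance rating, total TB written (assumed)
sustained_write_rate_mb_s = 40        # observed per-drive write rate, MB/s (assumed)

bytes_per_day = sustained_write_rate_mb_s * 1e6 * 86_400
years_to_wear_out = (rated_lifetime_writes_tb * TB) / bytes_per_day / 365
print(f"~{years_to_wear_out:.1f} years to exhaust rated write capacity")
```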

SSD Strategy Tuning
• Switched to 2200 GB Virident SSDs to extend service life
• Reduced cluster size to 8 nodes per sub-cluster (24 nodes total)

Full Scale SSD Impact
For the first time since the cluster was turned on nearly a year ago, almost all of our data was getting persisted to disk.

So we converted the other two clusters as well.

Done?

Not yet.

Turning it up to 11
While we were no longer completely on fire, we weren’t yet awesome.

We were still taking up to 40 minutes to persist new data.

It wasn’t the drives at this point – it was the application. It wasn’t keeping up with the drives.

Preparing Couchbase for Ludicrous Speed
• Increased the number of reader/writer threads to 8
• Consolidated the buckets (4 high QPS buckets -> 2 high QPS buckets)
• Increased the frequency of disk cleanup (exp_pager_stime) to run every 10 minutes (a sketch of that change follows this list)
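The expiry-pager interval can be changed per node and per bucket with cbepctl. Below is a minimal sketch; the install path, node list, and bucket name are illustrative, and the thread-count and bucket-consolidation changes were made through bucket settings and the application rather than this tool.

```python
import subprocess

# Hedged sketch: set exp_pager_stime (the expiry pager interval) to 600 seconds,
# i.e. run the disk cleanup every 10 minutes, on each node for one bucket.
CBEPCTL = "/opt/couchbase/bin/cbepctl"                 # default install path (assumption)
NODES = ["couchbase-node-01", "couchbase-node-02"]     # illustrative node list
BUCKET = "example_bucket"                              # illustrative bucket name

for node in NODES:
    subprocess.run(
        [CBEPCTL, f"{node}:11210", "-b", BUCKET,
         "set", "flush_param", "exp_pager_stime", "600"],
        check=True,
    )
```

Note that cbepctl changes apply live but do not persist across a node restart, so they generally need to be reapplied (or baked into provisioning) after maintenance.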

And buckle your seatbelt
• 75% writes (sets + incr) / 25% reads – 13 byte values, 25 byte keys on average
• 2.5 billion items (+ 1 replica)
• 600 Gbytes of RAM / 3 Tbytes of disk in use on average
• Average store latency ~ 0.4 milliseconds
• 99th percentile store latency ~ 2.5 milliseconds
• Average get latency ~ 0.8 milliseconds
• 99th percentile get latency ~ 8 milliseconds
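Those numbers hang together on a quick back-of-the-envelope check. A sketch is below; the ~56 bytes of per-item metadata is a rough overhead commonly cited for Couchbase buckets of that era and is an assumption here, not a figure from the talk.

```python
# Hedged sanity check of the memory footprint implied by the item counts above.
items = 2.5e9                            # active items (from the talk)
copies = 2                               # active + 1 replica (from the talk)
key_bytes, value_bytes = 25, 13          # averages from the talk
metadata_bytes = 56                      # per-item overhead (assumption)

total_gb = items * copies * (key_bytes + value_bytes + metadata_bytes) / 1e9
print(f"~{total_gb:.0f} GB of keys, values, and metadata")   # ≈ 470 GB
# Broadly consistent with the ~600 Gbytes of RAM reported in use on average.
```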



The End