C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

PRACTICE MAKES PERFECT: EXTREME CASSANDRA OPTIMIZATION

@AlTobey
Tech Lead, Compute and Data Services

#CASSANDRA13

Saturday, June 15, 2013

I didn’t name this talk. The conference people did, but I like it a lot.

Outline

⁍ About me / Ooyala
⁍ How not to manage your Cassandra clusters
⁍ Make it suck less
⁍ How to be a heuristician
⁍ Tools of the trade
⁍ More Settings
⁍ Show & Tell

@AlTobey

⁍ Tech Lead, Compute and Data Services at Ooyala, Inc.
⁍ C&D team is #devops: 3 ops, 3 eng, me
⁍ C&D team is #bdaas: Big Data as a Service
⁍ ~100 Cassandra nodes, expanding quickly
⁍ Obligatory: we’re hiring

⁍ I won’t go into devops today, but I’m happy to talk about it later.
⁍ 2 years at Ooyala: SRE -> TL Tools Team -> C&D
⁍ C&D builds BDaaS for Ooyala: fully managed Cassandra / Spark / Hadoop / Zookeeper / Kafka
⁍ 11 clusters, 5-36 nodes, working on something big
⁍ Before: engineers deployed systems (expensive, error-prone). After: engineers use APIs and consult.

Ooyala

⁍ Founded in 2007
⁍ 230+ employees globally
⁍ 200M unique users, 110+ countries
⁍ Over 1 billion videos played per month
⁍ Over 2 billion analytic events per day


Ooyala & Cassandra

Ooyala has been using Cassandra since v0.4.

Use cases:
⁍ Analytics data (real-time and batch)
⁍ Highly available K/V store
⁍ Time series data
⁍ Play head tracking (cross-device resume)
⁍ Machine Learning Data


Ooyala: Legacy Platform

[Diagram labels: players, loggers, API, ABE Service, cassandra nodes, hadoop nodes, S3; annotations "START HERE" and "read-modify-write"]

⁍ Ruby MR -- CDH3u4 -- 80 Dell Blades
⁍ Cassandra 0.4 --> 1.1 / DSE 3.x
⁍ 18x Dell r509, 48GiB RAM, 6x 600G 15k SAS / MD RAID 5 -- more on RAID later
⁍ We’ve scaled our data volume by 2x yearly for the last 4 years.

Avoiding read-modify-write

cassandra13_drinks column family (memTable)

Row key     Tuesday  Wednesday
Albert      6        0
Evan        0        0
Frank       3        3
Kelvin      0        0
Krzysztof   0        0
Phillip     12       0

⁍ CF to track how much I expect my team at Ooyala to drink
⁍ Row keys are names
⁍ Column keys are days
⁍ Values are a count of drinks

memTable

Row key   Tuesday  Wednesday
Al        2        0

ssTable (rows not preserved in this transcript)

⁍ Next day, after a flush
⁍ I’m speaking so I decided to drink less
⁍ Phillip informs me that he has quit drinking

memTable

Row key   Tuesday  Wednesday
Albert    22       0

ssTable

Row key   Tuesday  Wednesday
Albert    2        0

ssTable (rows not preserved in this transcript)

⁍ I’m drinking with all you people so I decide to add 20
⁍ read 2, add 20, write 22


ssTable

Row key   Tuesday  Wednesday
Albert    22       0

⁍ After compaction & conflict resolution
⁍ Overwriting the same value is just fine! Works really well for some patterns such as time-series data
⁍ Separate read/write streams are handy for debugging, but not a big deal
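The conflict resolution in the drinks example can be sketched in a few lines of Python. This is a toy model, not Cassandra's code: the timestamps 1, 2, 3 are made up to stand for "older flush", "newer flush", and "current memTable", and ties between equal timestamps are ignored here.

```python
from typing import Dict, List, Tuple

# A cell is (value, write_timestamp); a table maps (row_key, column) -> cell.
Cell = Tuple[int, int]
Table = Dict[Tuple[str, str], Cell]


def merge_last_write_wins(tables: List[Table]) -> Table:
    """Merge tables, keeping the cell with the highest timestamp per key."""
    merged: Table = {}
    for table in tables:
        for key, cell in table.items():
            if key not in merged or cell[1] > merged[key][1]:
                merged[key] = cell
    return merged


# Hypothetical timestamps: 1 = first flush, 2 = second flush, 3 = latest write.
old_sstable: Table = {("Albert", "Tuesday"): (6, 1), ("Albert", "Wednesday"): (0, 1)}
new_sstable: Table = {("Albert", "Tuesday"): (2, 2), ("Albert", "Wednesday"): (0, 2)}
memtable: Table = {("Albert", "Tuesday"): (22, 3), ("Albert", "Wednesday"): (0, 3)}

compacted = merge_last_write_wins([old_sstable, new_sstable, memtable])
print(compacted[("Albert", "Tuesday")][0])  # 22
```

The order in which the tables are merged does not matter, which is why rewriting a whole value is cheap on the write path: nothing has to be read first.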

2011: 0.6 ➜ 0.8

⁍ Migration is still a largely unsolved problem
⁍ Wrote a tool in Scala to scrub data and write via Thrift
⁍ Rebuilt indexes - faster than copying

[Diagram labels: hadoop, cassandra, GlusterFS P2P, cassandra, Thrift, Scala Map/Reduce]

⁍ Because of some legacy choices, we knew we had a bunch of expired tombstones
⁍ GlusterFS: userspace, ionice(1), fast & easy
⁍ sstabledump, etc. were TOO SLOW; Scala MR only took a week (with production running too!)

Changes: 0.6 ➜ 0.8

⁍ Cassandra 0.8
⁍ 24GiB heap
⁍ Sun Java 1.6 update
⁍ Linux 2.6.36
⁍ XFS on MD RAID5
⁍ Disabled swap or at least vm.swappiness=1

⁍ More on XFS settings & bugs later
⁍ Got significant improvements from RAID & readahead tuning (more later)
⁍ Al’s first rule of tuning databases: disable swap or GTFO
⁍ Fixed lots of applications by simply disabling swap

2012: Capacity Increase

⁍ 18 nodes ➜ 36 nodes
⁍ DSE 3.0
⁍ Stale tombstones again!
⁍ No downtime!

[Diagram labels: cassandra, GlusterFS P2P, Scala Map/Reduce, Thrift, DSE 3.0]

⁍ I switched teams to work on Hastur and didn’t document enough, so repairs were forgotten again
⁍ The 60 day GC Grace Period had expired ... 3 months ago
⁍ rsync is not enough for hardware moves: do rebuilds!
⁍ Use DSE Map/Reduce to isolate most of the load from production

System Changes: Apache 1.0 ➜ DSE 3.0

⁍ DSE 3.0 installed via apt packages
⁍ Unchanged: heap, distro
⁍ Ran much faster this time!
⁍ Mistake: Moved to MD RAID 0. Fix: RAID10 or RAID5, MD, ZFS, or btrfs
⁍ Mistake: Running on Ubuntu Lucid. Fix: Ubuntu Precise

⁍ Previously deployed with Capistrano
⁍ DSE 3’s Hadoop is compiled on Debian 6, so native components will not load on 10.04’s libc
⁍ Still gradually rebuilding nodes from RAID0 ➜ RAID5 and Lucid ➜ Precise

Config Changes: Apache 1.0 ➜ DSE 3.0

⁍ Schema: compaction_strategy = LCS
⁍ Schema: bloom_filter_fp_chance = 0.1
⁍ Schema: sstable_size_in_mb = 256
⁍ Schema: compression_options = Snappy
⁍ YAML: compaction_throughput_mb_per_sec: 0

⁍ LCS is a huge improvement in operations life (no more major compactions)
⁍ Bloom filters were tipping over a 24GiB heap
⁍ With lots of data per node, sstable sizes in LCS must be MUCH bigger
  ⁍ > 100,000 open files slows everything down, especially startup
  ⁍ 256MB vs. 5MB is a 50x reduction in file count
⁍ Compaction can’t keep up: even huge rates don’t work, so throttling must be disabled (compaction_throughput_mb_per_sec: 0)
  ⁍ try to adjust heap, etc. so you’re flushing at nearly full memtables to reduce compaction needs
  ⁍ backreference RMW?
  ⁍ might be fixed in >= 1.2
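A rough back-of-the-envelope for the two schema settings above, as a Python sketch. The 1 TiB per node figure is hypothetical, every sstable is assumed to be full-sized, and the Bloom filter formula is the textbook optimum rather than Cassandra's exact sizing, so treat the numbers as orders of magnitude only.

```python
import math


def sstable_count(data_bytes: int, sstable_size_mb: int) -> int:
    """Approximate number of sstables if every one is full-sized."""
    return math.ceil(data_bytes / (sstable_size_mb * 1024 * 1024))


def bloom_bits_per_key(fp_chance: float) -> float:
    """Textbook optimal Bloom filter size: -ln(p) / (ln 2)^2 bits per key."""
    return -math.log(fp_chance) / (math.log(2) ** 2)


node_data = 1024 ** 4  # hypothetical: 1 TiB of data on one node

print(sstable_count(node_data, 5))    # 209716 sstables at 5 MB
print(sstable_count(node_data, 256))  # 4096 sstables at 256 MB
print(round(bloom_bits_per_key(0.01), 2))  # 9.59 bits per key
print(round(bloom_bits_per_key(0.1), 2))   # 4.79 bits per key
```

Each sstable is several files on disk, so the open-file count is a multiple of these numbers. Going from a false-positive chance of 0.01 to 0.1 roughly halves the filter memory per key in this model.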

2013: Datacenter Move

⁍ 36 nodes ➜ lots more nodes
⁍ As usual, no downtime!

[Diagram labels: DSE 3.1, replication, DSE 3.1]

⁍ Size omitted in published slides. I was asked not to publish yet; I will tweet, etc. in a couple of weeks.
⁍ Wasn’t the original plan, but we save a lot of $$ by leaving the old cage
⁍ Prep for next-generation architecture!

Coming Soon for Cassandra at Ooyala

Upcoming use cases:
⁍ Store every event from our players at full resolution
⁍ Cache code for our Spark job server
⁍ AMPLab Tachyon backend?

⁍ This is the intro for the next slide / diagram.
⁍ Considering Astyanax or CQL3 backend for Tachyon so we can contribute it back

Next Generation Architecture: Ooyala Event Store

[Diagram labels: players, loggers, API, kafka, ingest, spark, job server, DSE 3.1, Tachyon?]

⁍ Look mom! No Hadoop! Remember what I said about latency?
⁍ But we’re not just running DSE on these machines. They’re running: DSE, Spark, KVM, and CDH3u4 (legacy)
⁍ Secret is cgroups!
⁍ Also, ZFS (later)

There’s more to tuning than performance:

⁍ Security
⁍ Cost of Goods Sold
⁍ Operations / support
⁍ Developer happiness
⁍ Physical capacity (cpu/memory/network/disk)
⁍ Reliability / Resilience
⁍ Compromise

Shifting themes: philosophy of tuning
⁍ Security is always #1: The decision to disable security features is an important decision!
⁍ Example: EC2 instance sizes vary wildly in consistency and raw performance
⁍ Leveled vs. Size Tiered compaction, ZFS/LVM/MDRAID, bare metal vs. EC2
⁍ How much of this stuff do my devs need to know? How much work is it to get a new KS/CF?
⁍ Speed of node rebuilds, risk incurred by extended rebuilds, speed of repair
  ⁍ e.g. it takes a full 24 hours to repair each node in our 36-node cluster, so > 1 month to repair the cluster
⁍ Repeatable configurations: do future admins have to remember to do stuff or is it automated?
⁍ Look up “John Allspaw Resilience”
⁍ You only have access to EC2 or old hardware, or your company has an OS/filesystem/settings policy (e.g. my $PREVIOUS_JOB: CentOS 5.3, Linux 2.6.18.x, hardened distro)

There are others of course.
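The repair arithmetic in the notes matters because a full repair cycle has to finish inside the GC grace period, or deleted data can come back. A minimal sketch, assuming repairs run one node at a time; the 60-day figure is the grace period mentioned earlier in the talk, and 10 days is Cassandra's stock gc_grace_seconds of 864000.

```python
def repair_cycle_days(nodes: int, hours_per_node: float) -> float:
    """Days to repair every node once, one node at a time."""
    return nodes * hours_per_node / 24.0


def fits_in_gc_grace(nodes: int, hours_per_node: float, gc_grace_days: float) -> bool:
    """True if a full sequential repair cycle finishes inside gc_grace."""
    return repair_cycle_days(nodes, hours_per_node) <= gc_grace_days


print(repair_cycle_days(36, 24))     # 36.0
print(fits_in_gc_grace(36, 24, 60))  # True
print(fits_in_gc_grace(36, 24, 10))  # False
```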

I am not a scientist ... heuristician?

⁍ I’d love to be more scientific, but production comes first
⁍ Sometimes you have to make educated guesses
⁍ It’s not as difficult as it’s made out to be
⁍ Your brain is great at heuristics. Trust it.
⁍ Concentrate on bottlenecks
⁍ Make incremental changes
⁍ Read Malcolm Gladwell’s “Blink”

⁍ A truly scientific approach would take a lot of time and resources.
⁍ When under time pressure and things are slow, you have to move fast and measure “by the seat of your pants”
⁍ Be educated, do research, and make sensible decisions without months of testing; be prepared to do better next time
⁍ It’s actually pretty fast and easy this way!
⁍ More on what tools I use later on.

The OODA Loop

Observe, Orient, Decide, Act:
⁍ Observe the system in production under load
⁍ Make small, safe changes
⁍ Observe
⁍ Commit or Revert

⁍ Understand YOUR production workload first!
⁍ Look at Opscenter latency numbers
⁍ cl-netstat.pl (later)
⁍ Examples:
  ⁍ Changing /proc/sys/vm/dirty_background_ratio is fairly safe and shows results quickly.
  ⁍ Some network settings can take your node offline temporarily or require manual intervention.
  ⁍ Changing the compaction scheme requires a lot of time and has other implications.
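The observe / change / observe / commit-or-revert loop can be written down as a tiny harness. This is only an illustration: `try_change`, the 5% regression tolerance, and the fake measure/apply/revert functions are all made up for the example, and a real version would read latency from your own monitoring.

```python
from typing import Callable


def try_change(
    measure: Callable[[], float],
    apply: Callable[[], None],
    revert: Callable[[], None],
    max_regression: float = 0.05,
) -> bool:
    """Apply one change, re-measure, and keep it only if latency holds.

    Returns True if the change was kept, False if it was reverted.
    """
    baseline = measure()  # observe before touching anything
    apply()               # one small change
    after = measure()     # observe again under the same load
    if after > baseline * (1.0 + max_regression):
        revert()
        return False
    return True


# Toy stand-ins: a dict plays the role of the system being tuned.
state = {"dirty_background_ratio": 10, "latency_ms": 4.0}


def fake_measure() -> float:
    return state["latency_ms"]


def fake_apply() -> None:
    state["dirty_background_ratio"] = 2
    state["latency_ms"] = 3.5  # pretend the change helped


def fake_revert() -> None:
    state["dirty_background_ratio"] = 10
    state["latency_ms"] = 4.0


kept = try_change(fake_measure, fake_apply, fake_revert)
print(kept, state["dirty_background_ratio"])  # True 2
```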

Testing Shiny Things

⁍ Like kernels
⁍ And Linux distributions
⁍ And ZFS
⁍ And btrfs
⁍ And JVMs & parameters
⁍ Test them in production!

⁍ Testing stuff in a lab is fine, if you have one and you have the time.
⁍ Take (responsible) advantage of Cassandra’s resilience:
  ⁍ test things you think should work well in production on ONE node or a couple of nodes well spaced out.

Testing Shiny Things: In Production

[Diagram: cluster nodes, mostly ext4, with individual nodes running ZFS, btrfs, and a kernel upgrade]

⁍ Use your staging / non-prod environments first if you have them (some people don’t, and that’s unfortunate, but it happens)
⁍ Test things you think should work well in production on ONE node or a couple of nodes well spaced out.

Brendan Gregg’s Tool Chart

http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x

⁍ Brendan Gregg’s chart is so good, I just copied it for now.
⁍ Original: http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x
⁍ I’ll briefly talk about a few



dstat -lrvn 10

⁍ Just like vmstat but prettier and does way more
⁍ 35 lines of output = about 5 minutes of 10s snapshots
⁍ What’s interesting?
  ⁍ IO wait starting at line 5, but all numbers are going up, so this is probably during a map/reduce job
  ⁍ IO wait is high, but disk throughput isn’t impressive at all
  ⁍ ~2 blocked “procs” (which includes threads)

Not bothering to tune this right now because production latency is fine.

cl-netstat.pl

https://github.com/tobert/perl-ssh-tools

⁍ Home grown.
⁍ Requires no software on the target machines except for SSH.
⁍ Recent Net::SSH2 supports ssh-agent



iostat -x 1

⁍ Mostly I just look at the *wait numbers here.
⁍ Great for finding a bad disk with high latency.

htop

⁍ Per-CPU utilization bars are nice
⁍ Displays threads by default (hit “H” in plain top)
⁍ Very configurable!
⁍ For example: 1 thread at 100% CPU is usually the GC

jconsole

⁍ Looks like I can reduce the heap size on this cluster, but should probably increase -Xmn to 100MB * (physical cores), not counting hypercores


opscenter


⁍ It looks better on a high-resolution display ;)

nodetool ring

10.10.10.10  Analytics  rack1  Up  Normal  47.73 MB    1.72%   101204669472175663702469172037896580098
10.10.10.10  Analytics  rack1  Up  Normal  63.94 MB    0.86%   102671403812352122596707855690619718940
10.10.10.10  Analytics  rack1  Up  Normal  85.73 MB    0.86%   104138138152528581490946539343342857782
10.10.10.10  Analytics  rack1  Up  Normal  47.87 MB    0.86%   105604872492705040385185222996065996624
10.10.10.10  Analytics  rack1  Up  Normal  39.73 MB    0.86%   107071606832881499279423906648789135466
10.10.10.10  Analytics  rack1  Up  Normal  40.74 MB    1.75%   110042394566257506011458285920000334950
10.10.10.10  Analytics  rack1  Up  Normal  40.08 MB    2.20%   113781420866907675791616368030579466301
10.10.10.10  Analytics  rack1  Up  Normal  56.19 MB    3.45%   119650151395618797017962053073524524487
10.10.10.10  Analytics  rack1  Up  Normal  214.88 MB   11.62%  139424886777089715561324792149872061049
10.10.10.10  Analytics  rack1  Up  Normal  214.29 MB   2.45%   143588210871399618110700028431440799305
10.10.10.10  Analytics  rack1  Up  Normal  158.49 MB   1.76%   146577368624928021690175250344904436129
10.10.10.10  Analytics  rack1  Up  Normal  40.3 MB     0.92%   148140168357822348318107048925037023042

⁍ hotspots
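The ownership column in that listing follows from the gaps between tokens, so hotspots can be checked with a few lines of Python. This sketch assumes RandomPartitioner (a 2^127 token space), which is consistent with the percentages shown; the 5% threshold for flagging a node is arbitrary, picked only for this example.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token space (assumed)

# (load in MB, token) for consecutive rows of the nodetool ring listing.
rows = [
    (47.73, 101204669472175663702469172037896580098),
    (63.94, 102671403812352122596707855690619718940),
    (85.73, 104138138152528581490946539343342857782),
    (47.87, 105604872492705040385185222996065996624),
    (39.73, 107071606832881499279423906648789135466),
    (40.74, 110042394566257506011458285920000334950),
    (40.08, 113781420866907675791616368030579466301),
    (56.19, 119650151395618797017962053073524524487),
    (214.88, 139424886777089715561324792149872061049),
    (214.29, 143588210871399618110700028431440799305),
    (158.49, 146577368624928021690175250344904436129),
    (40.3, 148140168357822348318107048925037023042),
]


def owned_percent(prev_token: int, token: int) -> float:
    """Share of the ring between the previous token and this one."""
    return ((token - prev_token) % RING_SIZE) * 100.0 / RING_SIZE


# The first row is skipped because its predecessor is not in the listing.
for (_, prev), (load, token) in zip(rows, rows[1:]):
    pct = owned_percent(prev, token)
    flag = "  <-- hotspot?" if pct > 5.0 else ""
    print(f"{pct:6.2f}%  {load:7.2f} MB{flag}")
```

The node flagged here is the one carrying 214.88 MB: it owns about 11.62% of the ring while most of its neighbours own under 2%.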

nodetool cfstats

Keyspace: gostress
  Read Count: 0
  Read Latency: NaN ms.
  Write Count: 0
  Write Latency: NaN ms.
  Pending Tasks: 0
    Column Family: stressful
    SSTable count: 1
    Space used (live): 32981239
    Space used (total): 32981239
    Number of Keys (estimate): 128
    Memtable Columns Count: 0
    Memtable Data Size: 0
    Memtable Switch Count: 0
    Read Count: 0
    Read Latency: NaN ms.
    Write Count: 0
    Write Latency: NaN ms.
    Pending Tasks: 0
    Bloom Filter False Positives: 0
    Bloom Filter False Ratio: 0.00000
    Bloom Filter Space Used: 336
    Compacted row minimum size: 7007507
    Compacted row maximum size: 8409007
    Compacted row mean size: 8409007

Could be using a lot of heap

Controllable by sstable_size_in_mb

⁍ bloom filters
⁍ sstable_size_in_mb

nodetool proxyhistograms

Offset  Read Latency  Write Latency  Range Latency
35      0             20             0
42      0             61             0
50      0             82             0
60      0             440            0
72      0             3416           0
86      0             17910          0
103     0             48675          0
124     1             97423          0
149     0             153109         0
179     2             186205         0
215     5             139022         0
258     134           44058          0
310     2656          60660          0
372     34698         742684         0
446     469515        7359351        0
535     3920391       31030588       0
642     9852708       33070248       0
770     4487796       9719615        0
924     651959        984889         0

⁍ units are microseconds
⁍ can give you a good idea of how much latency coordinator hops are costing you
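To turn a histogram like this into percentiles, walk the cumulative counts until the target fraction is reached. A small Python sketch using only the rows shown on the slide (the real output may continue past 924, so these are percentiles of the visible rows only):

```python
import bisect
from itertools import accumulate

# (offset in microseconds, read count, write count) from the slide.
rows = [
    (35, 0, 20), (42, 0, 61), (50, 0, 82), (60, 0, 440), (72, 0, 3416),
    (86, 0, 17910), (103, 0, 48675), (124, 1, 97423), (149, 0, 153109),
    (179, 2, 186205), (215, 5, 139022), (258, 134, 44058),
    (310, 2656, 60660), (372, 34698, 742684), (446, 469515, 7359351),
    (535, 3920391, 31030588), (642, 9852708, 33070248),
    (770, 4487796, 9719615), (924, 651959, 984889),
]

offsets = [r[0] for r in rows]
reads = [r[1] for r in rows]
writes = [r[2] for r in rows]


def percentile_offset(offsets, counts, fraction):
    """Smallest bucket offset whose cumulative count reaches `fraction` of the total."""
    cumulative = list(accumulate(counts))
    target = fraction * cumulative[-1]
    return offsets[bisect.bisect_left(cumulative, target)]


print(percentile_offset(offsets, reads, 0.50))   # 642
print(percentile_offset(offsets, reads, 0.99))   # 924
print(percentile_offset(offsets, writes, 0.50))  # 642
```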

nodetool compactionstats

al@node ~ $ nodetool compactionstats
pending tasks: 3
  compaction type  keyspace  column family    bytes compacted  bytes total  progress
  Compaction       hastur    gauge_archive    9819749801       16922291634  58.03%
  Compaction       hastur    counter_archive  12141850720      16147440484  75.19%
  Compaction       hastur    mark_archive     647389841        1475432590   43.88%
Active compaction remaining time : n/a

al@node ~ $ nodetool compactionstats
pending tasks: 3
  compaction type  keyspace  column family    bytes compacted  bytes total  progress
  Compaction       hastur    gauge_archive    10239806890      16922291634  60.51%
  Compaction       hastur    counter_archive  12544404397      16147440484  77.69%
  Compaction       hastur    mark_archive     1107897093       1475432590   75.09%
Active compaction remaining time : n/a


Stress Testing Tools

⁍ cassandra-stress
⁍ YCSB
⁍ Production
⁍ Terasort (DSE)
⁍ Homegrown

⁍ We mostly focus on cassandra-stress for burn-in of new clusters
⁍ Can quickly figure out the right setting for -Xmn
⁍ Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it’s fast!)
⁍ Consider writing custom benchmarks for your application patterns
  ⁍ sometimes it’s faster to write one than figure out how to make a generic tool do what you want

/etc/sysctl.conf

kernel.pid_max = 999999
fs.file-max = 1048576
vm.max_map_count = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.dirty_ratio = 10
vm.dirty_background_ratio = 2
vm.swappiness = 1

⁍ pid_max doesn’t fix anything, I just like it and have never had a problem with it
⁍ These are my starting point settings for nearly every system/application.
⁍ Generally safe for production.
⁍ vm.dirty*ratio can go big for fake fast writes; generally safe for Cassandra, but beware: you’re more likely to see FS/file corruption on power loss
⁍ But you will get latency spikes if you hit dirty_ratio (percentage of RAM), so don’t tune too low
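To get a feel for what those two dirty ratios mean in bytes, here is a small Python calculation for a 48GiB machine like the legacy nodes. It treats the ratios as a percentage of total RAM, as the notes do; the kernel actually applies them to available (free plus reclaimable) memory, so these are upper bounds.

```python
GIB = 1024 ** 3


def dirty_thresholds(ram_bytes: int, dirty_ratio: int, dirty_background_ratio: int):
    """Return (background_flush_bytes, blocking_bytes) for the given ratios."""
    background = ram_bytes * dirty_background_ratio // 100
    blocking = ram_bytes * dirty_ratio // 100
    return background, blocking


bg, hard = dirty_thresholds(48 * GIB, dirty_ratio=10, dirty_background_ratio=2)
print(f"background writeback starts near {bg / GIB:.2f} GiB of dirty pages")
print(f"writers start blocking near {hard / GIB:.2f} GiB of dirty pages")
```

With these settings background flushing kicks in at roughly 1 GiB of dirty pages, and writers only stall once nearly 5 GiB is dirty.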

/etc/rc.local

ra=$((2**14)) # 16k
ss=$(blockdev --getss /dev/sda)
# --setra counts 512-byte sectors, so this assumes --getss reports 512
blockdev --setra $(($ra / $ss)) /dev/sda

echo 256 > /sys/block/sda/queue/nr_requests
echo cfq > /sys/block/sda/queue/scheduler
echo 16384 > /sys/block/md7/md/stripe_cache_size

⁍ Lower readahead is better for latency on seeky workloads
⁍ More readahead will artificially increase your IOPS by reading a bunch of stuff you might not need!
⁍ nr_requests = number of IO structs the kernel will keep in flight, don’t go crazy
⁍ Deadline is best for raw throughput
⁍ CFQ supports cgroup priorities and is occasionally better for latency on SATA drives
⁍ Default stripe cache is 128. The increase seems to help MD RAID5 a lot.
⁍ Don’t forget to set readahead separately for MD RAID devices

JVM Args

-Xmx8G      leave it alone
-Xms8G      leave it alone
-Xmn1200M   100MiB * nCPU
-Xss180k    should be fine

-XX:+UseNUMA
numactl --interleave

⁍ In general, most people should leave the defaults alone. Especially the heap, which can cause no end of trouble if you do it wrong and cause GC pauses.
⁍ Don’t count hypercores.
⁍ Our biggest bang for the buck so far has been tuning newsize.
⁍ Have you ever seen “out of memory” when there’s plenty of memory available? You probably have a full NUMA node.
⁍ NUMA is how modern machines are built. Older Apache Cassandra distros had numactl --interleave, but this doesn’t seem to be in the DSE scripts. I’ve been running +UseNUMA for about a year and a half now and it seems to work fine.
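The -Xmn rule of thumb from the slide (100MiB per physical core) is easy to turn into a helper. A minimal sketch; `new_gen_flag` is a made-up name, and the 12-core count is an assumption chosen because it reproduces the 1200M on the slide.

```python
def new_gen_flag(physical_cores: int, mib_per_core: int = 100) -> str:
    """Build a -Xmn flag sized at mib_per_core MiB per physical core."""
    if physical_cores < 1:
        raise ValueError("need at least one physical core")
    return f"-Xmn{physical_cores * mib_per_core}M"


# Count physical cores only, not hyperthreads.
print(new_gen_flag(12))  # -Xmn1200M
```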

cgroups

Provides fine-grained control over Linux resources
⁍ Makes the Linux scheduler better
⁍ Lets you manage systems under extreme load
⁍ Useful on all Linux machines
⁍ Can choose between determinism and flexibility

⁍ Static resource assignment has better determinism / consistency
⁍ Weighted resources provide most of the advantage with a lot more flexibility

cgroups

cat >> /etc/default/cassandra <<'EOF'
# quoted 'EOF' keeps $cpucg and $$ literal until this file is sourced
cpucg=/sys/fs/cgroup/cpu/cassandra
mkdir -p $cpucg
cat $cpucg/../cpuset.mems >$cpucg/cpuset.mems
cat $cpucg/../cpuset.cpus >$cpucg/cpuset.cpus
echo 100 > $cpucg/cpu.shares
echo $$ > $cpucg/tasks
EOF

⁍ Automatically adds cassandra to a CG called “cassandra”
⁍ cpuset.mems can be used to limit NUMA nodes if you have huge machines
⁍ cpuset.cpus can restrict tasks to specific cores (like taskset, but stricter)
⁍ cpu.shares is just a number; set your own scale, 1-1000 works for me
⁍ Adding a task to a CG is as simple as adding its PID
⁍ Children are not necessarily added; you must add threads too if joining after startup (ps -efL)
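Since shares are just relative weights, what a group actually gets under contention is its weight divided by the sum of all competing weights. A quick Python illustration; the group names and numbers are hypothetical, and this only models the case where every group is busy at once.

```python
from typing import Dict


def cpu_fraction(shares: Dict[str, int]) -> Dict[str, float]:
    """Fraction of contended CPU time each group gets under weighted shares."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


groups = {"cassandra": 100, "spark": 50, "legacy_hadoop": 10}  # hypothetical
for name, frac in cpu_fraction(groups).items():
    print(f"{name}: {frac:.1%}")
```

When the other groups are idle, cassandra can still use the whole machine, which is the flexibility that static assignment gives up.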

Successful Experiment: btrfs

mkfs.btrfs -m raid10 -d raid0 /dev/sd[c-h]1
mount -o compress=lzo /dev/sdc1 /data

⁍ Like ZFS, btrfs can manage multiple disks without mdraid or LVM.
⁍ We have one production system in EC2 running btrfs flawlessly.
⁍ I’m told there are problems when the disk fills up so don’t do that.
⁍ noatime isn’t necessary on modern Linux; relatime is the default for xfs / ext4 and is good enough

Successful Experiment: ZFS on Linux

zpool create data raidz /dev/sd[c-h]
zfs create data/cassandra
zfs set compression=lzjb data/cassandra
zfs set atime=off data/cassandra
zfs set logbias=throughput data/cassandra

⁍ ZFS really is the ultimate filesystem.
⁍ RAIDZ is like RAID5 but totally different:
  ⁍ variable-width stripes
  ⁍ no write hole
⁍ VERY fast, plays well with C*
⁍ Stable! (so far)

Conclusions

⁍ Tuning is multi-dimensional
⁍ Production load is your most important benchmark
⁍ Lean on Cassandra, experiment!
⁍ No one metric tells the whole story


Questions?

⁍ Twitter: @AlTobey
⁍ Github: https://github.com/tobert
⁍ Email: [email protected] / [email protected]





