C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey



Ooyala has been using Apache Cassandra since version 0.4. Our data ingest volume has exploded since 0.4 and Cassandra has scaled along with us. Al will cover many topics from an operational perspective on how to manage, tune, and scale Cassandra in a production environment.

Transcript of C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Page 1: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

PRACTICE MAKES PERFECT: EXTREME CASSANDRA OPTIMIZATION

@AlTobey
Tech Lead, Compute and Data Services

#CASSANDRA13
Saturday, June 15, 13

I didn’t name this talk. The conference people did, but I like it a lot.

Page 2: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Outline

⁍ About me / Ooyala
⁍ How not to manage your Cassandra clusters
⁍ Make it suck less
⁍ How to be a heuristician
⁍ Tools of the trade
⁍ More Settings
⁍ Show & Tell

Page 3: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

@AlTobey

⁍ Tech Lead, Compute and Data Services at Ooyala, Inc.
⁍ C&D team is #devops: 3 ops, 3 eng, me
⁍ C&D team is #bdaas: Big Data as a Service
⁍ ~100 Cassandra nodes, expanding quickly
⁍ Obligatory: we’re hiring

⁍ I won’t go into devops today, but I’m happy to talk about it later.
⁍ 2 years at Ooyala: SRE -> TL Tools Team -> C&D
⁍ C&D builds BDaaS for Ooyala: fully managed Cassandra / Spark / Hadoop / Zookeeper / Kafka
⁍ 11 clusters, 5-36 nodes, working on something big
⁍ BEFORE: engineers deployed systems themselves, which was expensive and error-prone. AFTER: engineers use APIs and consult.

Page 4: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Ooyala

⁍ Founded in 2007
⁍ 230+ employees globally
⁍ 200M unique users, 110+ countries
⁍ Over 1 billion videos played per month
⁍ Over 2 billion analytic events per day

Page 5: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Ooyala & Cassandra

Ooyala has been using Cassandra since v0.4

Use cases:
⁍ Analytics data (real-time and batch)
⁍ Highly available K/V store
⁍ Time series data
⁍ Play head tracking (cross-device resume)
⁍ Machine Learning Data

Page 6: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Ooyala: Legacy Platform

[Architecture diagram. Labels: players, loggers, API, ABE Service, cassandra (cluster), hadoop (cluster), S3, read-modify-write, START HERE]

⁍ Ruby MR -- CDH3u4 -- 80 Dell Blades
⁍ Cassandra 0.4 --> 1.1 / DSE 3.x
⁍ 18x Dell r509, 48GiB RAM, 6x 600G 15k SAS / MD RAID 5 -- more on RAID later
⁍ We’ve scaled our data volume by 2x yearly for the last 4 years.

Page 7: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Albert     Tuesday 6   Wednesday 0
Evan       Tuesday 0   Wednesday 0
Frank      Tuesday 3   Wednesday 3
Kelvin     Tuesday 0   Wednesday 0
Krzysztof  Tuesday 0   Wednesday 0
Phillip    Tuesday 12  Wednesday 0

⁍ CF to track how much I expect my team at Ooyala to drink
⁍ Row keys are names
⁍ Column keys are days
⁍ Values are a count of drinks

Page 8: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Albert     Tuesday 2   Wednesday 0
Phillip    Tuesday 0   Wednesday 1

ssTable:
Albert     Tuesday 6   Wednesday 0
Evan       Tuesday 0   Wednesday 0
Frank      Tuesday 3   Wednesday 3
Kelvin     Tuesday 0   Wednesday 0
Krzysztof  Tuesday 0   Wednesday 0
Phillip    Tuesday 12  Wednesday 0

⁍ Next day, after a flush
⁍ I’m speaking so I decided to drink less
⁍ Phillip informs me that he has quit drinking

Page 9: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Albert     Tuesday 22  Wednesday 0

ssTable (newer):
Albert     Tuesday 2   Wednesday 0
Phillip    Tuesday 0   Wednesday 1

ssTable (older):
Albert     Tuesday 6   Wednesday 0
Evan       Tuesday 0   Wednesday 0
Frank      Tuesday 3   Wednesday 3
Kelvin     Tuesday 0   Wednesday 0
Krzysztof  Tuesday 0   Wednesday 0
Phillip    Tuesday 12  Wednesday 0

⁍ I’m drinking with all you people so I decide to add 20
⁍ read 2, add 20, write 22

Page 10: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Avoiding read-modify-write

cassandra13_drinks column family

ssTable:
Albert     Tuesday 22  Wednesday 0
Evan       Tuesday 0   Wednesday 0
Frank      Tuesday 3   Wednesday 3
Kelvin     Tuesday 0   Wednesday 0
Krzysztof  Tuesday 0   Wednesday 0
Phillip    Tuesday 0   Wednesday 1

⁍ After compaction & conflict resolution
⁍ Overwriting the same value is just fine! Works really well for some patterns such as time-series data
⁍ Separate read/write streams are handy for debugging, but not a big deal
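
The four slides above boil down to "newest write wins" per cell. Here is a minimal Python sketch of that idea, assuming each cell carries a write timestamp. The timestamps (1, 2, 3) and the `merge` helper are made up for illustration and are not Cassandra's actual compaction code.

```python
from typing import Dict, Tuple

# (row key, column key) -> (write timestamp, value)
# Timestamps are illustrative stand-ins for Cassandra's per-cell write times.
Cell = Tuple[int, int]
Table = Dict[Tuple[str, str], Cell]


def merge(*tables: Table) -> Table:
    """Keep the newest cell for every (row, column) pair across all tables."""
    merged: Table = {}
    for table in tables:
        for key, cell in table.items():
            if key not in merged or cell[0] > merged[key][0]:
                merged[key] = cell
    return merged


# Values taken from the slides; only the rows that change are shown.
older_sstable = {
    ("Albert", "Tuesday"): (1, 6),
    ("Phillip", "Tuesday"): (1, 12),
    ("Phillip", "Wednesday"): (1, 0),
}
newer_sstable = {
    ("Albert", "Tuesday"): (2, 2),
    ("Phillip", "Tuesday"): (2, 0),
    ("Phillip", "Wednesday"): (2, 1),
}
memtable = {("Albert", "Tuesday"): (3, 22)}

compacted = merge(older_sstable, newer_sstable, memtable)
```

The order in which tables are passed does not matter, because only the timestamp decides the winner. That is why a blind overwrite is cheap, while "read 2, add 20, write 22" costs an extra read.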

Page 11: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

2011: 0.6 ➜ 0.8

⁍ Migration is still a largely unsolved problem
⁍ Wrote a tool in Scala to scrub data and write via Thrift
⁍ Rebuilt indexes - faster than copying

[Diagram labels: hadoop, cassandra, GlusterFS P2P, cassandra, Thrift, Scala Map/Reduce]

⁍ Because of some legacy choices, we knew we had a bunch of expired tombstones
⁍ GlusterFS: userspace, ionice(1), fast & easy
⁍ Scala MR: sstabledump, etc. were TOO SLOW; the Scala MR job took only a week (with production running too!)

Page 12: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Changes: 0.6 ➜ 0.8

⁍ Cassandra 0.8
⁍ 24GiB heap
⁍ Sun Java 1.6 update
⁍ Linux 2.6.36
⁍ XFS on MD RAID5
⁍ Disabled swap or at least vm.swappiness=1

⁍ More on XFS settings & bugs later
⁍ Got significant improvements from RAID & readahead tuning (more later)
⁍ Al’s first rule of tuning databases: disable swap or GTFO
⁍ Fixed lots of applications by simply disabling swap

Page 13: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

2012: Capacity Increase

⁍ 18 nodes ➜ 36 nodes
⁍ DSE 3.0
⁍ Stale tombstones again!
⁍ No downtime!

[Diagram labels: cassandra, GlusterFS P2P, DSE 3.0, Thrift, Scala Map/Reduce]

⁍ I switched teams, working on Hastur, didn’t document enough, repairs were forgotten again
⁍ 60 day GC Grace Period expired ... 3 months ago
⁍ rsync is not enough for hardware moves: do rebuilds!
⁍ Use DSE Map/Reduce to isolate most of the load from production

Page 14: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

System Changes: Apache 1.0 ➜ DSE 3.0

⁍ DSE 3.0 installed via apt packages
⁍ Unchanged: heap, distro
⁍ Ran much faster this time!
⁍ Mistake: Moved to MD RAID 0
  Fix: RAID10 or RAID5, MD, ZFS, or btrfs
⁍ Mistake: Running on Ubuntu Lucid
  Fix: Ubuntu Precise

⁍ Previously deployed with Capistrano
⁍ DSE 3’s Hadoop is compiled on Debian 6 so native components will not load on 10.04’s libc
⁍ Still gradually rebuilding nodes from RAID0 ➜ RAID5 and Lucid ➜ Precise

Page 15: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Config Changes: Apache 1.0 ➜ DSE 3.0

⁍ Schema: compaction_strategy = LCS
⁍ Schema: bloom_filter_fp_chance = 0.1
⁍ Schema: sstable_size_in_mb = 256
⁍ Schema: compression_options = Snappy
⁍ YAML: compaction_throughput_mb_per_sec: 0

⁍ LCS is a huge improvement in operations life (no more major compactions)
⁍ Bloom filters were tipping over a 24GiB heap
⁍ With lots of data per node, sstable sizes in LCS must be MUCH bigger
  ⁍ > 100,000 open files slows everything down, especially startup
  ⁍ 256mb vs. 5mb is a 50x reduction in file count
⁍ Compaction can’t keep up: even huge throughput limits don’t work, so throttling must be disabled (0)
  ⁍ try to adjust heap, etc. so you’re flushing at nearly full memtables to reduce compaction needs
  ⁍ backreference RMW?
  ⁍ might be fixed in >= 1.2
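
The "50x" figure in the notes is simple division, sketched below in Python. The 1 TiB of data per node is an assumption chosen only to make the numbers concrete, and `sstable_count` is a hypothetical helper. Each SSTable also has several component files on disk, so real open-file counts are higher than this estimate.

```python
import math


def sstable_count(data_bytes: int, sstable_size_mb: int) -> int:
    """Approximate number of SSTables needed to hold data_bytes at a target size."""
    return math.ceil(data_bytes / (sstable_size_mb * 1024 * 1024))


# Assumption for illustration only: 1 TiB of data on one node.
data_per_node = 1024 ** 4

at_5mb = sstable_count(data_per_node, 5)      # the small size the notes compare against
at_256mb = sstable_count(data_per_node, 256)  # the value from the slide
reduction = at_5mb / at_256mb

print(f"{at_5mb} tables at 5 MB, {at_256mb} at 256 MB, about {reduction:.0f}x fewer")
```

At this assumed data size the 5 MB setting lands well past the 100,000-file mark the notes warn about.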

Page 16: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

2013: Datacenter Move

⁍ 36 nodes ➜ lots more nodes
⁍ As usual, no downtime!

[Diagram labels: DSE 3.1, DSE 3.1, replication]

⁍ Size omitted in published slides. I was asked not to publish yet, I will tweet, etc. in a couple weeks.
⁍ Wasn’t the original plan, but we save a lot of $$ by leaving old cage
⁍ Prep for next-generation architecture!

Page 17: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Coming Soon for Cassandra at Ooyala

Upcoming use cases:
⁍ Store every event from our players at full resolution
⁍ Cache code for our Spark job server
⁍ AMPLab Tachyon backend?

⁍ This is the intro for the next slide / diagram.
⁍ Considering Astyanax or CQL3 backend for Tachyon so we can contribute it back

Page 18: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Next Generation Architecture: Ooyala Event Store

[Architecture diagram. Labels: players, loggers, API, kafka, ingest, spark, job server, DSE 3.1, Tachyon?]

⁍ Look mom! No Hadoop! Remember what I said about latency?
⁍ But we’re not just running DSE on these machines. They’re running: DSE, Spark, KVM, and CDH3u4 (legacy)
⁍ Secret is cgroups!
⁍ Also, ZFS (later)

Page 19: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

There’s more to tuning than performance:

⁍ Security
⁍ Cost of Goods Sold
⁍ Operations / support
⁍ Developer happiness
⁍ Physical capacity (cpu/memory/network/disk)
⁍ Reliability / Resilience
⁍ Compromise

Shifting themes: philosophy of tuning
⁍ Security is always #1: The decision to disable security features is an important decision!
⁍ Example: EC2 instance sizes vary wildly in consistency and raw performance
⁍ Leveled vs. Size Tiered compaction, ZFS/LVM/MDRAID, bare metal vs. EC2
⁍ How much of this stuff do my devs need to know? How much work is it to get a new KS/CF?
⁍ Speed of node rebuilds, risk incurred by extended rebuilds, speed of repair
  ⁍ e.g. it takes a full 24 hours to repair each node in our 36-node cluster, so > 1 month to repair the cluster
⁍ Repeatable configurations: do future admins have to remember to do stuff or is it automated?
⁍ Look up “John Allspaw Resilience”
⁍ You only have access to EC2 or old hardware, your company has an OS/filesystem/settings policy (e.g. my $PREVIOUS_JOB CentOS 5.3 Linux 2.6.18 hardened distro)

There are others of course.
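
The repair figure in the notes can be checked with a few lines of Python. This sketch assumes nodes are repaired strictly one at a time, and it compares the result against the 60-day GC grace period mentioned on the 2012 slide. `full_repair_days` is a hypothetical helper name.

```python
def full_repair_days(nodes: int, hours_per_node: float) -> float:
    """Days needed to repair every node once, one node at a time."""
    return nodes * hours_per_node / 24.0


days = full_repair_days(nodes=36, hours_per_node=24)

# Assumption: the 60-day GC grace period quoted earlier in the talk still applies.
gc_grace_days = 60
fits_in_gc_grace = days < gc_grace_days

print(f"one full repair pass takes {days:.0f} days; fits in gc_grace: {fits_in_gc_grace}")
```

The margin is thin: a cluster twice the size, at the same per-node repair time, would no longer finish a pass inside the grace period.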

Page 20: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

I am not a scientist ... heuristician?

⁍ I’d love to be more scientific, but production comes first
⁍ Sometimes you have to make educated guesses
⁍ It’s not as difficult as it’s made out to be
⁍ Your brain is great at heuristics. Trust it.
⁍ Concentrate on bottlenecks
⁍ Make incremental changes
⁍ Read Malcolm Gladwell’s “Blink”

⁍ A truly scientific approach would take a lot of time and resources.
⁍ When under time pressure and things are slow, you have to move fast and measure “by the seat of your pants”
⁍ Be educated, do research, and make sensible decisions without months of testing; be prepared to do better next time
⁍ It’s actually pretty fast and easy this way!
⁍ More on what tools I use later on.

Page 21: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

The OODA Loop

Observe, Orient, Decide, Act:
⁍ Observe the system in production under load
⁍ Make small, safe changes
⁍ Observe
⁍ Commit or Revert

⁍ Understand YOUR production workload first!
⁍ Look at Opscenter latency numbers
⁍ cl-netstat.pl (later)
⁍ Examples:
  ⁍ Changing /proc/sys/vm/dirty_background_ratio is fairly safe and shows results quickly.
  ⁍ Some network settings can take your node offline temporarily or require manual intervention.
  ⁍ Changing the compaction scheme requires a lot of time and has other implications.
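
The observe / change / observe / commit-or-revert cycle can be written down as a tiny Python helper. This is only a sketch: `try_change`, the 5% improvement threshold, and the fake latency numbers are assumptions for illustration. On a real node, `measure` would read something like a latency figure from your monitoring and `apply` would write the new setting.

```python
from typing import Callable


def try_change(measure: Callable[[], float],
               apply: Callable[[], None],
               revert: Callable[[], None],
               min_improvement: float = 0.05) -> bool:
    """Apply one small change and keep it only if the metric (lower is better) improves."""
    before = measure()          # observe
    apply()                     # act on one small, safe change
    after = measure()           # observe again
    if after <= before * (1.0 - min_improvement):
        return True             # commit
    revert()                    # otherwise put things back
    return False


# Toy stand-in for a node; the latency values are invented so the sketch runs.
node = {"dirty_background_ratio": 10}


def fake_p99_ms() -> float:
    return 40.0 if node["dirty_background_ratio"] == 2 else 50.0


kept = try_change(
    measure=fake_p99_ms,
    apply=lambda: node.update(dirty_background_ratio=2),
    revert=lambda: node.update(dirty_background_ratio=10),
)
```

Keeping `revert` as an explicit argument forces you to know how to undo a change before you make it, which is the point of "small, safe changes".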

Page 22: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Testing Shiny Things

⁍ Like kernels
⁍ And Linux distributions
⁍ And ZFS
⁍ And btrfs
⁍ And JVMs & parameters
⁍ Test them in production!

⁍ Testing stuff in a lab is fine, if you have one and you have the time.
⁍ Take (responsible) advantage of Cassandra’s resilience:
  ⁍ test things you think should work well in production on ONE node or a couple of nodes well spaced out.

Page 23: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Testing Shiny Things: In Production

[Diagram: cluster nodes labeled ext4 (x5), ZFS, kernel upgrade, btrfs]

⁍ Use your staging / non-prod environments first if you have them (some people don’t and that’s unfortunate but it happens)
⁍ Test things you think should work well in production on ONE node or a couple of nodes well spaced out.

Page 24: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Brendan Gregg’s Tool Chart

http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x

⁍ Brendan Gregg’s chart is so good, I just copied it for now.
⁍ Original: http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x
⁍ I’ll briefly talk about a few

Page 25: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

dstat -lrvn 10

⁍ Just like vmstat but prettier and does way more
⁍ 35 lines of output = about 5 minutes of 10s snapshots
⁍ What’s interesting?
  ⁍ IO wait starting at line 5, but all numbers are going up, so this is probably during a map/reduce job
  ⁍ IO wait is high, but disk throughput isn’t impressive at all
  ⁍ ~2 blocked “procs” (which includes threads)

Not bothering to tune this right now because production latency is fine.

Page 26: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

cl-netstat.pl

https://github.com/tobert/perl-ssh-tools

⁍ Home grown.
⁍ Requires no software on the target machines except for SSH.
⁍ Recent Net::SSH2 supports ssh-agent

Page 27: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

iostat -x 1

⁍ Mostly I just look at the *wait numbers here.
⁍ Great for finding a bad disk with high latency.
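
Scanning the *wait columns can be automated. The Python sketch below reads the header row to find every column whose name ends in "await" and flags devices above a threshold. The `SAMPLE` text is invented for illustration (it is not output from a real host), the 50 ms threshold is arbitrary, and column names differ between sysstat versions, so treat this as a starting point only.

```python
def slow_devices(iostat_text: str, threshold_ms: float = 50.0):
    """Return (device, column, value) for every *await column above threshold_ms."""
    header = None
    flagged = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            header = fields            # remember the column names
            continue
        if header is None or len(fields) != len(header):
            continue                   # skip CPU summary lines and anything unexpected
        for name, value in zip(header[1:], fields[1:]):
            if name.endswith("await") and float(value) > threshold_ms:
                flagged.append((fields[0], name, float(value)))
    return flagged


# Invented sample in the shape of `iostat -x` output; sdb plays the bad disk.
SAMPLE = """\
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 1.00 10.00 20.00 160.00 320.00 32.00 0.10 3.20 2.10 3.75 0.50 1.50
sdb 0.00 1.00 10.00 20.00 160.00 320.00 32.00 9.80 310.00 290.00 320.00 9.00 99.00
"""

bad = slow_devices(SAMPLE)
```

In practice you would feed it captured output, for example the text saved from `iostat -x 1 5`.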

Page 28: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

htop

⁍ Per-CPU utilization bars are nice
⁍ Displays threads by default (hit “H” in plain top)
⁍ Very configurable!
⁍ For example: 1 thread at 100% CPU is usually the GC

Page 29: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

jconsole

⁍ Looks like I can reduce the heap size on this cluster, but should probably increase -Xmn to 100MB * (physical cores), not counting hypercores

Page 30: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

opscenter

⁍ It looks better on a high-resolution display ;)

Page 31: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

nodetool ring

10.10.10.10  Analytics  rack1  Up  Normal  47.73 MB   1.72%   101204669472175663702469172037896580098
10.10.10.10  Analytics  rack1  Up  Normal  63.94 MB   0.86%   102671403812352122596707855690619718940
10.10.10.10  Analytics  rack1  Up  Normal  85.73 MB   0.86%   104138138152528581490946539343342857782
10.10.10.10  Analytics  rack1  Up  Normal  47.87 MB   0.86%   105604872492705040385185222996065996624
10.10.10.10  Analytics  rack1  Up  Normal  39.73 MB   0.86%   107071606832881499279423906648789135466
10.10.10.10  Analytics  rack1  Up  Normal  40.74 MB   1.75%   110042394566257506011458285920000334950
10.10.10.10  Analytics  rack1  Up  Normal  40.08 MB   2.20%   113781420866907675791616368030579466301
10.10.10.10  Analytics  rack1  Up  Normal  56.19 MB   3.45%   119650151395618797017962053073524524487
10.10.10.10  Analytics  rack1  Up  Normal  214.88 MB  11.62%  139424886777089715561324792149872061049
10.10.10.10  Analytics  rack1  Up  Normal  214.29 MB  2.45%   143588210871399618110700028431440799305
10.10.10.10  Analytics  rack1  Up  Normal  158.49 MB  1.76%   146577368624928021690175250344904436129
10.10.10.10  Analytics  rack1  Up  Normal  40.3 MB    0.92%   148140168357822348318107048925037023042

⁍ hotspots
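
One rough way to spot a hotspot in output like this is to compare each node's ownership against the median. The Python sketch below uses the ownership percentages from the slide; the "3x the median" cutoff and the `hotspots` helper are assumptions, not a rule from the talk.

```python
from statistics import median

# Ownership percentages in the order they appear on the slide.
owns = [1.72, 0.86, 0.86, 0.86, 0.86, 1.75, 2.20, 3.45, 11.62, 2.45, 1.76, 0.92]


def hotspots(ownership, factor: float = 3.0):
    """Return (row index, percent) for rows owning more than factor x the median."""
    typical = median(ownership)
    return [(i, pct) for i, pct in enumerate(ownership) if pct > factor * typical]


suspects = hotspots(owns)
print(suspects)
```

With these numbers only the 11.62% row stands out. It is also the row carrying the largest load (214.88 MB), so ownership and data size point at the same node.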

Page 32: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

nodetool cfstats

Keyspace: gostress
  Read Count: 0
  Read Latency: NaN ms.
  Write Count: 0
  Write Latency: NaN ms.
  Pending Tasks: 0
    Column Family: stressful
    SSTable count: 1
    Space used (live): 32981239
    Space used (total): 32981239
    Number of Keys (estimate): 128
    Memtable Columns Count: 0
    Memtable Data Size: 0
    Memtable Switch Count: 0
    Read Count: 0
    Read Latency: NaN ms.
    Write Count: 0
    Write Latency: NaN ms.
    Pending Tasks: 0
    Bloom Filter False Positives: 0
    Bloom Filter False Ratio: 0.00000
    Bloom Filter Space Used: 336
    Compacted row minimum size: 7007507
    Compacted row maximum size: 8409007
    Compacted row mean size: 8409007

Callouts on the slide:
⁍ Could be using a lot of heap
⁍ Controllable by sstable_size_in_mb

⁍ bloom filters
⁍ sstable_size_in_mb

Page 33: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

nodetool proxyhistograms

Offset  Read Latency  Write Latency  Range Latency
35      0             20             0
42      0             61             0
50      0             82             0
60      0             440            0
72      0             3416           0
86      0             17910          0
103     0             48675          0
124     1             97423          0
149     0             153109         0
179     2             186205         0
215     5             139022         0
258     134           44058          0
310     2656          60660          0
372     34698         742684         0
446     469515        7359351        0
535     3920391       31030588       0
642     9852708       33070248       0
770     4487796       9719615        0
924     651959        984889         0

⁍ units are microseconds
⁍ can give you a good idea of how much latency coordinator hops are costing you
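
A histogram like this turns into percentiles by walking the buckets until the running count passes the target. The Python sketch below does that for the rows shown on the slide. The slide appears to show only part of the histogram, so the figures are approximate, and `percentile` is a hypothetical helper rather than anything nodetool provides.

```python
# (offset in microseconds, read count, write count), copied from the slide.
ROWS = [
    (35, 0, 20), (42, 0, 61), (50, 0, 82), (60, 0, 440), (72, 0, 3416),
    (86, 0, 17910), (103, 0, 48675), (124, 1, 97423), (149, 0, 153109),
    (179, 2, 186205), (215, 5, 139022), (258, 134, 44058), (310, 2656, 60660),
    (372, 34698, 742684), (446, 469515, 7359351), (535, 3920391, 31030588),
    (642, 9852708, 33070248), (770, 4487796, 9719615), (924, 651959, 984889),
]


def percentile(buckets, pct: float) -> int:
    """Return the bucket offset at which the cumulative count reaches pct percent."""
    total = sum(count for _, count in buckets)
    target = total * pct / 100.0
    running = 0
    for offset, count in buckets:
        running += count
        if running >= target:
            return offset
    return buckets[-1][0]


reads = [(offset, read) for offset, read, _ in ROWS]
writes = [(offset, write) for offset, _, write in ROWS]

read_p50 = percentile(reads, 50)
read_p95 = percentile(reads, 95)
write_p50 = percentile(writes, 50)
print(f"read p50 <= {read_p50} us, read p95 <= {read_p95} us, write p50 <= {write_p50} us")
```

Each result is the upper edge of a bucket, so read it as "at most this many microseconds", not as an exact latency.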

Page 34: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

nodetool compactionstats

al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type  keyspace  column family    bytes compacted  bytes total   progress
Compaction       hastur    gauge_archive    9819749801       16922291634   58.03%
Compaction       hastur    counter_archive  12141850720      16147440484   75.19%
Compaction       hastur    mark_archive     647389841        1475432590    43.88%
Active compaction remaining time : n/a

al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type  keyspace  column family    bytes compacted  bytes total   progress
Compaction       hastur    gauge_archive    10239806890      16922291634   60.51%
Compaction       hastur    counter_archive  12544404397      16147440484   77.69%
Compaction       hastur    mark_archive     1107897093       1475432590    75.09%
Active compaction remaining time : n/a

Page 35: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Stress Testing Tools

⁍ cassandra-stress
⁍ YCSB
⁍ Production
⁍ Terasort (DSE)
⁍ Homegrown

⁍ We mostly focus on cassandra-stress for burn-in of new clusters
⁍ Can quickly figure out the right setting for -Xmn
⁍ Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it’s fast!)
⁍ Consider writing custom benchmarks for your application patterns
  ⁍ sometimes it’s faster to write one than figure out how to make a generic tool do what you want

Page 36: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

/etc/sysctl.conf

kernel.pid_max = 999999
fs.file-max = 1048576
vm.max_map_count = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.dirty_ratio = 10
vm.dirty_background_ratio = 2
vm.swappiness = 1

⁍ pid_max doesn’t fix anything, I just like it and have never had a problem with it
⁍ These are my starting point settings for nearly every system/application.
⁍ Generally safe for production.
⁍ vm.dirty*ratio can go big for fake fast writes, generally safe for Cassandra, but beware you’re more likely to see FS/file corruption on power loss
⁍ but you will get latency spikes if you hit dirty_ratio (percentage of RAM), so don’t tune too low
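
To get a feel for what those two percentages mean in bytes, here is a small Python sketch. It uses the 48GiB of RAM quoted for the legacy nodes earlier in the talk and treats the ratios as a share of total RAM, as the notes do. As far as I know the kernel actually applies them to available (free plus reclaimable) memory, so the real thresholds will be somewhat lower. `dirty_thresholds` is a hypothetical helper.

```python
def dirty_thresholds(ram_bytes: int, dirty_ratio: int = 10, dirty_background_ratio: int = 2):
    """Return (background, blocking) dirty-page thresholds in bytes, as a share of RAM."""
    background = ram_bytes * dirty_background_ratio // 100
    blocking = ram_bytes * dirty_ratio // 100
    return background, blocking


ram = 48 * 1024 ** 3  # 48 GiB, the RAM size quoted for the legacy nodes
background, blocking = dirty_thresholds(ram)

gib = 1024 ** 3
print(f"background flush starts near {background / gib:.2f} GiB of dirty pages")
print(f"writers start blocking near {blocking / gib:.2f} GiB of dirty pages")
```

The gap between the two numbers is the room the kernel has to flush in the background before writers stall, which is why setting dirty_ratio too low invites latency spikes.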

Page 37: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

/etc/rc.local

ra=$((2**14)) # 16k
ss=$(blockdev --getss /dev/sda)
blockdev --setra $(($ra / $ss)) /dev/sda

echo 256 > /sys/block/sda/queue/nr_requests
echo cfq > /sys/block/sda/queue/scheduler
echo 16384 > /sys/block/md7/md/stripe_cache_size

⁍ Lower readahead is better for latency on seeky workloads
⁍ More readahead will artificially increase your IOPS by reading a bunch of stuff you might not need!
⁍ nr_requests = number of IO structs the kernel will keep in flight, don’t go crazy
⁍ Deadline is best for raw throughput
⁍ CFQ supports cgroup priorities and is occasionally better for latency on SATA drives
⁍ Default stripe cache is 128. The increase seems to help MD RAID5 a lot.
⁍ Don’t forget to set readahead separately for MD RAID devices
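
The readahead arithmetic in the snippet is a bytes-to-sectors conversion, shown here in Python. The 512-byte default is an assumption about the device: the slide reads the sector size with `blockdev --getss`, and as far as I know `blockdev --setra` counts in 512-byte sectors, so double-check the units on drives with larger sectors. `readahead_sectors` is a hypothetical helper.

```python
def readahead_sectors(readahead_bytes: int, sector_bytes: int = 512) -> int:
    """Convert a readahead size in bytes to the sector count passed to blockdev --setra."""
    if readahead_bytes % sector_bytes:
        raise ValueError("readahead must be a whole number of sectors")
    return readahead_bytes // sector_bytes


ra = 2 ** 14                      # 16 KiB, as in the rc.local snippet
sectors = readahead_sectors(ra)   # assumes 512-byte sectors

print(f"blockdev --setra {sectors} /dev/sda")
```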

Page 38: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

JVM Args

-Xmx8G      leave it alone
-Xms8G      leave it alone
-Xmn1200M   100MiB * nCPU
-Xss180k    should be fine

-XX:+UseNUMA
numactl --interleave

⁍ In general, most people should leave the defaults alone. Especially the heap, which can cause no end of trouble if you do it wrong and cause GC pauses.
⁍ Don’t count hypercores.
⁍ Our biggest bang for the buck so far has been tuning newsize.
⁍ Have you ever seen “out of memory” when there’s plenty of memory available? You probably have a full NUMA node.
⁍ NUMA is how modern machines are built. Older Apache Cassandra distros had numactl --interleave, but this doesn’t seem to be in the DSE scripts. I’ve been running +UseNUMA for about a year and a half now and it seems to work fine.
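
The "100MiB * nCPU" rule is easy to turn into a flag generator. In this Python sketch the 12 physical cores are an assumption inferred from the -Xmn1200M on the slide, and `jvm_flags` is a hypothetical helper rather than part of any Cassandra script.

```python
def young_gen_mb(physical_cores: int, mb_per_core: int = 100) -> int:
    """New generation size using the 100 MiB per physical core rule of thumb."""
    return physical_cores * mb_per_core


def jvm_flags(heap_gb: int, physical_cores: int) -> list:
    """Build the heap flags from the slide; min and max heap are kept equal."""
    return [
        f"-Xms{heap_gb}G",
        f"-Xmx{heap_gb}G",
        f"-Xmn{young_gen_mb(physical_cores)}M",
    ]


# Assumption: 12 physical cores (hyperthreads not counted), which gives -Xmn1200M.
flags = jvm_flags(heap_gb=8, physical_cores=12)
print(" ".join(flags))
```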

Page 39: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

cgroups

Provides fine-grained control over Linux resources
⁍ Makes the Linux scheduler better
⁍ Lets you manage systems under extreme load
⁍ Useful on all Linux machines
⁍ Can choose between determinism and flexibility

⁍ static resource assignment has better determinism / consistency
⁍ weighted resources provide most of the advantage with a lot more flexibility

Page 40: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

cgroups

# Quote EOF so $cpucg and $$ are expanded when the file is sourced, not now
cat >> /etc/default/cassandra <<'EOF'
# assumes the cpu and cpuset controllers are mounted together at this path
cpucg=/sys/fs/cgroup/cpu/cassandra
mkdir -p $cpucg
# cpuset.mems and cpuset.cpus must be populated before tasks can join
cat $cpucg/../cpuset.mems >$cpucg/cpuset.mems
cat $cpucg/../cpuset.cpus >$cpucg/cpuset.cpus
echo 100 > $cpucg/cpu.shares
echo $$ > $cpucg/tasks
EOF

⁍ automatically adds cassandra to a CG called “cassandra”
⁍ cpuset.mems can be used to limit NUMA nodes if you have huge machines
⁍ cpuset.cpus can restrict tasks to specific cores (like taskset, stricter)
⁍ shares is just a number, set your own scale, 1-1000 works for me
⁍ adding a task to a CG is as simple as adding its PID
⁍ children are not necessarily added, you must add threads too if joining after startup (ps -efL)
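
The last note, about adding threads when a process joins a cgroup after startup, can be sketched in Python. Linux lists a process's threads under /proc/<pid>/task, and a cgroup v1 `tasks` file takes one ID per write. The helper names are hypothetical, and the demo runs against a fake /proc tree in a temporary directory so it needs neither root nor a real cgroup mount.

```python
import os
import tempfile


def thread_ids(pid: int, proc_root: str = "/proc") -> list:
    """List the thread IDs of a process from <proc_root>/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sorted(int(name) for name in os.listdir(task_dir))


def join_cgroup(pid: int, tasks_file: str, proc_root: str = "/proc") -> list:
    """Write every thread ID of pid to a cgroup v1 tasks file, one ID per write."""
    tids = thread_ids(pid, proc_root)
    for tid in tids:
        with open(tasks_file, "a") as handle:
            handle.write(f"{tid}\n")
    return tids


# Demo with made-up IDs against a fake /proc tree.
with tempfile.TemporaryDirectory() as tmp:
    for tid in (4242, 4243, 4250):
        os.makedirs(os.path.join(tmp, "4242", "task", str(tid)))
    tasks_path = os.path.join(tmp, "tasks")
    moved = join_cgroup(4242, tasks_path, proc_root=tmp)
    with open(tasks_path) as handle:
        written = handle.read().split()
```

On a real host the call would look like `join_cgroup(pid, "/sys/fs/cgroup/cpu/cassandra/tasks")`, run as root. Threads created between listing and writing would be missed, so a second pass is a sensible precaution.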

Page 41: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Successful Experiment: btrfs

mkfs.btrfs -m raid10 -d raid0 /dev/sd[c-h]1
mount -o compress=lzo /dev/sdc1 /data

⁍ Like ZFS, btrfs can manage multiple disks without mdraid or LVM.
⁍ We have one production system in EC2 running btrfs flawlessly.
⁍ I’m told there are problems when the disk fills up so don’t do that.
⁍ noatime isn’t necessary on modern Linux, relatime is the default for xfs / ext4 and is good enough

Page 42: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Successful Experiment: ZFS on Linux

zpool create data raidz /dev/sd[c-h]
zfs create data/cassandra
zfs set compression=lzjb data/cassandra
zfs set atime=off data/cassandra
zfs set logbias=throughput data/cassandra

⁍ ZFS really is the ultimate filesystem.
⁍ RAIDZ is like RAID5 but totally different:
  ⁍ variable-width stripes
  ⁍ no write hole
⁍ VERY fast, plays well with C*
⁍ Stable! (so far)

Page 43: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Conclusions

⁍ Tuning is multi-dimensional
⁍ Production load is your most important benchmark
⁍ Lean on Cassandra, experiment!
⁍ No one metric tells the whole story

Page 44: C* Summit 2013: Practice Makes Perfect: Extreme Cassandra Optimization by Albert Tobey

Questions?

⁍ Twitter: @AlTobey
⁍ Github: https://github.com/tobert
⁍ Email: [email protected] / [email protected]