Managing Cassandra at Scale by Al Tobey

Transcript of Managing Cassandra at Scale by Al Tobey

Page 1: Managing Cassandra at Scale by Al Tobey

©2014 DataStax

@AlTobey Open Source Mechanic | Open source evangelist for Apache Cassandra at DataStax

Managing Cassandra : SCALE 12x


Obsessed with infrastructure my whole life. See distributed systems everywhere.

Page 2: Managing Cassandra at Scale by Al Tobey

Five years of Cassandra

Timeline: releases 0.1, 0.3, 0.6, 0.7, 1.0, 1.2 … 2.0 and DSE, spanning years 0 through 5, starting Jul-08.

Page 3: Managing Cassandra at Scale by Al Tobey
Page 4: Managing Cassandra at Scale by Al Tobey

Why Cassandra?

Or any non-relational?

Page 5: Managing Cassandra at Scale by Al Tobey

アルトビー (Al Tobey)

Leo Rufus

My boys. I did not have permission from my wife to use her image ;)

Page 6: Managing Cassandra at Scale by Al Tobey

Less Tolerant

For the record. Can you spell HTTP?

More and more services online. And they must be available.

Page 7: Managing Cassandra at Scale by Al Tobey

Why do 100 million internet users in Japan choose Cassandra?

100,000,000 internet users in Japan. Across the world … B2B, video conferencing, shopping, entertainment.

Page 8: Managing Cassandra at Scale by Al Tobey

Traditional solutions…

…may not be a good fit

Page 9: Managing Cassandra at Scale by Al Tobey

Client-server

All your wildest dreams will come true.

Client-server database architecture. Obsolete. The original “funnel-shaped” architecture.

Page 10: Managing Cassandra at Scale by Al Tobey

3-tier

Client - client - server database architecture. Still suitable for small applications, e.g. LAMP, RoR.

Page 11: Managing Cassandra at Scale by Al Tobey

3-tier + caching

master / slave / slave

cache

More complex. Cache coherency is a hard problem. Cascading failures are common. Next: out with the funnel, in with the ring.

Page 12: Managing Cassandra at Scale by Al Tobey

Webscale

Outer ring: clients (cell phones, etc.). Middle ring: application servers. Inner ring: Cassandra servers. Serving millions of clients with mere hundreds or thousands of nodes requires a different approach to applications!

Page 13: Managing Cassandra at Scale by Al Tobey

When it Rains

scale out … but at a cost

Page 14: Managing Cassandra at Scale by Al Tobey

A new plan

Page 15: Managing Cassandra at Scale by Al Tobey

Dynamo Paper (2007)

• How do we build a data store that is:

• Reliable

• Performant

• “Always On”

• Nothing new and shiny

Evolutionary. Real. Computer Science.

Also the basis for Riak and Voldemort

Page 16: Managing Cassandra at Scale by Al Tobey

BigTable (2006)

• Richer data model

• 1 key. Lots of values

• Fast sequential access

• 38 Papers cited

Page 17: Managing Cassandra at Scale by Al Tobey

Cassandra (2008)

• Distributed features of Dynamo

• Data Model and storage from BigTable

• On February 17, 2010 it graduated to a top-level Apache project

Page 18: Managing Cassandra at Scale by Al Tobey

Cassandra - More than one server

• All nodes participate in a cluster

• Shared nothing

• Add or remove as needed

• More capacity? Add a server


Page 19: Managing Cassandra at Scale by Al Tobey


Chart: throughput (ops/sec) for Cassandra, HBase, Redis, and MySQL in the VLDB benchmark (RWS).

Page 20: Managing Cassandra at Scale by Al Tobey

Cassandra - Fully Replicated

• Client writes local

• Data syncs across WAN

• Replication per Data Center

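Per-data-center replication is configured on the keyspace. A minimal CQL sketch, assuming made-up keyspace and data center names (the DC names must match what the cluster's snitch reports):

CREATE KEYSPACE employees
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_west': 3,
    'tokyo': 3
  };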

Page 21: Managing Cassandra at Scale by Al Tobey

Read-Modify-Write

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

Before: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null

After:  EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

This might be what it looks like from SQL / CQL, but …!

Page 22: Managing Cassandra at Scale by Al Tobey

Read-Modify-Write

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

TNSTAAFL: there is no such thing as a free lunch

After:  EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

Before: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null

RDBMS

TNSTAAFL … If you’re lucky, the cell is in cache. Otherwise, it’s a disk access to read, another to write.

Page 23: Managing Cassandra at Scale by Al Tobey

Eventual Consistency

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

After:  EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

Before: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null

Coordinator

Explain distributed RMW. More complicated. Will talk about how it’s abstracted in CQL later.

Page 24: Managing Cassandra at Scale by Al Tobey

Eventual Consistency

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

After:  EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

Before: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null

Coordinator

read

write

Memory replication on write, depending on RF, usually RF=3. Reads AND writes remain available through partitions. Hinted handoff.
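
How many replicas must acknowledge a read or write is chosen per request, not per cluster. A minimal cqlsh sketch reusing the Employees example from these slides (LOCAL_QUORUM is just one common choice):

CONSISTENCY LOCAL_QUORUM;
UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;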

Page 25: Managing Cassandra at Scale by Al Tobey


Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Albert     Tuesday 6    Wednesday 0
Evan       Tuesday 0    Wednesday 0
Frank      Tuesday 3    Wednesday 3
Kelvin     Tuesday 0    Wednesday 0
Krzysztof  Tuesday 0    Wednesday 0
Phillip    Tuesday 12   Wednesday 0

⁍ CF to track how much I expect my team at Ooyala to drink ⁍ Row keys are names ⁍ Column keys are days ⁍ Values are a count of drinks

Page 26: Managing Cassandra at Scale by Al Tobey


Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Al         Tuesday 2    Wednesday 0
Phillip    Tuesday 0    Wednesday 1

ssTable:
Albert     Tuesday 6    Wednesday 0
Evan       Tuesday 0    Wednesday 0
Frank      Tuesday 3    Wednesday 3
Kelvin     Tuesday 0    Wednesday 0
Krzysztof  Tuesday 0    Wednesday 0
Phillip    Tuesday 12   Wednesday 0

⁍ Next day, after a flush ⁍ I’m speaking so I decided to drink less ⁍ Phillip informs me that he has quit drinking

Page 27: Managing Cassandra at Scale by Al Tobey


Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Albert     Tuesday 22   Wednesday 0

ssTable:
Albert     Tuesday 2    Wednesday 0
Phillip    Tuesday 0    Wednesday 1

ssTable:
Albert     Tuesday 6    Wednesday 0
Evan       Tuesday 0    Wednesday 0
Frank      Tuesday 3    Wednesday 3
Kelvin     Tuesday 0    Wednesday 0
Krzysztof  Tuesday 0    Wednesday 0
Phillip    Tuesday 12   Wednesday 0

⁍ I’m drinking with all you people so I decide to add 20 ⁍ read 2, add 20, write 22

Page 28: Managing Cassandra at Scale by Al Tobey


Avoiding read-modify-write


cassandra13_drinks column family

ssTable

Albert Tuesday 22 Wednesday 0

Evan Tuesday 0 Wednesday 0

Frank Tuesday 3 Wednesday 3

Kelvin Tuesday 0 Wednesday 0

Krzysztof Tuesday 0 Wednesday 0

Phillip Tuesday 0 Wednesday 1

⁍ After compaction & conflict resolution ⁍ Overwriting the same value is just fine! Works really well for some patterns such as time-series data ⁍ Separate read/write streams handy for debugging, but not a big deal
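
In CQL terms, the column family above could be modeled roughly as follows; the schema is an assumption (the slides show the storage view, not a CQL definition), and the “read 2, add 20, write 22” step is just a blind overwrite:

CREATE TABLE cassandra13_drinks (
  name varchar,
  day varchar,
  drinks int,
  PRIMARY KEY (name, day)
);

UPDATE cassandra13_drinks SET drinks = 22 WHERE name = 'Albert' AND day = 'Tuesday';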

Page 29: Managing Cassandra at Scale by Al Tobey

Overwriting

CREATE TABLE host_lookup (
  name varchar,
  id uuid,
  PRIMARY KEY(name)
);

INSERT INTO host_lookup (name, id) VALUES ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

INSERT INTO host_lookup (name, id) VALUES ('tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

INSERT INTO host_lookup (name, id) VALUES ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

SELECT id FROM host_lookup WHERE name='tobert.org';

Beware of expensive compaction. Best for: small indexes, lookup tables. Compaction handles RMW at the storage level in the background. Under heavy writes, clock synchronization is very important to avoid timestamp collisions. In practice, this isn’t a problem very often and even when it goes wrong, not much harm done.
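
Those overwrite conflicts are resolved by write timestamp, which is why clock synchronization matters. A client that needs to control ordering itself can supply the timestamp explicitly; a minimal sketch against the host_lookup table above (the microseconds-since-epoch value is arbitrary):

INSERT INTO host_lookup (name, id)
VALUES ('tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77)
USING TIMESTAMP 1390567890000000;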

Page 30: Managing Cassandra at Scale by Al Tobey

Key/Value

CREATE TABLE keyval (
  key VARCHAR,
  value blob,
  PRIMARY KEY(key)
);

INSERT INTO keyval (key, value) VALUES (?, ?);

SELECT value FROM keyval WHERE key=?;

e.g. memcached. Don’t do this. But it works when you really need it.

Page 31: Managing Cassandra at Scale by Al Tobey

Journaling / Logging / Time-series

CREATE TABLE tsdb (
  time_bucket timestamp,
  time timestamp,
  value blob,
  PRIMARY KEY(time_bucket, time)
);

INSERT INTO tsdb (time_bucket, time, value) VALUES (
  '2014-10-24',            -- 1-day bucket (UTC)
  '2014-10-24T12:12:12Z',  -- ALWAYS USE UTC
  '{"foo": "bar"}'
);

Oversimplified; use normalization over blobs whenever possible. ALWAYS USE UTC :)
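
Because time_bucket is the partition key and time is the clustering column, a whole bucket is stored together and can be sliced by time range. A minimal sketch against the tsdb table above (the window is arbitrary):

SELECT time, value FROM tsdb
 WHERE time_bucket = '2014-10-24'
   AND time >= '2014-10-24T00:00:00Z'
   AND time < '2014-10-24T06:00:00Z';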

Page 32: Managing Cassandra at Scale by Al Tobey

Journaling / Logging / Time-series

{"“2014(01(24”"=>"{""""“2014(01(24T12:12:12Z”"=>"{""""""""‘{“foo”:"“bar”}’""""}}

2014(01(24 2014(01(24T12:12:12Z{“key”:"“value”}

2014(01(25 2014(01(25T13:13:13Z{“key”:"“value”}

2014(01(24T21:21:21Z{“key”:"“value”}

Oversimplified; use normalization over blobs whenever possible. ALWAYS USE UTC :)

Page 33: Managing Cassandra at Scale by Al Tobey

Cassandra Collections

CREATE TABLE posts (
  id uuid,
  body varchar,
  created timestamp,
  authors set<varchar>,
  tags set<varchar>,
  PRIMARY KEY(id)
);

INSERT INTO posts (id, body, created, authors, tags) VALUES (
  ea4aba7d-9344-4d08-8ca5-873aa1214068,
  'アルトビーの犬はばかね',
  'now',
  {'アルトビー', 'ィオートビー'},
  {'dog', 'silly', '犬', 'ばか'}
);

Quick story about 犬ばかね (“silly dog”). Sets & maps are CRDTs, safe to modify.
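
Collection columns can be mutated in place without reading first. A minimal sketch against the posts table above (the added tag is made up):

UPDATE posts SET tags = tags + {'cassandra'}
 WHERE id = ea4aba7d-9344-4d08-8ca5-873aa1214068;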

Page 34: Managing Cassandra at Scale by Al Tobey

Cassandra Collections

CREATE TABLE metrics (
  bucket timestamp,
  time timestamp,
  value blob,
  labels map<varchar,varchar>,
  PRIMARY KEY(bucket)
);

sets & maps are CRDTs, safe to modify
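
Individual map entries can likewise be written without reading the row back. A minimal sketch against the metrics table above (the label key and value are made up):

UPDATE metrics SET labels['dc'] = 'us_west'
 WHERE bucket = '2014-01-24';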

Page 35: Managing Cassandra at Scale by Al Tobey

Lightweight Transactions

• Cassandra 2.0 and later support LWT based on PAXOS

• PAXOS is a distributed consensus protocol

• Given a constraint, Cassandra ensures correct ordering

Page 36: Managing Cassandra at Scale by Al Tobey

Lightweight Transactions

UPDATE users
   SET username = 'tobert'
 WHERE id = 68021e8a-9eb0-436c-8cdd-aac629788383
    IF username = 'renice';

INSERT INTO users (id, username)
VALUES (68021e8a-9eb0-436c-8cdd-aac629788383, 'renice')
IF NOT EXISTS;

Client error on conflict.

Page 37: Managing Cassandra at Scale by Al Tobey

Brendan Gregg’s Tools Chart

Page 38: Managing Cassandra at Scale by Al Tobey

dstat -lrvn 10

Page 39: Managing Cassandra at Scale by Al Tobey
Page 40: Managing Cassandra at Scale by Al Tobey

iostat -x 1

Page 41: Managing Cassandra at Scale by Al Tobey

htop

Page 42: Managing Cassandra at Scale by Al Tobey

DataStax OpsCenter

Page 43: Managing Cassandra at Scale by Al Tobey


Config Changes: Apache 1.0 ➜ DSE 3.0


⁍ Schema: compaction_strategy = LCS
⁍ Schema: bloom_filter_fp_chance = 0.1
⁍ Schema: sstable_size_in_mb = 256
⁍ Schema: compression_options = Snappy
⁍ YAML: compaction_throughput_mb_per_sec: 0


⁍ LCS is a huge improvement in operations life (no more major compactions)
⁍ Bloom filters were tipping over a 24GiB heap
⁍ With lots of data per node, sstable sizes in LCS must be MUCH bigger
⁍ > 100,000 open files slows everything down, especially startup
⁍ 256mb vs. 5mb is a 50x reduction in file count
⁍ default has been fixed as of C* 2.0
⁍ Compaction can’t keep up: even huge rates don’t work, must be disabled
⁍ try to adjust heap, etc. so you’re flushing at nearly full memtables to reduce compaction needs
⁍ backreference RMW?
⁍ might be fixed in >= 1.2
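
As a rough sketch, the schema-side changes above map to CQL along these lines (keyspace and table names are placeholders; property names are as of Cassandra 1.2/2.0 and vary slightly between versions). The last item is a cassandra.yaml setting, not CQL:

ALTER TABLE my_keyspace.my_table
  WITH compaction = { 'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 256 }
   AND bloom_filter_fp_chance = 0.1
   AND compression = { 'sstable_compression': 'SnappyCompressor' };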

Page 44: Managing Cassandra at Scale by Al Tobey



nodetool ring

10.10.10.10 Analytics rack1 Up Normal 47.73 MB 1.72% 101204669472175663702469172037896580098

10.10.10.10 Analytics rack1 Up Normal 63.94 MB 0.86% 102671403812352122596707855690619718940

10.10.10.10 Analytics rack1 Up Normal 85.73 MB 0.86% 104138138152528581490946539343342857782

10.10.10.10 Analytics rack1 Up Normal 47.87 MB 0.86% 105604872492705040385185222996065996624

10.10.10.10 Analytics rack1 Up Normal 39.73 MB 0.86% 107071606832881499279423906648789135466

10.10.10.10 Analytics rack1 Up Normal 40.74 MB 1.75% 110042394566257506011458285920000334950

10.10.10.10 Analytics rack1 Up Normal 40.08 MB 2.20% 113781420866907675791616368030579466301

10.10.10.10 Analytics rack1 Up Normal 56.19 MB 3.45% 119650151395618797017962053073524524487

10.10.10.10 Analytics rack1 Up Normal 214.88 MB 11.62% 139424886777089715561324792149872061049

10.10.10.10 Analytics rack1 Up Normal 214.29 MB 2.45% 143588210871399618110700028431440799305

10.10.10.10 Analytics rack1 Up Normal 158.49 MB 1.76% 146577368624928021690175250344904436129

10.10.10.10 Analytics rack1 Up Normal 40.3 MB 0.92% 148140168357822348318107048925037023042

⁍ hotspots: uneven ownership shows up clearly here, one node owns 11.62% of the ring while most others own under 1%

Page 45: Managing Cassandra at Scale by Al Tobey



nodetool cfstats

Keyspace: gostress

Read Count: 0

Read Latency: NaN ms.

Write Count: 0

Write Latency: NaN ms.

Pending Tasks: 0

Column Family: stressful

SSTable count: 1

Space used (live): 32981239

Space used (total): 32981239

Number of Keys (estimate): 128

Memtable Columns Count: 0

Memtable Data Size: 0

Memtable Switch Count: 0

Read Count: 0

Read Latency: NaN ms.

Write Count: 0

Write Latency: NaN ms.

Pending Tasks: 0

Bloom Filter False Positives: 0

Bloom Filter False Ratio: 0.00000

Bloom Filter Space Used: 336

Compacted row minimum size: 7007507

Compacted row maximum size: 8409007

Compacted row mean size: 8409007

Could be using a lot of heap

Controllable by sstable_size_in_mb

⁍ bloom filters ⁍ sstable_size_in_mb

Page 46: Managing Cassandra at Scale by Al Tobey



nodetool proxyhistograms

Offset Read Latency Write Latency Range Latency

35 0 20 0

42 0 61 0

50 0 82 0

60 0 440 0

72 0 3416 0

86 0 17910 0

103 0 48675 0

124 1 97423 0

149 0 153109 0

179 2 186205 0

215 5 139022 0

258 134 44058 0

310 2656 60660 0

372 34698 742684 0

446 469515 7359351 0

535 3920391 31030588 0

642 9852708 33070248 0

770 4487796 9719615 0

924 651959 984889 0

⁍ units are microseconds ⁍ can give you a good idea of how much latency coordinator hops are costing you

Page 47: Managing Cassandra at Scale by Al Tobey



nodetool compactionstats

al@node ~ $ nodetool compactionstats

pending tasks: 3

compaction type keyspace column family bytes compacted bytes total progress

Compaction hastur gauge_archive 9819749801 16922291634 58.03%

Compaction hastur counter_archive 12141850720 16147440484 75.19%

Compaction hastur mark_archive 647389841 1475432590 43.88%

Active compaction remaining time : n/a

al@node ~ $ nodetool compactionstats

pending tasks: 3

compaction type keyspace column family bytes compacted bytes total progress

Compaction hastur gauge_archive 10239806890 16922291634 60.51%

Compaction hastur counter_archive 12544404397 16147440484 77.69%

Compaction hastur mark_archive 1107897093 1475432590 75.09%

Active compaction remaining time : n/a

Page 48: Managing Cassandra at Scale by Al Tobey



Stress Testing Tools

⁍ cassandra-stress
⁍ YCSB
⁍ Production
⁍ Terasort (DSE)
⁍ Homegrown

⁍ we mostly focus on cassandra-stress for burn-in of new clusters
⁍ can quickly figure out the right setting for -Xmn
⁍ Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it’s fast!)
⁍ Consider writing custom benchmarks for your application patterns
⁍ sometimes it’s faster to write one than figure out how to make a generic tool do what you want

Page 49: Managing Cassandra at Scale by Al Tobey



kernel.pid_max = 999999
fs.file-max = 1048576
vm.max_map_count = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.swappiness = 1

/etc/sysctl.conf

⁍ pid_max doesn’t fix anything, I just like it and have never had a problem with it
⁍ These are my starting point settings for nearly every system/application.
⁍ Generally safe for production.
⁍ vm.dirty*ratio can go big for fake fast writes, generally safe for Cassandra, but beware you’re more likely to see FS/file corruption on power loss
⁍ but you will get latency spikes if you hit dirty_ratio (percentage of RAM), so don’t tune too low

Page 50: Managing Cassandra at Scale by Al Tobey



ra=$((2**14))  # 16k
ss=$(blockdev --getss /dev/sda)
blockdev --setra $(($ra / $ss)) /dev/sda

echo 128 > /sys/block/sda/queue/nr_requests
echo deadline > /sys/block/sda/queue/scheduler
echo 16384 > /sys/block/md7/md/stripe_cache_size

/etc/rc.local

⁍ Lower readahead is better for latency on seeky workloads
⁍ More readahead will artificially increase your IOPS by reading a bunch of stuff you might not need!
⁍ nr_requests = number of IO structs the kernel will keep in flight, don’t go crazy
⁍ Deadline is best for raw throughput
⁍ CFQ supports cgroup priorities and is occasionally better for latency on SATA drives
⁍ Default stripe cache is 128. The increase seems to help MD RAID5 a lot.
⁍ Don’t forget to set readahead separately for MD RAID devices

Page 51: Managing Cassandra at Scale by Al Tobey

Ending discussion notes

• 2 socket, ECC memory
• 16GiB minimum, prefer 32-64GiB; over 128GiB and Linux will need serious tuning

• SSD where possible, Samsung 840 Pro is a good choice, any Intel is fine

• NO SAN/NAS, 20ms latency tops
• if you MUST (and please, don’t): dedicate spindles to C* nodes and use a separate network

• Avoid disk configurations targeted at Hadoop, disks are too slow

• http://www.datastax.com/documentation/cassandra/2.0/pdf/cassandra20.pdf
• read the sections on Repair, Tombstones & Snapshots