Managing Cassandra at Scale by Al Tobey

Transcript of Managing Cassandra at Scale by Al Tobey

Page 1: Managing Cassandra at Scale by Al Tobey

©2014 DataStax

@AlTobey Open Source Mechanic | Open source evangelist for Apache Cassandra at DataStax

Managing Cassandra : SCALE 12x


Obsessed with infrastructure my whole life. See distributed systems everywhere.

Page 2: Managing Cassandra at Scale by Al Tobey

Five years of Cassandra

Timeline: releases 0.1, 0.3, 0.6, 0.7, 1.0, 1.2 … 2.0 and DSE, spanning years 0 through 5, starting Jul-08.

Page 3: Managing Cassandra at Scale by Al Tobey
Page 4: Managing Cassandra at Scale by Al Tobey

Why Cassandra?

Or any non-relational?

Page 5: Managing Cassandra at Scale by Al Tobey

アルトビー (Al Tobey)

Leo Rufus

My boys. I did not have permission from my wife to use her image ;)

Page 6: Managing Cassandra at Scale by Al Tobey

Less Tolerant

For the record. Can you spell HTTP?

More and more services online. And they must be available.

Page 7: Managing Cassandra at Scale by Al Tobey

Why do 100 million internet users in Japan choose Cassandra?

100,000,000 internet users in Japan. Across the world … B2B, video conferencing, shopping, entertainment.

Page 8: Managing Cassandra at Scale by Al Tobey

Traditional solutions…

…may not be a good fit

Page 9: Managing Cassandra at Scale by Al Tobey

Client-server

All your wildest dreams will come true.

Client-server database architecture. Obsolete. The original “funnel-shaped” architecture.

Page 10: Managing Cassandra at Scale by Al Tobey

3-tier

Client - client - server database architecture. Still suitable for small applications, e.g. LAMP, RoR.

Page 11: Managing Cassandra at Scale by Al Tobey

3-tier + caching

master / slave / slave

cache

More complex. Cache coherency is a hard problem. Cascading failures are common. Next: out with the funnel, in with the ring.

Page 12: Managing Cassandra at Scale by Al Tobey

Webscale

Outer ring: clients (cell phones, etc.). Middle ring: application servers. Inner ring: Cassandra servers. Serving millions of clients with mere hundreds or thousands of nodes requires a different approach to applications!

Page 13: Managing Cassandra at Scale by Al Tobey

When it Rains

scale out … but at a cost

Page 14: Managing Cassandra at Scale by Al Tobey

A new plan

Page 15: Managing Cassandra at Scale by Al Tobey

Dynamo Paper (2007)

• How do we build a data store that is:

• Reliable

• Performant

• “Always On”

• Nothing new and shiny

Evolutionary. Real. Computer Science.

Also the basis for Riak and Voldemort

Page 16: Managing Cassandra at Scale by Al Tobey

BigTable (2006)

• Richer data model

• 1 key. Lots of values

• Fast sequential access

• 38 Papers cited

Page 17: Managing Cassandra at Scale by Al Tobey

Cassandra (2008)

• Distributed features of Dynamo

• Data Model and storage from BigTable

• On February 17, 2010 it graduated to a top-level Apache project

Page 18: Managing Cassandra at Scale by Al Tobey

Cassandra - More than one server

• All nodes participate in a cluster

• Shared nothing

• Add or remove as needed

• More capacity? Add a server


Page 19: Managing Cassandra at Scale by Al Tobey


Chart: throughput (ops/sec) for Cassandra, HBase, Redis, and MySQL in the VLDB benchmark (RWS).

Page 20: Managing Cassandra at Scale by Al Tobey

Cassandra - Fully Replicated

• Client writes local

• Data syncs across WAN

• Replication per Data Center

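Per-data-center replication is configured on the keyspace. A minimal CQL sketch, assuming made-up keyspace and data center names (the DC names must match what the cluster's snitch reports):

CREATE KEYSPACE employees
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_west': 3,
    'tokyo': 3
  };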

Page 21: Managing Cassandra at Scale by Al Tobey

Read-Modify-Write

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

Before: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null

After:  EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

This might be what it looks like from SQL / CQL, but …!

Page 22: Managing Cassandra at Scale by Al Tobey

Read-Modify-Write

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

TNSTAAFL: there is no such thing as a free lunch

After:  EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

Before: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null

RDBMS

TNSTAAFL … If you’re lucky, the cell is in cache. Otherwise, it’s a disk access to read, another to write.

Page 23: Managing Cassandra at Scale by Al Tobey

Eventual Consistency

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

After:  EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

Before: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null

Coordinator

Explain distributed RMW. More complicated. Will talk about how it’s abstracted in CQL later.

Page 24: Managing Cassandra at Scale by Al Tobey

Eventual Consistency

UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

After:  EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 4 | Promoted 2014-01-24

Before: EmployeeID 1337 | Name アルトビー | StartDate 2013-10-01 | Rank 3 | Promoted null

Coordinator

read

write

Memory replication on write, depending on RF, usually RF=3. Reads AND writes remain available through partitions. Hinted handoff.
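
How many replicas must acknowledge a read or write is chosen per request, not per cluster. A minimal cqlsh sketch reusing the Employees example from these slides (LOCAL_QUORUM is just one common choice):

CONSISTENCY LOCAL_QUORUM;
UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;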

Page 25: Managing Cassandra at Scale by Al Tobey


Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Albert     Tuesday 6    Wednesday 0
Evan       Tuesday 0    Wednesday 0
Frank      Tuesday 3    Wednesday 3
Kelvin     Tuesday 0    Wednesday 0
Krzysztof  Tuesday 0    Wednesday 0
Phillip    Tuesday 12   Wednesday 0

⁍ CF to track how much I expect my team at Ooyala to drink ⁍ Row keys are names ⁍ Column keys are days ⁍ Values are a count of drinks

Page 26: Managing Cassandra at Scale by Al Tobey


Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Al         Tuesday 2    Wednesday 0
Phillip    Tuesday 0    Wednesday 1

ssTable:
Albert     Tuesday 6    Wednesday 0
Evan       Tuesday 0    Wednesday 0
Frank      Tuesday 3    Wednesday 3
Kelvin     Tuesday 0    Wednesday 0
Krzysztof  Tuesday 0    Wednesday 0
Phillip    Tuesday 12   Wednesday 0

⁍ Next day, after a flush ⁍ I’m speaking so I decided to drink less ⁍ Phillip informs me that he has quit drinking

Page 27: Managing Cassandra at Scale by Al Tobey


Avoiding read-modify-write

cassandra13_drinks column family

memTable:
Albert     Tuesday 22   Wednesday 0

ssTable:
Albert     Tuesday 2    Wednesday 0
Phillip    Tuesday 0    Wednesday 1

ssTable:
Albert     Tuesday 6    Wednesday 0
Evan       Tuesday 0    Wednesday 0
Frank      Tuesday 3    Wednesday 3
Kelvin     Tuesday 0    Wednesday 0
Krzysztof  Tuesday 0    Wednesday 0
Phillip    Tuesday 12   Wednesday 0

⁍ I’m drinking with all you people so I decide to add 20 ⁍ read 2, add 20, write 22

Page 28: Managing Cassandra at Scale by Al Tobey


Avoiding read-modify-write


cassandra13_drinks column family

ssTable

Albert Tuesday 22 Wednesday 0

Evan Tuesday 0 Wednesday 0

Frank Tuesday 3 Wednesday 3

Kelvin Tuesday 0 Wednesday 0

Krzysztof Tuesday 0 Wednesday 0

Phillip Tuesday 0 Wednesday 1

⁍ After compaction & conflict resolution ⁍ Overwriting the same value is just fine! Works really well for some patterns such as time-series data ⁍ Separate read/write streams handy for debugging, but not a big deal
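
In CQL terms, the column family above could be modeled roughly as follows; the schema is an assumption (the slides show the storage view, not a CQL definition), and the “read 2, add 20, write 22” step is just a blind overwrite:

CREATE TABLE cassandra13_drinks (
  name varchar,
  day varchar,
  drinks int,
  PRIMARY KEY (name, day)
);

UPDATE cassandra13_drinks SET drinks = 22 WHERE name = 'Albert' AND day = 'Tuesday';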

Page 29: Managing Cassandra at Scale by Al Tobey

Overwriting

CREATE TABLE host_lookup (
  name varchar,
  id uuid,
  PRIMARY KEY(name)
);

INSERT INTO host_lookup (name, id) VALUES ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

INSERT INTO host_lookup (name, id) VALUES ('tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

INSERT INTO host_lookup (name, id) VALUES ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

SELECT id FROM host_lookup WHERE name='tobert.org';

Beware of expensive compaction. Best for: small indexes, lookup tables. Compaction handles RMW at the storage level in the background. Under heavy writes, clock synchronization is very important to avoid timestamp collisions. In practice, this isn’t a problem very often and even when it goes wrong, not much harm done.
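
Those overwrite conflicts are resolved by write timestamp, which is why clock synchronization matters. A client that needs to control ordering itself can supply the timestamp explicitly; a minimal sketch against the host_lookup table above (the microseconds-since-epoch value is arbitrary):

INSERT INTO host_lookup (name, id)
VALUES ('tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77)
USING TIMESTAMP 1390567890000000;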

Page 30: Managing Cassandra at Scale by Al Tobey

Key/Value

CREATE TABLE keyval (
  key VARCHAR,
  value blob,
  PRIMARY KEY(key)
);

INSERT INTO keyval (key, value) VALUES (?, ?);

SELECT value FROM keyval WHERE key=?;

e.g. memcached. Don’t do this. But it works when you really need it.

Page 31: Managing Cassandra at Scale by Al Tobey

Journaling / Logging / Time-series

CREATE TABLE tsdb (
  time_bucket timestamp,
  time timestamp,
  value blob,
  PRIMARY KEY(time_bucket, time)
);

INSERT INTO tsdb (time_bucket, time, value) VALUES (
  '2014-10-24',            -- 1-day bucket (UTC)
  '2014-10-24T12:12:12Z',  -- ALWAYS USE UTC
  '{"foo": "bar"}'
);

Oversimplified; use normalization over blobs whenever possible. ALWAYS USE UTC :)
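
Because time_bucket is the partition key and time is the clustering column, a whole bucket is stored together and can be sliced by time range. A minimal sketch against the tsdb table above (the window is arbitrary):

SELECT time, value FROM tsdb
 WHERE time_bucket = '2014-10-24'
   AND time >= '2014-10-24T00:00:00Z'
   AND time < '2014-10-24T06:00:00Z';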

Page 32: Managing Cassandra at Scale by Al Tobey

Journaling / Logging / Time-series

{"“2014(01(24”"=>"{""""“2014(01(24T12:12:12Z”"=>"{""""""""‘{“foo”:"“bar”}’""""}}

2014(01(24 2014(01(24T12:12:12Z{“key”:"“value”}

2014(01(25 2014(01(25T13:13:13Z{“key”:"“value”}

2014(01(24T21:21:21Z{“key”:"“value”}

Oversimplified; use normalization over blobs whenever possible. ALWAYS USE UTC :)

Page 33: Managing Cassandra at Scale by Al Tobey

Cassandra Collections

CREATE TABLE posts (
  id uuid,
  body varchar,
  created timestamp,
  authors set<varchar>,
  tags set<varchar>,
  PRIMARY KEY(id)
);

INSERT INTO posts (id, body, created, authors, tags) VALUES (
  ea4aba7d-9344-4d08-8ca5-873aa1214068,
  'アルトビーの犬はばかね',
  'now',
  {'アルトビー', 'ィオートビー'},
  {'dog', 'silly', '犬', 'ばか'}
);

Quick story about 犬ばかね (“silly dog”). Sets & maps are CRDTs, safe to modify.
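
Collection columns can be mutated in place without reading first. A minimal sketch against the posts table above (the added tag is made up):

UPDATE posts SET tags = tags + {'cassandra'}
 WHERE id = ea4aba7d-9344-4d08-8ca5-873aa1214068;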

Page 34: Managing Cassandra at Scale by Al Tobey

Cassandra Collections

CREATE TABLE metrics (
  bucket timestamp,
  time timestamp,
  value blob,
  labels map<varchar,varchar>,
  PRIMARY KEY(bucket)
);

sets & maps are CRDTs, safe to modify
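
Individual map entries can likewise be written without reading the row back. A minimal sketch against the metrics table above (the label key and value are made up):

UPDATE metrics SET labels['dc'] = 'us_west'
 WHERE bucket = '2014-01-24';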

Page 35: Managing Cassandra at Scale by Al Tobey

Lightweight Transactions

• Cassandra 2.0 and later support LWT based on PAXOS

• PAXOS is a distributed consensus protocol

• Given a constraint, Cassandra ensures correct ordering

Page 36: Managing Cassandra at Scale by Al Tobey

Lightweight Transactions

UPDATE users
   SET username = 'tobert'
 WHERE id = 68021e8a-9eb0-436c-8cdd-aac629788383
    IF username = 'renice';

INSERT INTO users (id, username)
VALUES (68021e8a-9eb0-436c-8cdd-aac629788383, 'renice')
IF NOT EXISTS;

Client error on conflict.

Page 37: Managing Cassandra at Scale by Al Tobey

Brendan Gregg’s Tools Chart

Page 38: Managing Cassandra at Scale by Al Tobey

dstat -lrvn 10

Page 39: Managing Cassandra at Scale by Al Tobey
Page 40: Managing Cassandra at Scale by Al Tobey

iostat -x 1

Page 41: Managing Cassandra at Scale by Al Tobey

htop

Page 42: Managing Cassandra at Scale by Al Tobey

DataStax OpsCenter

Page 43: Managing Cassandra at Scale by Al Tobey


Config Changes: Apache 1.0 ➜ DSE 3.0


⁍ Schema: compaction_strategy = LCS
⁍ Schema: bloom_filter_fp_chance = 0.1
⁍ Schema: sstable_size_in_mb = 256
⁍ Schema: compression_options = Snappy
⁍ YAML: compaction_throughput_mb_per_sec: 0


⁍ LCS is a huge improvement in operations life (no more major compactions)
⁍ Bloom filters were tipping over a 24GiB heap
⁍ With lots of data per node, sstable sizes in LCS must be MUCH bigger
⁍ > 100,000 open files slows everything down, especially startup
⁍ 256mb vs. 5mb is a 50x reduction in file count
⁍ default has been fixed as of C* 2.0
⁍ Compaction can’t keep up: even huge rates don’t work, must be disabled
⁍ try to adjust heap, etc. so you’re flushing at nearly full memtables to reduce compaction needs
⁍ backreference RMW?
⁍ might be fixed in >= 1.2
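
As a rough sketch, the schema-side changes above map to CQL along these lines (keyspace and table names are placeholders; property names are as of Cassandra 1.2/2.0 and vary slightly between versions). The last item is a cassandra.yaml setting, not CQL:

ALTER TABLE my_keyspace.my_table
  WITH compaction = { 'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 256 }
   AND bloom_filter_fp_chance = 0.1
   AND compression = { 'sstable_compression': 'SnappyCompressor' };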

Page 44: Managing Cassandra at Scale by Al Tobey



nodetool ring

10.10.10.10 Analytics rack1 Up Normal 47.73 MB 1.72% 101204669472175663702469172037896580098

10.10.10.10 Analytics rack1 Up Normal 63.94 MB 0.86% 102671403812352122596707855690619718940

10.10.10.10 Analytics rack1 Up Normal 85.73 MB 0.86% 104138138152528581490946539343342857782

10.10.10.10 Analytics rack1 Up Normal 47.87 MB 0.86% 105604872492705040385185222996065996624

10.10.10.10 Analytics rack1 Up Normal 39.73 MB 0.86% 107071606832881499279423906648789135466

10.10.10.10 Analytics rack1 Up Normal 40.74 MB 1.75% 110042394566257506011458285920000334950

10.10.10.10 Analytics rack1 Up Normal 40.08 MB 2.20% 113781420866907675791616368030579466301

10.10.10.10 Analytics rack1 Up Normal 56.19 MB 3.45% 119650151395618797017962053073524524487

10.10.10.10 Analytics rack1 Up Normal 214.88 MB 11.62% 139424886777089715561324792149872061049

10.10.10.10 Analytics rack1 Up Normal 214.29 MB 2.45% 143588210871399618110700028431440799305

10.10.10.10 Analytics rack1 Up Normal 158.49 MB 1.76% 146577368624928021690175250344904436129

10.10.10.10 Analytics rack1 Up Normal 40.3 MB 0.92% 148140168357822348318107048925037023042

⁍ hotspots: uneven ownership shows up clearly here, one node owns 11.62% of the ring while most others own under 1%

Page 45: Managing Cassandra at Scale by Al Tobey



nodetool cfstats

Keyspace: gostress

Read Count: 0

Read Latency: NaN ms.

Write Count: 0

Write Latency: NaN ms.

Pending Tasks: 0

Column Family: stressful

SSTable count: 1

Space used (live): 32981239

Space used (total): 32981239

Number of Keys (estimate): 128

Memtable Columns Count: 0

Memtable Data Size: 0

Memtable Switch Count: 0

Read Count: 0

Read Latency: NaN ms.

Write Count: 0

Write Latency: NaN ms.

Pending Tasks: 0

Bloom Filter False Positives: 0

Bloom Filter False Ratio: 0.00000

Bloom Filter Space Used: 336

Compacted row minimum size: 7007507

Compacted row maximum size: 8409007

Compacted row mean size: 8409007

Could be using a lot of heap

Controllable by sstable_size_in_mb

⁍ bloom filters ⁍ sstable_size_in_mb

Page 46: Managing Cassandra at Scale by Al Tobey



nodetool proxyhistograms

Offset Read Latency Write Latency Range Latency

35 0 20 0

42 0 61 0

50 0 82 0

60 0 440 0

72 0 3416 0

86 0 17910 0

103 0 48675 0

124 1 97423 0

149 0 153109 0

179 2 186205 0

215 5 139022 0

258 134 44058 0

310 2656 60660 0

372 34698 742684 0

446 469515 7359351 0

535 3920391 31030588 0

642 9852708 33070248 0

770 4487796 9719615 0

924 651959 984889 0

⁍ units are microseconds ⁍ can give you a good idea of how much latency coordinator hops are costing you

Page 47: Managing Cassandra at Scale by Al Tobey



nodetool compactionstats

al@node ~ $ nodetool compactionstats

pending tasks: 3

compaction type keyspace column family bytes compacted bytes total progress

Compaction hastur gauge_archive 9819749801 16922291634 58.03%

Compaction hastur counter_archive 12141850720 16147440484 75.19%

Compaction hastur mark_archive 647389841 1475432590 43.88%

Active compaction remaining time : n/a

al@node ~ $ nodetool compactionstats

pending tasks: 3

compaction type keyspace column family bytes compacted bytes total progress

Compaction hastur gauge_archive 10239806890 16922291634 60.51%

Compaction hastur counter_archive 12544404397 16147440484 77.69%

Compaction hastur mark_archive 1107897093 1475432590 75.09%

Active compaction remaining time : n/a

Page 48: Managing Cassandra at Scale by Al Tobey



Stress Testing Tools

⁍ cassandra-stress
⁍ YCSB
⁍ Production
⁍ Terasort (DSE)
⁍ Homegrown

⁍ we mostly focus on cassandra-stress for burn-in of new clusters
⁍ can quickly figure out the right setting for -Xmn
⁍ Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it’s fast!)
⁍ Consider writing custom benchmarks for your application patterns
⁍ sometimes it’s faster to write one than figure out how to make a generic tool do what you want

Page 49: Managing Cassandra at Scale by Al Tobey



kernel.pid_max = 999999
fs.file-max = 1048576
vm.max_map_count = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.swappiness = 1

/etc/sysctl.conf

⁍ pid_max doesn’t fix anything, I just like it and have never had a problem with it
⁍ These are my starting point settings for nearly every system/application.
⁍ Generally safe for production.
⁍ vm.dirty*ratio can go big for fake fast writes, generally safe for Cassandra, but beware you’re more likely to see FS/file corruption on power loss
⁍ but you will get latency spikes if you hit dirty_ratio (percentage of RAM), so don’t tune too low

Page 50: Managing Cassandra at Scale by Al Tobey



ra=$((2**14))  # 16k
ss=$(blockdev --getss /dev/sda)
blockdev --setra $(($ra / $ss)) /dev/sda

echo 128 > /sys/block/sda/queue/nr_requests
echo deadline > /sys/block/sda/queue/scheduler
echo 16384 > /sys/block/md7/md/stripe_cache_size

/etc/rc.local

⁍ Lower readahead is better for latency on seeky workloads
⁍ More readahead will artificially increase your IOPS by reading a bunch of stuff you might not need!
⁍ nr_requests = number of IO structs the kernel will keep in flight, don’t go crazy
⁍ Deadline is best for raw throughput
⁍ CFQ supports cgroup priorities and is occasionally better for latency on SATA drives
⁍ Default stripe cache is 128. The increase seems to help MD RAID5 a lot.
⁍ Don’t forget to set readahead separately for MD RAID devices

Page 51: Managing Cassandra at Scale by Al Tobey

Ending discussion notes

• 2 socket, ECC memory
• 16GiB minimum, prefer 32-64GiB; over 128GiB and Linux will need serious tuning

• SSD where possible, Samsung 840 Pro is a good choice, any Intel is fine

• NO SAN/NAS, 20ms latency tops
• if you MUST (and please, don’t): dedicate spindles to C* nodes and use a separate network

• Avoid disk configurations targeted at Hadoop, disks are too slow

• http://www.datastax.com/documentation/cassandra/2.0/pdf/cassandra20.pdf
• read the sections on Repair, Tombstones & Snapshots