Cassandra at Teads


Page 1: Cassandra at teads

Cassandra @ Lyon Cassandra Users
Romain Hardouin - Cassandra architect @ Teads
2017-02-16

Page 2: Cassandra at teads

Cassandra @ Teads

I. About Teads
II. Architecture
III. Provisioning
IV. Monitoring & alerting
V. Tuning
VI. Tools
VII. C’est la vie
VIII. A light fork

Page 3: Cassandra at teads

I. About Teads

Page 4: Cassandra at teads

Teads is the inventor of native video advertising
With inRead, an award-winning format*

*IPA Media owner survey 2014, IAB recognized format

Page 5: Cassandra at teads

27 offices in 21 countries

500+ global employees

1.2B users (global reach)

90+ R&D employees

Page 6: Cassandra at teads

Teads growth: tracking events

Page 7: Cassandra at teads

Advertisers (to name a few)

Page 8: Cassandra at teads

Publishers (to name a few)

Page 9: Cassandra at teads

II. Architecture

Page 10: Cassandra at teads

Apache Cassandra version

Custom C* 2.1.16:
jvm.options from C* 3.0
logback from C* 2.2
Backports
Patches

Page 11: Cassandra at teads

Usage

Up to 940K qps (writes vs. reads)

Page 12: Cassandra at teads

Topology

2 regions: EU & US (3rd region APAC coming soon)

4 clusters, 7 DCs, 110 nodes
Up to 150 with temporary DCs

HP server blades: 1 cluster, 18 nodes

Page 13: Cassandra at teads

AWS nodes

Page 14: Cassandra at teads

AWS instance types

i2.2xlarge: 8 vCPU, 61 GB RAM, 2 x 800 GB attached SSD in RAID0
c3.4xlarge: 16 vCPU, 30 GB RAM, 2 x 160 GB attached SSD in RAID0
c4.4xlarge: 16 vCPU, 30 GB RAM, EBS volumes 3.4 TB + 1 TB

Workloads: tons of counters; Big Data, wide rows; many billions of keys, LCS with TTL

Page 15: Cassandra at teads

More on EBS nodes

20 x c4.4xlarge with GP2 SSD volumes:
3.4 TB data volume: 10,000 IOPS at 16 KB
1 TB commitlog volume: 3,000 IOPS at 16 KB
(GP2 delivers 3 IOPS per GB, capped at 10,000 IOPS at the time)

25 tables: batch + real time; temporary DC

Pros: cheap storage, great for STCS; snapshots (S3 backup); no coupling between disks and CPU/RAM

Cons: high latency => high I/O wait; throughput capped at 160 MB/s; unsteady performance

Page 16: Cassandra at teads

Physical nodes

Page 17: Cassandra at teads

Hardware nodes

HP Apollo XL170r Gen9: 12 CPU Xeon @ 2.60 GHz, 128 GB RAM, 3 x 1.5 TB high-end SSD in RAID0

For Big Data, supersedes the EBS DC

Page 18: Cassandra at teads

DC/Cluster split

Page 19: Cassandra at teads

Instance type change

20 x i2.2xlarge (DC X) → 20 x c3.4xlarge (DC Y)

Counters workload: cheaper and more CPUs

Counters rebuilt from DC X into DC Y (see the sketch below)
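A rebuild like this is typically driven with nodetool. A minimal per-node sketch, assuming the old DC is named DC_X (the DC names here are the slide's placeholders):

# run on each node of the new DC: stream this node's ranges from the old DC
nodetool rebuild DC_X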

Page 20: Cassandra at teads

Workload isolation

Step 1: DC split

DC A: 20 x i2.2xlarge (counters + Big Data)
DC B: 20 x c3.4xlarge (counters)
DC C: 20 x c4.4xlarge, EBS (Big Data)

DC A split into DC B and DC C via rebuild

Page 21: Cassandra at teads

Workload isolation

Step 2: cluster split

20 x c4.4xlarge, EBS (Big Data)

Big Data moves to its own cluster; AWS Direct Connect

Page 22: Cassandra at teads

Data model

Page 23: Cassandra at teads

“KISS” principle

No fancy stuff: no secondary index, no list/set/tuple, no UDT
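As a purely hypothetical illustration of this style (not Teads' actual schema): primitive key columns and counter values only, no collections, no UDTs, no secondary indexes.

CREATE TABLE stats.views_per_ad (
    ad_id bigint,
    day   text,
    views counter,
    PRIMARY KEY ((ad_id), day)
);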

Page 24: Cassandra at teads

III. Provisioning

Page 25: Cassandra at teads

Now

Capistrano → Chef

Custom cookbooks: C*, C* tools, C* reaper, Datadog wrapper

Chef provisioning to spawn a cluster

Page 26: Cassandra at teads

Future

C* cookbook: michaelklishin/cassandra-chef-cookbook + Teads custom wrapper

Terraform + Chef provisioner

Page 27: Cassandra at teads

IV. Monitoring & alerting

Page 28: Cassandra at teads

Past

OpsCenter (Free)

Page 29: Cassandra at teads

Pros: turnkey dashboards; reactive support

Cons: main metrics only; per-host graphs impossible with many hosts

Page 30: Cassandra at teads

DataStax OpsCenter, free version (v5)

Pros: ring view; more than monitoring; lots of metrics

Cons: still lacks some metrics; dashboard creation has no templates; agent is heavy

Free version limitations: data stored in the production cluster; Apache C* <= 2.1 only

Page 31: Cassandra at teads

Datadog

Pros:
All the metrics you want
Dashboard creation: templating, TimeBoard vs ScreenBoard
Graph creation: aggregation, trend, rate, anomaly detection

Cons:
No turnkey dashboards yet (may change: TLP templates)
Additional fees if >350 metrics; we need to increase this limit for our use case

Page 32: Cassandra at teads

Now we can easily:
Find outliers
Compare a node to the average
Compare two DCs
Explore a node's metrics
Create overview dashboards
Create advanced dashboards for troubleshooting

Page 33: Cassandra at teads

Datadog's cassandra.yaml

- include:
    bean_regex: org.apache.cassandra.metrics:type=ReadRepair,name=*
    attribute:
      - Count

- include:
    bean_regex: org.apache.cassandra.metrics:type=CommitLog,name=(WaitingOnCommit|WaitingOnSegmentAllocation)
    attribute:
      - Count
      - 99thPercentile

- include:
    bean: org.apache.cassandra.metrics:type=CommitLog,name=TotalCommitLogSize

- include:
    bean: org.apache.cassandra.metrics:type=ThreadPools,path=transport,scope=Native-Transport-Requests,name=MaxTasksQueued
    attribute:
      Value:
        alias: cassandra.ntr.MaxTasksQueued

Page 34: Cassandra at teads

ScreenBoard

Page 35: Cassandra at teads

TimeBoard

[TimeBoard screenshot: per-host metric series for hosts alpha, beta, gamma, delta, epsilon, zeta, eta]

Page 36: Cassandra at teads

Example

Hints monitoring during maintenance on physical nodes

[Graphs: hints storage and hints streaming]

Page 37: Cassandra at teads

Datadog Alerting

Down node
Exceptions
Commitlog size
High latency
High GC
High I/O wait
High pendings
Many hints
Long Thrift connections
Clock out of sync
Disk space… don't miss this one, and don't forget the / (root) partition
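For illustration, such alerts translate into Datadog monitor queries. A sketch built on the metric aliased in the Datadog cassandra.yaml above (threshold and tag are hypothetical):

avg(last_10m):avg:cassandra.ntr.MaxTasksQueued{cluster:counters-eu} by {host} > 1024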

Page 38: Cassandra at teads

V. Tuning

Page 39: Cassandra at teads

Java 8, CMS → G1

cassandra-env.sh:
-Dcassandra.max_queued_native_transport_requests=4096
-Dcassandra.fd_initial_value_ms=4000
-Dcassandra.fd_max_interval_ms=4000

Page 40: Cassandra at teads

jvm.options

GC logs enabled

-XX:MaxGCPauseMillis=200
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:G1HeapRegionSize=32m
-XX:G1HeapWastePercent=25
-XX:InitiatingHeapOccupancyPercent=?
-XX:ParallelGCThreads=#CPU
-XX:ConcGCThreads=#CPU
-XX:+ExplicitGCInvokesConcurrent
-XX:+ParallelRefProcEnabled
-XX:+UseCompressedOops
-XX:HeapDumpPath=<dir with enough free space>
-XX:ErrorFile=<custom dir>
-Djava.io.tmpdir=<custom dir>
-XX:-UseBiasedLocking
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:+PerfDisableSharedMem
-XX:+AlwaysPreTouch
...

Backport from C* 3.0
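Reading the placeholders above concretely (following the slide's #CPU rule rather than the G1 defaults), a 16-vCPU c3.4xlarge would get:

-XX:ParallelGCThreads=16
-XX:ConcGCThreads=16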

Page 41: Cassandra at teads

AWS nodes

num_tokens: 256
native_transport_max_threads: 256 or 128
compaction_throughput_mb_per_sec: 64
concurrent_compactors: 4 or 2
concurrent_reads: 64
concurrent_writes: 128 or 64
concurrent_counter_writes: 128
hinted_handoff_throttle_in_kb: 10240
max_hints_delivery_threads: 6 or 4
memtable_cleanup_threshold: 0.6, 0.5 or 0.4
memtable_flush_writers: 4 or 2
trickle_fsync: true
trickle_fsync_interval_in_kb: 10240
dynamic_snitch_badness_threshold: 2.0
internode_compression: dc

Heap: 15 GB on c3.4xlarge, 24 GB on i2.2xlarge

Page 42: Cassandra at teads

AWS nodes

EBS volume != disk

compaction_throughput_mb_per_sec: 32
concurrent_compactors: 4
concurrent_reads: 32
concurrent_writes: 64
concurrent_counter_writes: 64
trickle_fsync_interval_in_kb: 1024

Heap: 15 GB on c4.4xlarge

Page 43: Cassandra at teads

Hardware nodes

num_tokens: 8 (more on this later)
initial_token: ...
native_transport_max_threads: 512
compaction_throughput_mb_per_sec: 128
concurrent_compactors: 4
concurrent_reads: 64
concurrent_writes: 128
concurrent_counter_writes: 128
hinted_handoff_throttle_in_kb: 10240
max_hints_delivery_threads: 6
memtable_cleanup_threshold: 0.4
memtable_flush_writers: 8
trickle_fsync: true
trickle_fsync_interval_in_kb: 10240

Heap: 24 GB

Page 44: Cassandra at teads

Why 8 tokens?

Better repair performance, important for Big Data
Evenly distributed tokens, stored in a Chef data bag

Hardware nodes

./vnodes_token_generator.py --json --indent 2 --servers hosts_interleaved_racks.txt 4
{
  "192.168.1.1": "-9223372036854775808,-4611686018427387905,-2,4611686018427387901",
  "192.168.2.1": "-7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202",
  "192.168.3.1": "-6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503"
}

https://github.com/rhardouin/cassandra-scripts

Watch out! Know the drawbacks
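To make the wiring concrete: each node's generated token list goes into that node's cassandra.yaml (the sample run above generates 4 tokens per node; the hardware DC uses 8). A minimal sketch for 192.168.1.1:

num_tokens: 4
initial_token: -9223372036854775808,-4611686018427387905,-2,4611686018427387901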

Page 45: Cassandra at teads

Compression

Small entries, lots of reads:

compression = {'chunk_length_kb': '4', 'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
+ nodetool scrub (few GB)
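A minimal sketch of applying this, assuming a hypothetical table ks.small_entries; nodetool scrub then rewrites the existing SSTables (a few GB here) with the new chunk size:

ALTER TABLE ks.small_entries
WITH compression = {'chunk_length_kb': '4',
                    'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'};

-- then, on each node:
-- nodetool scrub ks small_entries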

Page 46: Cassandra at teads

Dynamic Snitch

Disabled on 2 small clusters: dynamic_snitch: false

Lower hop count

Page 47: Cassandra at teads

Dynamic Snitch

[Graph: client-side latency — mean, P75, P95]

Page 48: Cassandra at teads

Which node to decommission?

Downscale

Page 49: Cassandra at teads

Clients

Page 50: Cassandra at teads

Scala apps: DataStax driver wrapper

Spark & Spark Streaming: DataStax Spark Cassandra Connector

Page 51: Cassandra at teads

DataStax driver policy

LatencyAwarePolicy → TokenAwarePolicy

LatencyAwarePolicy: hotspots due to premature node eviction; needs thorough tuning and a steady workload. We dropped it.

TokenAwarePolicy: shuffles replicas depending on CL.
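A minimal sketch of this policy chain, using the DataStax Java driver 3.x API from Scala (contact point and local DC name are hypothetical):

import com.datastax.driver.core.Cluster
import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

object ClusterSetup {
  val cluster: Cluster = Cluster.builder()
    .addContactPoint("10.0.0.1") // hypothetical
    .withLoadBalancingPolicy(
      new TokenAwarePolicy(
        DCAwareRoundRobinPolicy.builder().withLocalDc("eu-west").build(),
        true)) // true = shuffle among replicas
    .build()
}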

Page 52: Cassandra at teads

For cross-region scheduled jobs

VPN between AWS regions

20 executors with 6GB RAM

output.consistency.level = (LOCAL_)ONE

output.concurrent.writes = 50

connection.compression = LZ4
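In Spark Cassandra Connector terms, these settings map to spark.cassandra.* properties. A sketch (app name is hypothetical):

import org.apache.spark.SparkConf

object CrossRegionJob {
  val conf: SparkConf = new SparkConf()
    .setAppName("cross-region-sync") // hypothetical
    .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")
    .set("spark.cassandra.output.concurrent.writes", "50")
    .set("spark.cassandra.connection.compression", "LZ4")
}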

Page 53: Cassandra at teads

Useless writes: 99% of empty unlogged batches on one DC

What an optimization!

Page 54: Cassandra at teads

VI. Tools

Page 55: Cassandra at teads

Rundeck: "Job Scheduler & Runbook Automation"

{Parallel SSH + cron} on steroids: security, history (who/what/when/why), output is kept

Used for: CQL migrations, rolling restarts (per-node sketch below), nodetool or JMX commands, backup and snapshot jobs

We added a "comment" field
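A minimal sketch of the per-node step a rolling-restart job might run (service name and readiness check are assumptions, not Teads' actual job):

#!/bin/sh
# flush memtables and stop accepting traffic before restarting
nodetool drain
sudo service cassandra restart
# wait until this node reports Up/Normal again before the job moves on
until nodetool status | grep "$(hostname -i)" | grep -q '^UN'; do
  sleep 10
done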

Page 56: Cassandra at teads

Cassandra Reaper: scheduled range repair
Segments: up to 20,000 for TB tables

Hosted fork for C* 2.1; we will probably switch to TLP's fork

We do not use incremental repairs (see fix in C* 4.0)

Page 57: Cassandra at teads

cassandra_snapshotter: backup on S3, scheduled with Rundeck

We created and use a fork; some PRs merged upstream; restore PR still to be merged

Page 58: Cassandra at teads

Logs management:
"C* " and "out of sync"
"C* " and "new session: will sync" | count
...

Alerts on pattern:
"C* " and "[ERROR]"
"C* " and "[WARN]" and not ( … )
...

Page 59: Cassandra at teads

VII. C’est la vie

Page 60: Cassandra at teads

Cassandra issues & failures

Page 61: Cassandra at teads

OS reboots, without any obvious reason… seems harmless, right?
Cassandra service was enabled, so nodes restarted automatically

Want a clue? C* 2.0 + counters

Upgrade to C* 2.1 was a relief

Page 62: Cassandra at teads

Upgrade 2.0 → 2.1

LCS cluster suffered: high load, pending compactions growing

Switched to off-heap memtables: less GC => less load (see the snippet below)

Reduced client load; better after sstables upgrade, which took days
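The off-heap memtable switch is a single cassandra.yaml setting in C* 2.1; a minimal sketch:

memtable_allocation_type: offheap_objects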

Page 63: Cassandra at teads

Upgrade 2.0 → 2.1

Lots of NTR "All time blocked"

The NTR queue was undersized for our workload: 128, hard-coded

We added a property to test CASSANDRA-11363 and set the value higher and higher… up to 4096

The NTR pool needs to be sized accordingly

Page 64: Cassandra at teads

After replacing nodes:

DELETE FROM system.peers WHERE peer = '<replaced node>';

system.peers is used by the DataStax driver for auto-discovery. Note that it is a node-local table, so stale entries must be removed on each node that still lists the replaced peer.

Page 65: Cassandra at teads

AWS issues & failures

Page 66: Cassandra at teads

"When you have to shoot, shoot, don't talk!"
— The Good, the Bad and the Ugly

Page 67: Cassandra at teads

"We'll shoot your instance" (scheduled instance retirement notice)

"We shot your instance" (instance already gone)

EBS volume latency spike

EBS volume unreachable

Page 68: Cassandra at teads

SSD

Page 69: Cassandra at teads

SSD

dstat -tnvlr

Page 70: Cassandra at teads

Hardware issues & failures

Page 71: Cassandra at teads

One SSD failed

CPUs suddenly became slow on one server: Smart Array battery + BIOS bug

Yup, not an SSD...

Page 72: Cassandra at teads

VIII. A light fork

Page 73: Cassandra at teads

Why a fork?

1. Need to add a patch ASAP: high blocked NTR, CASSANDRA-11363. Requires deploying from source.

2. Why not backport interesting tickets?

3. Why not add small features/fixes? E.g. expose tasks queue length via JMX, CASSANDRA-12758

You betcha!

Page 74: Cassandra at teads

A tiny fork

We keep it as small as possible to fit our needs

Even smaller when we upgrade to C* 3.0: backports will be obsolete

Page 75: Cassandra at teads

« Hey, if your fork is tiny, is it that useful? »

Page 76: Cassandra at teads

One example: repair

Page 77: Cassandra at teads

Backport of CASSANDRA-12580

Paulo Motta: log2 vs ln
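For intuition only (this is not the patch itself): repair compares Merkle trees, and the depth of a binary tree over n ranges differs a lot between log2 and ln, so mixing them up changes the tree size by orders of magnitude.

// Scala REPL intuition, assuming n = 1e9 ranges
val n = 1e9
math.log(n) / math.log(2) // log2(1e9) ≈ 29.9
math.log(n)               // ln(1e9)  ≈ 20.7
// ~9 levels difference => 2^9 = 512x fewer leaves => much coarser comparison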

Page 78: Cassandra at teads

Get this info without the DEBUG burden

Original fix

One-liner fix

Page 79: Cassandra at teads

The most impressive result, for a set of tables:
Before: 23 days
With CASSANDRA-12580: 16 hours

Longest repair for a single table: 2.5 days
Impossible to repair this table before the patch
Now it fits in gc_grace_seconds

Page 80: Cassandra at teads

It was a critical fix for us

It should have landed in 2.1.17, IMHO*. Repair is a mandatory operation in many use cases.

Paulo already made the patch for 2.1

C* 2.1 is widely used

[*] Full post: http://www.mail-archive.com/[email protected]/msg49344.html

Page 81: Cassandra at teads

« Why bother with backports? Why not upgrade to 3.0? »

Page 82: Cassandra at teads

Because C* is critical for our business

We don't need fancy stuff (SASI, MV, UDF, ...). We just want a rock-solid, scalable DB.

C* 2.1 does the job for the time being

We plan to upgrade to C* 3.0 in 2017. We will do thorough tests ;-)

Page 83: Cassandra at teads

« What about C* 2.2? »

Page 84: Cassandra at teads

C* 2.2 has some nice improvements:
Bootstrapping with LCS: send source sstable level [1]
Range movement causes CPU & performance impact [2]
Resumable bootstrap/rebuild streaming [3]

[1] CASSANDRA-7460
[2] CASSANDRA-9258
[3] CASSANDRA-8838, CASSANDRA-8942, CASSANDRA-10810

But the migration path 2.2 → 3.0 is risky (just my opinion, based on the users mailing list; DSE never used 2.2)

Page 85: Cassandra at teads

Questions?

Page 86: Cassandra at teads

Thanks!

C* devs for their awesome work
Zenika for hosting this meetup
You for being here!