Cassandra at Teads

  • Cassandra @ Teads

    Lyon Cassandra Users
    Romain Hardouin, Cassandra architect @ Teads
    2017-02-16

  • Cassandra @ Teads

    I. About Teads
    II. Architecture
    III. Provisioning
    IV. Monitoring & alerting
    V. Tuning
    VI. Tools
    VII. C'est la vie
    VIII. A light fork

  • I. About Teads

  • Teads is the inventor of native video advertising, with inRead, an award-winning format*

    *IPA Media owner survey 2014, IAB recognized format

  • 27 offices in 21 countries

    500+ global employees

    1.2B users global reach

    90+ R&D employees

  • Teads growth: tracking events

  • Advertisers (to name a few)

  • Publishers (to name a few)

  • II. Architecture

  • Apache Cassandra version

    Custom C* 2.1.16
    jvm.options from C* 3.0
    logback from C* 2.2
    Backports
    Patches

  • Usage

    Up to 940K qps: writes vs reads

  • Topology

    2 regions: EU & US, 3rd region APAC coming soon
    4 clusters, 7 DCs, 110 nodes
    Up to 150 with temporary DCs
    HP server blades: 1 cluster, 18 nodes

  • AWS nodes

  • AWS instance types

    i2.2xlarge: 8 vCPU, 61 GB RAM, 2 x 800 GB attached SSD in RAID0
    c3.4xlarge: 16 vCPU, 30 GB RAM, 2 x 160 GB attached SSD in RAID0
    c4.4xlarge: 16 vCPU, 30 GB RAM, EBS 3.4 TB + 1 TB

    Tons of counters
    Big Data, wide rows
    Many billions of keys, LCS with TTL

  • More on EBS nodes

    20 x c4.4xlarge with GP2 SSD
    3.4 TB data volume: 10,000 IOPS, 16 KB
    1 TB commitlog volume: 3,000 IOPS, 16 KB
    25 tables: batch + real time
    Temporary DC

    Cheap storage, great for STCS
    Snapshots (S3 backup)
    No coupling between disks and CPU/RAM

    High latency => high I/O wait
    Throughput: 160 MB/s
    Unsteady performance

  • Physical nodes

  • Hardware nodes

    HP Apollo XL170r Gen9: 12 CPU Xeon @ 2.60 GHz, 128 GB RAM, 3 x 1.5 TB high-end SSD in RAID0

    For Big Data, supersedes the EBS DC

  • DC/Cluster split

  • Instance type change

    20 x i2.2xlarge -> 20 x c3.4xlarge
    Counters
    Cheaper and more CPUs
    Rebuild: DC X -> DC Y

  • Workload isolation

    Step 1: DC split
    DC A: 20 x i2.2xlarge, Counters + Big Data
    DC B: 20 x c3.4xlarge, Counters
    DC C: 20 x c4.4xlarge (EBS), Big Data
    Rebuild

  • Workload isolation

    Step 2: Cluster split
    Big Data: 20 x c4.4xlarge (EBS)
    AWS Direct Connect

  • Data model

  • KISS principle

    No fancy stuff: no secondary index, no list/set/tuple, no UDT

  • III. Provisioning

  • Now

    Capistrano + Chef
    Custom cookbooks: C*, C* tools, C* reaper, Datadog wrapper
    Chef provisioning to spawn a cluster

  • Future

    C* cookbook: michaelklishin/cassandra-chef-cookbook + Teads custom wrapper
    Terraform + Chef provisioner

  • IV. Monitoring & alerting

  • Past

    OpsCenter (free)

  • Turnkey dashboards

    Support is reactive
    Main metrics only
    Per-host graphs impossible with many hosts

  • Ring view

    More than monitoring
    Lots of metrics
    Still lacks some metrics
    Dashboard creation: no templates
    Agent is heavy
    Free version limitations: data stored in the production cluster (Apache C*)

  • All metrics you want

    Dashboard creation: templating, TimeBoard vs ScreenBoard
    Graph creation: aggregation, trend, rate, anomaly detection
    No turnkey dashboards yet; may change: TLP templates
    Additional fees if >350 metrics; we need to increase this limit for our use case

  • Now we can easily:

    Find outliers
    Compare a node to the average
    Compare two DCs
    Explore a node's metrics
    Create overview dashboards
    Create advanced dashboards for troubleshooting

  • Datadog's cassandra.yaml

    - include:
        bean_regex: org.apache.cassandra.metrics:type=ReadRepair,name=*
        attribute:
          - Count

    - include:
        bean_regex: org.apache.cassandra.metrics:type=CommitLog,name=(WaitingOnCommit|WaitingOnSegmentAllocation)
        attribute:
          - Count
          - 99thPercentile

    - include:
        bean: org.apache.cassandra.metrics:type=CommitLog,name=TotalCommitLogSize

    - include:
        bean: org.apache.cassandra.metrics:type=ThreadPools,path=transport,scope=Native-Transport-Requests,name=MaxTasksQueued
        attribute:
          Value:
            alias: cassandra.ntr.MaxTasksQueued

  • ScreenBoard

  • TimeBoard

    (per-host series: alpha, beta, gamma, delta, epsilon, zeta, eta)

  • Example

    Hints monitoring during maintenance on physical nodes

    Storage

    Streaming

  • Datadog alerting

    Down node
    Exceptions
    Commitlog size
    High latency
    High GC
    High I/O wait
    High pendings
    Many hints
    Long thrift connections
    Clock out of sync
    Disk space: don't miss this one, and don't forget /
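
    Alerts like these map onto Datadog monitors. Purely as an illustrative sketch (the metric, threshold, tags and handles are assumptions, not Teads' actual monitor definitions), the disk space alert could be created through the Datadog API client:

    # Hypothetical sketch: a disk space monitor via the official `datadog`
    # Python client. Metric, threshold, tags and handles are illustrative.
    from datadog import initialize, api

    initialize(api_key="...", app_key="...")

    api.Monitor.create(
        type="metric alert",
        query="avg(last_5m):avg:system.disk.in_use{role:cassandra} by {host,device} > 0.85",
        name="[C*] Disk space usage high",
        message="Disk usage above 85% on {{host.name}} ({{device.name}}). Don't forget /.",
        tags=["service:cassandra"],
    )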

  • V. Tuning

  • Java 8, CMS -> G1

    cassandra-env.sh:
    -Dcassandra.max_queued_native_transport_requests=4096
    -Dcassandra.fd_initial_value_ms=4000
    -Dcassandra.fd_max_interval_ms=4000

  • jvm.options (backport from C* 3.0)

    GC logs enabled

    -XX:MaxGCPauseMillis=200
    -XX:G1RSetUpdatingPauseTimePercent=5
    -XX:G1HeapRegionSize=32m
    -XX:G1HeapWastePercent=25
    -XX:InitiatingHeapOccupancyPercent=?
    -XX:ParallelGCThreads=#CPU
    -XX:ConcGCThreads=#CPU
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+ParallelRefProcEnabled
    -XX:+UseCompressedOops
    -XX:HeapDumpPath=
    -XX:ErrorFile=
    -Djava.io.tmpdir=
    -XX:-UseBiasedLocking
    -XX:+UseTLAB
    -XX:+ResizeTLAB
    -XX:+PerfDisableSharedMem
    -XX:+AlwaysPreTouch
    ...

  • AWS nodes

    num_tokens: 256
    native_transport_max_threads: 256 or 128
    compaction_throughput_mb_per_sec: 64
    concurrent_compactors: 4 or 2
    concurrent_reads: 64
    concurrent_writes: 128 or 64
    concurrent_counter_writes: 128
    hinted_handoff_throttle_in_kb: 10240
    max_hints_delivery_threads: 6 or 4
    memtable_cleanup_threshold: 0.6, 0.5 or 0.4
    memtable_flush_writers: 4 or 2
    trickle_fsync: true
    trickle_fsync_interval_in_kb: 10240
    dynamic_snitch_badness_threshold: 2.0
    internode_compression: dc

    Heap: c3.4xlarge: 15 GB, i2.2xlarge: 24 GB

  • AWS nodes (EBS)

    EBS volume != disk

    compaction_throughput_mb_per_sec: 32
    concurrent_compactors: 4
    concurrent_reads: 32
    concurrent_writes: 64
    concurrent_counter_writes: 64
    trickle_fsync_interval_in_kb: 1024

    Heap: c4.4xlarge: 15 GB

  • Hardware nodes

    num_tokens: 8
    initial_token: ...
    native_transport_max_threads: 512
    compaction_throughput_mb_per_sec: 128
    concurrent_compactors: 4
    concurrent_reads: 64
    concurrent_writes: 128
    concurrent_counter_writes: 128
    hinted_handoff_throttle_in_kb: 10240
    max_hints_delivery_threads: 6
    memtable_cleanup_threshold: 0.4
    memtable_flush_writers: 8
    trickle_fsync: true
    trickle_fsync_interval_in_kb: 10240

    More on this later
    Heap: 24 GB

  • Hardware nodes: why 8 tokens?

    Better repair performance, important for Big Data
    Evenly distributed tokens, stored in a Chef data bag

    ./vnodes_token_generator.py --json --indent 2 --servers hosts_interleaved_racks.txt 4
    {
      "192.168.1.1": "-9223372036854775808,-4611686018427387905,-2,4611686018427387901",
      "192.168.2.1": "-7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202",
      "192.168.3.1": "-6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503"
    }

    https://github.com/rhardouin/cassandra-scripts

    Watch out! Know the drawbacks
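
    The evenly distributed tokens above follow a simple pattern: the Murmur3 ring is cut into nodes x tokens-per-node equal slices, dealt out round-robin so consecutive ranges land on different hosts (interleaved racks). A minimal sketch of that idea in Python; exact rounding may differ slightly from vnodes_token_generator.py:

    # Minimal sketch: evenly spaced vnode tokens for the Murmur3 partitioner.
    # The ring is split into len(nodes) * tokens_per_node equal slices and
    # assigned round-robin so consecutive ranges interleave hosts/racks.
    RING_MIN = -2**63
    RING_SIZE = 2**64

    def generate_tokens(nodes, tokens_per_node):
        total = len(nodes) * tokens_per_node
        return {
            node: [RING_MIN + (i * len(nodes) + n) * RING_SIZE // total
                   for i in range(tokens_per_node)]
            for n, node in enumerate(nodes)
        }

    if __name__ == "__main__":
        for node, tokens in generate_tokens(["192.168.1.1", "192.168.2.1", "192.168.3.1"], 4).items():
            print(node, ",".join(str(t) for t in tokens))

    The output feeds each node's initial_token setting and is kept in a Chef data bag.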

  • Compression

    Small entries, lots of reads

    compression = {'chunk_length_kb': '4', 'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    + nodetool scrub (few GB)
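
    Assuming the pre-3.0 option names shown above (this is C* 2.1), the change could be applied to an existing table roughly as sketched below; keyspace, table name and contact point are hypothetical. nodetool scrub then rewrites the existing SSTables, cheap here since they only total a few GB.

    # Hypothetical sketch: switching an existing table to 4 KB compression
    # chunks with the DataStax Python driver (names are illustrative).
    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect()
    session.execute("""
        ALTER TABLE ks.small_entries
        WITH compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor',
                            'chunk_length_kb': '4'}
    """)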

  • Dynamic Snitch

    Disabled on 2 small clusters: dynamic_snitch: false
    Lower hop count

  • Dynamic Snitch

    Client-side latency (P95, P75, mean)

  • Downscale

    Which node to decommission?

  • Clients

  • Scala apps: DataStax driver wrapper

    Spark & Spark Streaming: DataStax Spark Cassandra Connector

  • DataStax driver policy

    LatencyAwarePolicy, TokenAwarePolicy

    LatencyAwarePolicy: hotspots due to premature node eviction
    Needs thorough tuning and a steady workload
    We dropped it

    TokenAwarePolicy: shuffle replicas depending on CL
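
    Teads' wrapper sits on the DataStax Java driver, but the same composition exists in the DataStax Python driver. Purely as an illustration (contact points and DC name are assumptions), token awareness without LatencyAwarePolicy looks like this:

    # Illustrative sketch with the DataStax Python driver (cassandra-driver):
    # TokenAwarePolicy wrapping a DC-aware round-robin policy, with no
    # LatencyAwarePolicy in the chain. Contact points and DC name are made up.
    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    cluster = Cluster(
        contact_points=["10.0.0.1", "10.0.0.2"],
        load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="eu")),
    )
    session = cluster.connect()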

  • For cross-region scheduled jobs

    VPN between AWS regions
    20 executors with 6 GB RAM

    output.consistency.level = (LOCAL_)ONE
    output.concurrent.writes = 50
    connection.compression = LZ4
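
    These are Spark Cassandra Connector properties (full names are prefixed with spark.cassandra.). The jobs themselves are Scala; for consistency with the other sketches, here are the same settings expressed with PySpark, assuming the connector package is on the classpath and a hypothetical contact point:

    # Sketch: the connector settings above as spark.cassandra.* properties.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .set("spark.cassandra.connection.host", "10.0.0.1")
            .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")
            .set("spark.cassandra.output.concurrent.writes", "50")
            .set("spark.cassandra.connection.compression", "LZ4"))

    spark = SparkSession.builder.config(conf=conf).getOrCreate()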

  • Useless writes: 99% of empty unlogged batches on one DC

    What an optimization!

  • VI. Tools

  • Job Scheduler & Runbook Automation

    {Parallel SSH + cron} on steroids
    Security
    History: who/what/when/why
    Output is kept
    We added a comment field

    CQL migration
    Rolling restart
    Nodetool or JMX commands
    Backup and snapshot jobs

  • Scheduled range repair

    Segments: up to 20,000 for TB tables
    Hosted fork for C* 2.1; we will probably switch to TLP's fork
    We do not use incremental repairs (see fix in C* 4.0)

  • cassandra_snapshotter

    Backup on S3
    Scheduled with Rundeck

    We created and use a fork
    Some PRs merged upstream
    Restore PR still to be merged

  • Logs management

    "C*" and "out of sync"
    "C*" and "new sessi