Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring C* Health @ Scale
Jason Cacciatore
Netflix Scale
Hundreds of clusters
Thousands of nodes
How do we assess health?
• Node level
– dmesg errors
– gossip status
– thresholds of system metrics (disk usage, heap, etc.)
• Cluster level
– Rely on C*’s view of its own health (nodetool ring)
– AWS cache as a secondary source of truth
Common Approach
CRON System
[Diagram: a CRON system dispatching multiple Job Runners]
Common Scenario
Node disappears
[Timeline: T0 → T1 → T2 → T3 → T4]
Problems inherent in polling
● Point-in-time snapshot, no state
● Establishing a connection to a cluster when it’s under heavy load is problematic
● Not resilient to network hiccups, especially for large clusters
A different approach
What if we had a continuous stream of fine-grained snapshots?
Mantis Streaming System
Stream processing system built on Apache Mesos
– Provides a flexible programming model
– Models computation as a distributed DAG
– Designed for high throughput and low latency
– Open-source release coming soon
Streaming micro-services
[Diagram: multiple Mantis jobs, each a Source → Stage → Sink pipeline]
Source – input, handles backpressure
Stage – business logic
Sink – output, handles backpressure
[Diagram: Source → Stage → Sink together form a Mantis Job]
Mantis Programming Model
• ReactiveX observable sequences
• Source, Stage, and Sink together form an observable chain (which only emits data when subscribed to)
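The lazy, subscription-driven behavior described above can be sketched in plain Java. This is an illustrative toy, not Mantis or RxJava — the class and method names (`MiniObservable`, `create`, `map`, `subscribe`) are hypothetical stand-ins for the real ReactiveX API:

```java
import java.util.function.Consumer;
import java.util.function.Function;

// Minimal sketch of a cold observable chain: nothing is emitted until
// subscribe() is called. Hypothetical names; RxJava's Observable is far richer.
public class MiniObservable<T> {
    // The "source": code that pushes items to a subscriber, on demand.
    private final Consumer<Consumer<T>> onSubscribe;

    private MiniObservable(Consumer<Consumer<T>> onSubscribe) {
        this.onSubscribe = onSubscribe;
    }

    public static <T> MiniObservable<T> create(Consumer<Consumer<T>> onSubscribe) {
        return new MiniObservable<>(onSubscribe);
    }

    // "Stage": transform each emitted item. Lazy -- this only wraps the source.
    public <R> MiniObservable<R> map(Function<T, R> f) {
        return new MiniObservable<>(subscriber ->
                onSubscribe.accept(item -> subscriber.accept(f.apply(item))));
    }

    // "Sink": only now does the chain actually emit data.
    public void subscribe(Consumer<T> subscriber) {
        onSubscribe.accept(subscriber);
    }

    public static void main(String[] args) {
        MiniObservable<Integer> source =
                MiniObservable.create(sub -> { sub.accept(1); sub.accept(2); });
        // Building the chain emits nothing yet:
        MiniObservable<String> chain = source.map(i -> "score-" + i);
        // Subscribing triggers emission through the whole chain:
        chain.subscribe(System.out::println); // prints score-1 then score-2
    }
}
```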
Health Check using Mantis
[Diagram: Source Jobs in us-east-1, us-west-2, and eu-west-1 each feed a Local Ring Aggregator in their region; the Local Ring Aggregators feed a single Global Ring Aggregator]
How much data?
● Each node sends data every 20 seconds
● Payload size depends on cluster size
● ~6 MB/s total across east, west, and eu sent to the Local Ring Aggregators
● ~600 Kbps processed by the Global Aggregator
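As a back-of-envelope check on those numbers, the per-node payload follows from the aggregate rate and the 20-second send interval. The fleet size used here (3,000 nodes) is a hypothetical assumption — the talk only says "thousands of nodes":

```java
// Back-of-envelope check of the quoted throughput numbers.
// Assumption: ~3,000 nodes (the talk only says "thousands").
public class ThroughputEstimate {
    // KB sent per node per message, given aggregate MB/s, send interval, fleet size.
    public static double perNodePayloadKB(double totalMBps, int intervalSec, int nodes) {
        double bytesPerInterval = totalMBps * 1_000_000 * intervalSec;
        return bytesPerInterval / nodes / 1_000;
    }

    public static void main(String[] args) {
        // ~6 MB/s aggregate, one message per node every 20 s, ~3,000 nodes
        System.out.printf("~%.0f KB per node per message%n",
                perNodePayloadKB(6.0, 20, 3_000)); // ~40 KB
    }
}
```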
Local Ring Aggregator
• Stateless
• Single instance per region
• Groups data by C* cluster and scores it
Local Ring Aggregator (cont)

@Override
public Observable<String> call(Context context, Observable<MantisServerSentEvent> o) {
    ...
    return
        ...
        // keep only well-formed node messages
        .filter(this::isValid)
        .map(NodeRingMessage::filterByOwnership)
        // buffer into time windows (or 5000 messages, whichever comes first)
        .buffer(config.getWindowInMillis(), TimeUnit.MILLISECONDS, 5000)
        // score each buffered window against the AWS view of the cluster
        .map((nodeRingMessageList) -> new AggregatedView(nodeRingMessageList, config.getAWSClient()).score())
        .flatMap((score) -> score)
        .map(gson::toJson);
}
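The scoring step elided inside `AggregatedView.score()` above is not shown in the talk. A minimal sketch of the "group by C* cluster and score" idea, with entirely hypothetical names (`NodeReport`, a health-fraction score), might look like:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of what a Local Ring Aggregator might do with one buffered window:
// group node reports by Cassandra cluster, then score each cluster by the
// fraction of nodes reporting healthy. All names here are hypothetical --
// the actual AggregatedView.score() logic is not shown in the talk.
public class LocalRingAggregatorSketch {
    record NodeReport(String cluster, String node, boolean healthy) {}

    // One score per cluster for the window: healthy nodes / reporting nodes.
    public static Map<String, Double> score(List<NodeReport> window) {
        return window.stream().collect(Collectors.groupingBy(
                NodeReport::cluster,
                Collectors.averagingDouble(r -> r.healthy() ? 1.0 : 0.0)));
    }

    public static void main(String[] args) {
        List<NodeReport> window = List.of(
                new NodeReport("cass_a", "n1", true),
                new NodeReport("cass_a", "n2", false),
                new NodeReport("cass_b", "n1", true));
        // cass_a -> 0.5 (one of two nodes healthy), cass_b -> 1.0
        System.out.println(score(window));
    }
}
```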
Anatomy of a Score
● Evidence – aggregate of all data points gathered from all nodes
● AWS view – each instance in the cluster
● Cluster metadata (token-to-IP mapping, name, etc.)
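The three parts listed above suggest a shape for a Score record. The field names and types below are hypothetical — the talk names the parts but does not show the actual class:

```java
import java.util.List;
import java.util.Map;

// Hypothetical shape of a Score, following the three parts the slide lists:
// evidence, the AWS view, and cluster metadata. Field names/types assumed.
public class ScoreSketch {
    record Score(
            Map<String, Object> evidence,   // aggregate of data points from all nodes
            List<String> awsInstances,      // AWS view: each instance in the cluster
            Map<String, String> tokenToIp,  // cluster metadata: token -> IP mapping
            String clusterName) {}          // cluster metadata: name

    public static void main(String[] args) {
        Score s = new Score(
                Map.of("gossipUp", 11, "gossipDown", 1),
                List.of("i-0abc", "i-0def"),
                Map.of("-9223372036854775808", "10.0.0.1"),
                "cass_example");
        System.out.println(s.clusterName() + " evidence=" + s.evidence());
    }
}
```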
Global Ring Aggregator
[Diagram: individual Scores collected over a time window and grouped into a GroupedScore]
Global Ring Aggregator (cont)
[Diagram: the stream of Scores is transformed and reduced into ClusterHealth; the Sink feeds the Cluster Health Evaluator, a Real-Time Dashboard, and a Remediation System]
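The reduce step — collapsing a window of per-region scores into a single ClusterHealth verdict — can be sketched as below. The worst-score rule and the 0.9 threshold are assumptions for illustration, not the talk's actual logic:

```java
import java.util.List;

// Sketch of the Global Ring Aggregator's reduce step: per-region scores for
// one cluster, collected in a window, reduced to a single health verdict.
// The "take the worst region" rule and the 0.9 threshold are assumptions.
public class GlobalRingAggregatorSketch {
    record RegionScore(String region, double score) {}
    record ClusterHealth(String cluster, double worstScore, boolean healthy) {}

    public static ClusterHealth reduce(String cluster, List<RegionScore> window) {
        double worst = window.stream()
                .mapToDouble(RegionScore::score)
                .min()
                .orElse(0.0); // no data at all counts as unhealthy
        return new ClusterHealth(cluster, worst, worst >= 0.9);
    }

    public static void main(String[] args) {
        ClusterHealth h = reduce("cass_example", List.of(
                new RegionScore("us-east-1", 1.0),
                new RegionScore("us-west-2", 0.95),
                new RegionScore("eu-west-1", 0.8)));
        // eu-west-1 drags the worst score to 0.8, below the assumed threshold
        System.out.println(h);
    }
}
```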
Common Scenario revisited
Cluster Health Evaluator
T0: node disappears
T1: Score → FSM [ Start ]
T2: Score → FSM [ Node Gone ]
T3: Score → FSM [ Wait for Signal ]
T4: <Not tracked> → FSM [ Wait for Signal ]
<Not tracked> → FSM [ Done ]
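The state machine walked through above can be sketched in plain Java. The states come from the slides; the exact transition rules (what moves the FSM between them) are assumptions for illustration:

```java
// Sketch of the Cluster Health Evaluator's per-node state machine for the
// "node disappears" scenario. States follow the slides; the transition
// conditions below are assumptions, not Netflix's actual rules.
public class NodeGoneFsm {
    enum State { START, NODE_GONE, WAIT_FOR_SIGNAL, DONE }

    private State state = State.START;

    // Feed one window's observation: was the node present in the scores?
    public State onObservation(boolean nodePresent) {
        switch (state) {
            case START -> { if (!nodePresent) state = State.NODE_GONE; }
            case NODE_GONE -> state = State.WAIT_FOR_SIGNAL;  // confirmed missing; await remediation signal
            case WAIT_FOR_SIGNAL -> { if (!nodePresent) state = State.DONE; } // stayed gone: hand off
            case DONE -> { } // terminal
        }
        return state;
    }

    public static void main(String[] args) {
        NodeGoneFsm fsm = new NodeGoneFsm();
        fsm.onObservation(true);   // T1: node present      -> START
        fsm.onObservation(false);  // T2: node missing      -> NODE_GONE
        fsm.onObservation(false);  // T3:                   -> WAIT_FOR_SIGNAL
        System.out.println(fsm.onObservation(false)); // T4: -> DONE
    }
}
```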
That’s great, but...
Now the health of the fleet is encapsulated in a single data stream, so how do we make sense of that?
Real Time Dash (Macro View)
Macro View of the fleet
Real Time Dash (Cluster View)
Real Time Dash (Perspective)
Benefits
● Faster detection of issues
● Greater accuracy
● Massive reduction in false positives
● Operational – hosted by Mantis infrastructure
● Separation of concerns (decouples detection from remediation)
Monitoring the Monitor
● Mantis SLA
● Bytes read + written
● Incoming message count
● Sink processed + dropped counts
● CPU, memory
Questions?