Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring C* Health @ Scale
Jason Cacciatore
Netflix Scale
Hundreds of clusters
Thousands of nodes
How do we assess health?
• Node level
– dmesg errors
– gossip status
– thresholds of system metrics (disk usage, heap, etc.)
• Cluster level
– Rely on C*’s view of its own health (nodetool ring)
– AWS cache as a secondary source of truth
Common Approach
CRON System
[Diagram: a CRON system dispatching multiple Job Runners]
Common Scenario
Node disappears
[Timeline: T0 → T1 → T2 → T3 → T4]
Problems inherent in polling
● Point-in-time snapshot, no state
● Establishing a connection to a cluster when it’s under heavy load is problematic
● Not resilient to network hiccups, especially for large clusters
A different approach
What if we had a continuous stream of fine-grained snapshots?
Mantis Streaming System
Stream processing system built on Apache Mesos
– Provides a flexible programming model
– Models computation as a distributed DAG
– Designed for high throughput and low latency
– Open-source release coming soon
Streaming micro-services
[Diagram: multiple Mantis jobs, each a Source → Stage → Sink pipeline]
Source – input, handles backpressure
Stage – business logic
Sink – output, handles backpressure
[Diagram: Source → Stage → Sink together form a Mantis Job]
Mantis Programming Model
• ReactiveX observable sequences
• Source, Stage, and Sink together form an observable chain (which only emits data when subscribed to)
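The lazy, subscription-driven behavior described above can be sketched in plain Java. This is an illustrative toy, not Mantis or RxJava — the class and method names (`MiniObservable`, `create`, `map`, `subscribe`) are hypothetical stand-ins for the real ReactiveX API:

```java
import java.util.function.Consumer;
import java.util.function.Function;

// Minimal sketch of a cold observable chain: nothing is emitted until
// subscribe() is called. Hypothetical names; RxJava's Observable is far richer.
public class MiniObservable<T> {
    // The "source": code that pushes items to a subscriber, on demand.
    private final Consumer<Consumer<T>> onSubscribe;

    private MiniObservable(Consumer<Consumer<T>> onSubscribe) {
        this.onSubscribe = onSubscribe;
    }

    public static <T> MiniObservable<T> create(Consumer<Consumer<T>> onSubscribe) {
        return new MiniObservable<>(onSubscribe);
    }

    // "Stage": transform each emitted item. Lazy -- this only wraps the source.
    public <R> MiniObservable<R> map(Function<T, R> f) {
        return new MiniObservable<>(subscriber ->
                onSubscribe.accept(item -> subscriber.accept(f.apply(item))));
    }

    // "Sink": only now does the chain actually emit data.
    public void subscribe(Consumer<T> subscriber) {
        onSubscribe.accept(subscriber);
    }

    public static void main(String[] args) {
        MiniObservable<Integer> source =
                MiniObservable.create(sub -> { sub.accept(1); sub.accept(2); });
        // Building the chain emits nothing yet:
        MiniObservable<String> chain = source.map(i -> "score-" + i);
        // Subscribing triggers emission through the whole chain:
        chain.subscribe(System.out::println); // prints score-1 then score-2
    }
}
```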
Health Check using Mantis
[Diagram: Source Jobs in us-east-1, us-west-2, and eu-west-1 each feed a Local Ring Aggregator in their region; the Local Ring Aggregators feed a single Global Ring Aggregator]
How much data?
● Each node sends data every 20 seconds
● Payload size depends on cluster size
● ~6 MB/s total across east, west, and eu sent to the Local Ring Aggregators
● ~600 Kbps processed by the Global Aggregator
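As a back-of-envelope check on those numbers, the per-node payload follows from the aggregate rate and the 20-second send interval. The fleet size used here (3,000 nodes) is a hypothetical assumption — the talk only says "thousands of nodes":

```java
// Back-of-envelope check of the quoted throughput numbers.
// Assumption: ~3,000 nodes (the talk only says "thousands").
public class ThroughputEstimate {
    // KB sent per node per message, given aggregate MB/s, send interval, fleet size.
    public static double perNodePayloadKB(double totalMBps, int intervalSec, int nodes) {
        double bytesPerInterval = totalMBps * 1_000_000 * intervalSec;
        return bytesPerInterval / nodes / 1_000;
    }

    public static void main(String[] args) {
        // ~6 MB/s aggregate, one message per node every 20 s, ~3,000 nodes
        System.out.printf("~%.0f KB per node per message%n",
                perNodePayloadKB(6.0, 20, 3_000)); // ~40 KB
    }
}
```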
Local Ring Aggregator
• Stateless
• Single instance per region
• Groups data by C* cluster and scores it
Local Ring Aggregator (cont)

@Override
public Observable<String> call(Context context, Observable<MantisServerSentEvent> o) {
    ...
    return
        ...
        // keep only well-formed node messages
        .filter(this::isValid)
        .map(NodeRingMessage::filterByOwnership)
        // buffer into time windows (or 5000 messages, whichever comes first)
        .buffer(config.getWindowInMillis(), TimeUnit.MILLISECONDS, 5000)
        // score each buffered window against the AWS view of the cluster
        .map((nodeRingMessageList) -> new AggregatedView(nodeRingMessageList, config.getAWSClient()).score())
        .flatMap((score) -> score)
        .map(gson::toJson);
}
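The scoring step elided inside `AggregatedView.score()` above is not shown in the talk. A minimal sketch of the "group by C* cluster and score" idea, with entirely hypothetical names (`NodeReport`, a health-fraction score), might look like:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of what a Local Ring Aggregator might do with one buffered window:
// group node reports by Cassandra cluster, then score each cluster by the
// fraction of nodes reporting healthy. All names here are hypothetical --
// the actual AggregatedView.score() logic is not shown in the talk.
public class LocalRingAggregatorSketch {
    record NodeReport(String cluster, String node, boolean healthy) {}

    // One score per cluster for the window: healthy nodes / reporting nodes.
    public static Map<String, Double> score(List<NodeReport> window) {
        return window.stream().collect(Collectors.groupingBy(
                NodeReport::cluster,
                Collectors.averagingDouble(r -> r.healthy() ? 1.0 : 0.0)));
    }

    public static void main(String[] args) {
        List<NodeReport> window = List.of(
                new NodeReport("cass_a", "n1", true),
                new NodeReport("cass_a", "n2", false),
                new NodeReport("cass_b", "n1", true));
        // cass_a -> 0.5 (one of two nodes healthy), cass_b -> 1.0
        System.out.println(score(window));
    }
}
```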
Anatomy of a Score
● Evidence – aggregate of all data points gathered from all nodes
● AWS view – each instance in the cluster
● Cluster metadata (token-to-IP mapping, name, etc.)
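The three parts listed above suggest a shape for a Score record. The field names and types below are hypothetical — the talk names the parts but does not show the actual class:

```java
import java.util.List;
import java.util.Map;

// Hypothetical shape of a Score, following the three parts the slide lists:
// evidence, the AWS view, and cluster metadata. Field names/types assumed.
public class ScoreSketch {
    record Score(
            Map<String, Object> evidence,   // aggregate of data points from all nodes
            List<String> awsInstances,      // AWS view: each instance in the cluster
            Map<String, String> tokenToIp,  // cluster metadata: token -> IP mapping
            String clusterName) {}          // cluster metadata: name

    public static void main(String[] args) {
        Score s = new Score(
                Map.of("gossipUp", 11, "gossipDown", 1),
                List.of("i-0abc", "i-0def"),
                Map.of("-9223372036854775808", "10.0.0.1"),
                "cass_example");
        System.out.println(s.clusterName() + " evidence=" + s.evidence());
    }
}
```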
Global Ring Aggregator
[Diagram: individual Scores collected over a time window and grouped into a GroupedScore]
Global Ring Aggregator (cont)
[Diagram: the stream of Scores is transformed and reduced into ClusterHealth; the Sink feeds the Cluster Health Evaluator, a Real-Time Dashboard, and a Remediation System]
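The reduce step — collapsing a window of per-region scores into a single ClusterHealth verdict — can be sketched as below. The worst-score rule and the 0.9 threshold are assumptions for illustration, not the talk's actual logic:

```java
import java.util.List;

// Sketch of the Global Ring Aggregator's reduce step: per-region scores for
// one cluster, collected in a window, reduced to a single health verdict.
// The "take the worst region" rule and the 0.9 threshold are assumptions.
public class GlobalRingAggregatorSketch {
    record RegionScore(String region, double score) {}
    record ClusterHealth(String cluster, double worstScore, boolean healthy) {}

    public static ClusterHealth reduce(String cluster, List<RegionScore> window) {
        double worst = window.stream()
                .mapToDouble(RegionScore::score)
                .min()
                .orElse(0.0); // no data at all counts as unhealthy
        return new ClusterHealth(cluster, worst, worst >= 0.9);
    }

    public static void main(String[] args) {
        ClusterHealth h = reduce("cass_example", List.of(
                new RegionScore("us-east-1", 1.0),
                new RegionScore("us-west-2", 0.95),
                new RegionScore("eu-west-1", 0.8)));
        // eu-west-1 drags the worst score to 0.8, below the assumed threshold
        System.out.println(h);
    }
}
```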
Common Scenario revisited
Cluster Health Evaluator
T0: node disappears
T1: Score → FSM [ Start ]
T2: Score → FSM [ Node Gone ]
T3: Score → FSM [ Wait for Signal ]
T4: <Not tracked> → FSM [ Wait for Signal ]
<Not tracked> → FSM [ Done ]
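The state machine walked through above can be sketched in plain Java. The states come from the slides; the exact transition rules (what moves the FSM between them) are assumptions for illustration:

```java
// Sketch of the Cluster Health Evaluator's per-node state machine for the
// "node disappears" scenario. States follow the slides; the transition
// conditions below are assumptions, not Netflix's actual rules.
public class NodeGoneFsm {
    enum State { START, NODE_GONE, WAIT_FOR_SIGNAL, DONE }

    private State state = State.START;

    // Feed one window's observation: was the node present in the scores?
    public State onObservation(boolean nodePresent) {
        switch (state) {
            case START -> { if (!nodePresent) state = State.NODE_GONE; }
            case NODE_GONE -> state = State.WAIT_FOR_SIGNAL;  // confirmed missing; await remediation signal
            case WAIT_FOR_SIGNAL -> { if (!nodePresent) state = State.DONE; } // stayed gone: hand off
            case DONE -> { } // terminal
        }
        return state;
    }

    public static void main(String[] args) {
        NodeGoneFsm fsm = new NodeGoneFsm();
        fsm.onObservation(true);   // T1: node present      -> START
        fsm.onObservation(false);  // T2: node missing      -> NODE_GONE
        fsm.onObservation(false);  // T3:                   -> WAIT_FOR_SIGNAL
        System.out.println(fsm.onObservation(false)); // T4: -> DONE
    }
}
```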
That’s great, but...
Now the health of the fleet is encapsulated in a single data stream, so how do we make sense of that?
Real Time Dash (Macro View)
Macro View of the fleet
Real Time Dash (Cluster View)
Real Time Dash (Perspective)
Benefits
● Faster detection of issues
● Greater accuracy
● Massive reduction in false positives
● Operational – hosted by Mantis infrastructure
● Separation of concerns (decouples detection from remediation)
Monitoring the Monitor
● Mantis SLA
● Bytes read + written
● Incoming message count
● Sink processed + dropped counts
● CPU, memory
Questions?