Agenda - 24b4dt1v60e526bo2p349l4c-wpengine.netdna-ssl.com · Monitoring Cassandra (tpstats) •...


Agenda


Emergency Response


Managing Incidents

• Prioritize. Stop the bleeding, restore service, and preserve the evidence for root-causing.
• Prepare. Develop and document your incident management procedures in advance, in consultation with incident participants.
• Trust. Give full autonomy within the assigned role to all incident participants.
• Introspect. Pay attention to your emotional state while responding to an incident. If you start to feel panicky or overwhelmed, solicit more support.
• Consider alternatives. Periodically consider your options and re-evaluate whether it still makes sense to continue what you’re doing or whether you should be taking another tack in incident response.
• Practice. Use the process routinely so it becomes second nature.
• Change it around. Were you incident commander last time? Take on a different role this time. Encourage every team member to acquire familiarity with each role.


Troubleshooting


Troubleshooting - Report & Triage


Troubleshooting - Examine


Troubleshooting - Diagnose


Troubleshooting - Test / Treat


A Cassandra example


A Cassandra example - Stabilize cluster


Monitoring Cassandra (tpstats)

Pool Name                     Active  Pending  Completed  Blocked  All time blocked
ReadStage                          0        0       1073        0                 0
MiscStage                          0        0          0        0                 0
CompactionExecutor                 2        2       1759        0                 0
MutationStage                    128   615267  118435822        0                 0
MemtableReclaimMemory              0        0        210        0                 0
PendingRangeCalculator             0        0         45        0                 0
GossipStage                        0        0      12390        0                 0
SecondaryIndexManagement           0        0          0        0                 0
HintsDispatcher                    1       22         10        0                 0
RequestResponseStage               1        5  519510274        0                 0
Native-Transport-Requests          1        0   38354372        0          21184990
ReadRepairStage                    0        0          1        0                 0
CounterMutationStage               0        0          0        0                 0
MigrationStage                     0        0         65        0                 0
MemtablePostFlush                  1        1        231        0                 0
PerDiskMemtableFlushWriter_0       1        1        210        0                 0
ValidationExecutor                 0        0          0        0                 0
Sampler                            0        0          0        0                 0
MemtableFlushWriter                1        1        210        0                 0
InternalResponseStage              0        0    2817415        0                 0
ViewMutationStage                  0        0          0        0                 0
AntiEntropyStage                   0        0          0        0                 0
CacheCleanupExecutor               0        0          0        0                 0
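The thread-pool figures above can also be scanned by a script rather than by eye. The following is a minimal sketch, assuming the output of `nodetool tpstats` has been captured as text; the names `PoolStats`, `parse_tpstats`, `backed_up` and the `pending_limit` of 100 are illustrative assumptions, not part of the original slides or of Cassandra itself.

```python
from typing import NamedTuple


class PoolStats(NamedTuple):
    name: str
    active: int
    pending: int
    completed: int
    blocked: int
    all_time_blocked: int


def parse_tpstats(text: str) -> list[PoolStats]:
    """Parse the thread-pool rows of captured `nodetool tpstats` output."""
    pools = []
    for line in text.splitlines():
        parts = line.split()
        # A pool row is one name followed by exactly five integer columns;
        # the header line splits into more fields and is skipped.
        if len(parts) == 6 and all(p.isdigit() for p in parts[1:]):
            pools.append(PoolStats(parts[0], *map(int, parts[1:])))
    return pools


def backed_up(pools, pending_limit=100):
    """Names of pools with a large backlog or any historical blocking.

    pending_limit is an assumed threshold; tune it for your cluster.
    """
    return [p.name for p in pools
            if p.pending > pending_limit or p.all_time_blocked > 0]


# A few rows copied from the slide.
SAMPLE = """\
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 1073 0 0
MutationStage 128 615267 118435822 0 0
HintsDispatcher 1 22 10 0 0
Native-Transport-Requests 1 0 38354372 0 21184990
"""

pools = parse_tpstats(SAMPLE)
print(backed_up(pools))  # ['MutationStage', 'Native-Transport-Requests']
```

On the slide's data this picks out MutationStage (615267 pending writes) and Native-Transport-Requests (21184990 all-time blocked).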


A Cassandra example - Stabilize cluster


Monitoring Cassandra (tpstats)

Message type       Dropped
RANGE_SLICE              0
READ_REPAIR             23
PAGED_RANGE              0
BINARY                   0
READ                 10434
MUTATION              4948
_TRACE                   0
REQUEST_RESPONSE         6
COUNTER_MUTATION         0
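The dropped-message counts can be ranked the same way. This is a rough sketch, assuming the "Message type / Dropped" section of the `nodetool tpstats` output is available as text; `parse_dropped` is a hypothetical helper name.

```python
def parse_dropped(text: str) -> dict[str, int]:
    """Map each message type to its dropped count."""
    dropped = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows are "<TYPE> <count>"; the header has three words and is skipped.
        if len(parts) == 2 and parts[1].isdigit():
            dropped[parts[0]] = int(parts[1])
    return dropped


# Values copied from the slide.
SAMPLE = """\
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 23
PAGED_RANGE 0
BINARY 0
READ 10434
MUTATION 4948
_TRACE 0
REQUEST_RESPONSE 6
COUNTER_MUTATION 0
"""

dropped = parse_dropped(SAMPLE)
# Keep only the types that actually dropped messages, worst first.
worst_first = sorted(
    ((name, count) for name, count in dropped.items() if count > 0),
    key=lambda item: item[1],
    reverse=True,
)
for name, count in worst_first:
    print(f"{name}: {count}")
```

These counters are cumulative since the node started, so in practice it is the change between two samples that shows whether messages are being dropped right now.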


A Cassandra example - Stabilize cluster


A Cassandra example


Monitoring Cassandra (status)

Datacenter: us-west
=============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns  Host ID                               Rack
UN  10.65.XX.XXX  108.77 GB  256     ?     e462bc9f-9df7-4342-b987-52a86d29c7f4  1a
UN  10.65.XX.XXX  116.28 GB  256     ?     93530c86-3cb3-4d4e-a005-9f02ed4c0b3a  1c
UN  10.65.XX.XXX  109.17 GB  256     ?     ab779176-1513-4849-8531-6ff39037e078  1a
UN  10.65.XX.XXX  103.1 GB   256     ?     cd112339-3224-4b8f-9be0-de26edb3a0d1  1a
UN  10.65.XX.XXX  111.45 GB  256     ?     3bfa406f-63f6-47e7-8798-6f650726ba23  1c
UN  10.65.XX.XXX  110.09 GB  256     ?     5b39c8c2-4896-48b5-940d-d48b12157acf  1a
UN  10.65.XX.XXX  105.18 GB  256     ?     467e03e4-0cdd-4088-b122-6b0e6848f7ed  1c
UN  10.65.XX.XXX  112.22 GB  256     ?     a48b999f-4473-4e85-83b2-1208fa63223c  1a
UN  10.65.XX.XXX  107.69 GB  256     ?     9e48a874-57ca-40df-8053-dfb141389c09  1a
UN  10.65.XX.XXX  109.21 GB  256     ?     cb20eaa4-ba95-452f-9ac0-5ff41010b702  1c
UN  10.65.XX.XXX  119.29 GB  256     ?     3cf1cd91-26ed-4057-b09b-9092c01e03ec  1c
UN  10.65.XX.XXX  109.08 GB  256     ?     d7aff1c4-0ace-46c2-b7db-a18f285fcdc4  1c
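A status listing like this one lends itself to an automated check for nodes that are not Up/Normal. The sketch below assumes the `nodetool status` output is captured as text in the column layout shown on the slide; `Node`, `ROW` and `parse_status` are illustrative names, and the regular expression is an assumption about that layout rather than an official format definition.

```python
import re
from collections import Counter
from typing import NamedTuple


class Node(NamedTuple):
    state: str    # two letters: Up/Down, then Normal/Leaving/Joining/Moving
    address: str
    load: float
    unit: str
    rack: str


# state, address, load value, load unit, tokens, owns, host ID, rack
ROW = re.compile(
    r"^(?P<state>[UD][NLJM])\s+(?P<address>\S+)\s+(?P<load>[\d.]+)\s+(?P<unit>\w+)\s+"
    r"\d+\s+\S+\s+[0-9a-f-]{36}\s+(?P<rack>\S+)$"
)


def parse_status(text: str) -> list[Node]:
    """Extract the per-node rows; header and legend lines do not match."""
    nodes = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            nodes.append(Node(m["state"], m["address"], float(m["load"]),
                              m["unit"], m["rack"]))
    return nodes


# Three rows copied from the slide.
SAMPLE = """\
--  Address       Load       Tokens  Owns  Host ID                               Rack
UN  10.65.XX.XXX  108.77 GB  256     ?     e462bc9f-9df7-4342-b987-52a86d29c7f4  1a
UN  10.65.XX.XXX  116.28 GB  256     ?     93530c86-3cb3-4d4e-a005-9f02ed4c0b3a  1c
UN  10.65.XX.XXX  103.1 GB   256     ?     cd112339-3224-4b8f-9be0-de26edb3a0d1  1a
"""

nodes = parse_status(SAMPLE)
not_normal = [n.address for n in nodes if n.state != "UN"]
racks = Counter(n.rack for n in nodes)
print("nodes not Up/Normal:", not_normal)
print("nodes per rack:", dict(racks))
```

Every node on the slide is `UN`, so `not_normal` comes back empty. A non-empty list is the condition worth alerting on.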


Monitoring Cassandra (Metrics)

Metric | Description | Frequency
**Node Status | Nodes DOWN should be investigated immediately. | Continuous, with alerting
**Client read latency | Latency per read query over your threshold. | Continuous, with alerting
**Client write latency | Latency per write query over your threshold. | Continuous, with alerting
CF read latency | Local CF read latency per read; useful if some CFs are particularly latency sensitive. | Continuous if required
Tombstones per read | A large number of tombstones per read indicates possible performance problems; compactions may not be keeping up or may require tuning. | Weekly checks
SSTables per read | A high number (>5) indicates data is spread across too many SSTables. | Weekly checks
**Pending compactions | Sustained pending compactions (>20) indicate compactions are not keeping up. This will have a performance impact. | Continuous, with alerting
Pending repairs | | Continuous, when running
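The two numeric thresholds in the table (more than 5 SSTables per read, sustained pending compactions above 20) can be expressed as a small check. This is only a sketch: how the samples are collected is left out, the function names are hypothetical, and "sustained" is interpreted here as every sample in the window being above the limit, which is an assumption.

```python
PENDING_COMPACTIONS_LIMIT = 20   # from the table: sustained > 20
SSTABLES_PER_READ_LIMIT = 5      # from the table: > 5


def sustained_above(samples, limit):
    """True if there is at least one sample and all samples exceed the limit."""
    return bool(samples) and all(s > limit for s in samples)


def health_warnings(pending_compactions, sstables_per_read):
    """Return warning strings for a window of pending-compaction samples
    and a single SSTables-per-read reading."""
    warnings = []
    if sustained_above(pending_compactions, PENDING_COMPACTIONS_LIMIT):
        warnings.append("pending compactions sustained above 20")
    if sstables_per_read > SSTABLES_PER_READ_LIMIT:
        warnings.append("SSTables per read above 5")
    return warnings


# Hypothetical readings, for illustration only.
print(health_warnings([25, 31, 40], 3))
print(health_warnings([25, 4, 40], 7))
```

Requiring every sample in the window to exceed the limit avoids alerting on a brief spike, which matches the table's emphasis on sustained backlog.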


Cluster Health Checks (Logs)

• WARN [Native-Transport-Requests:3683972] 2015-03-02 00:20:30,639 BatchStatement.java (line 223) Batch of prepared statements for [prod.network_traffic] is of size 195456, exceeding specified threshold of 5120 by 190336.
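Warnings of this shape can be pulled out of the system log with a regular expression. The following sketch assumes the message wording matches the line quoted above; the pattern and the `parse_batch_warning` helper are illustrative and may need adjusting for other Cassandra versions.

```python
import re

BATCH_WARNING = re.compile(
    r"Batch of prepared statements for \[(?P<tables>[^\]]+)\] is of size (?P<size>\d+), "
    r"exceeding specified threshold of (?P<threshold>\d+) by (?P<excess>\d+)"
)


def parse_batch_warning(line: str):
    """Return the tables, size, threshold and excess of an oversized batch
    warning, or None if the line is not such a warning."""
    m = BATCH_WARNING.search(line)
    if m is None:
        return None
    return {
        "tables": m["tables"],
        "size": int(m["size"]),
        "threshold": int(m["threshold"]),
        "excess": int(m["excess"]),
    }


# The log line from the slide.
LINE = (
    "WARN [Native-Transport-Requests:3683972] 2015-03-02 00:20:30,639 "
    "BatchStatement.java (line 223) Batch of prepared statements for "
    "[prod.network_traffic] is of size 195456, exceeding specified "
    "threshold of 5120 by 190336."
)

info = parse_batch_warning(LINE)
print(info["tables"], info["size"] // info["threshold"])  # prod.network_traffic 38
```

The batch in this example is roughly 38 times the configured threshold, which makes it a strong candidate for being split up on the client side.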


Cluster Health Checks


Backups


netstats

~ $ nodetool netstats
Mode: JOINING
Bootstrap 24b26bf0-bc05-11e6-a95a-5d59c4606c05
    /52.22.XXX.XXX (using /10.224.XXX.XXX)
    /52.22.XXX.XXX (using /10.224.XXX.XXX)
        Receiving 360 files, 40875561944 bytes total. Already received 1 files, 195299513 bytes total
            instametrics/events_raw_5m 195295154/278140764 bytes(70%) received from idx:0/52.22.XXX.XXX
            instametrics/host 4359/4359 bytes(100%) received from idx:0/52.22.XXX.XXX
    /52.55.XXX.XXX (using /10.224.130.6)
        Receiving 101 files, 34477437769 bytes total. Already received 4 files, 483917865 bytes total
            instametrics/events_raw_5m 4898307/4898307 bytes(100%) received from idx:0/52.55.XXX.XXX
            instametrics_rollup/events_rollup_3600 277979189/277979189 bytes(100%) received from idx:0/52.55.XXX.XXX
            instametrics_rollup/events_rollup_86400 1652187/1652187 bytes(100%) received from idx:0/52.55.XXX.XXX
            instametrics/host 3560/3560 bytes(100%) received from idx:0/52.55.XXX.XXX
            instametrics_rollup/events_rollup_300 199384622/11291788462 bytes(1%) received from idx:0/52.55.XXX.XXX
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a        20              0         0
Small messages                  n/a         1             50         0
Gossip messages                 n/a         0          69238         0
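While a node is bootstrapping, the "Receiving ... Already received ..." lines are enough to estimate overall streaming progress. The sketch below assumes the `nodetool netstats` text has been captured; `streaming_progress` is a hypothetical helper, and the regular expression reflects only the wording shown on the slide.

```python
import re

RECEIVING = re.compile(
    r"Receiving (\d+) files, (\d+) bytes total\. "
    r"Already received (\d+) files, (\d+) bytes total"
)


def streaming_progress(text: str) -> dict:
    """Sum file and byte counts over every 'Receiving' line."""
    totals = {"files": 0, "bytes": 0, "files_done": 0, "bytes_done": 0}
    for files, size, files_done, bytes_done in RECEIVING.findall(text):
        totals["files"] += int(files)
        totals["bytes"] += int(size)
        totals["files_done"] += int(files_done)
        totals["bytes_done"] += int(bytes_done)
    return totals


# The two summary lines from the slide.
SAMPLE = """\
Receiving 360 files, 40875561944 bytes total. Already received 1 files, 195299513 bytes total
Receiving 101 files, 34477437769 bytes total. Already received 4 files, 483917865 bytes total
"""

progress = streaming_progress(SAMPLE)
percent = 100 * progress["bytes_done"] / progress["bytes"]
print(f"{progress['files_done']}/{progress['files']} files, {percent:.2f}% of bytes received")
```

Sampling this periodically and comparing successive byte counts gives a rough streaming rate, which helps tell a slow bootstrap from a stalled one.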


Some final tips


How Instaclustr can help