How Cassandra Deletes Data (Alain Rodriguez, The Last Pickle) | Cassandra Summit 2016

Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License


• Tombstone issues

• Why tombstones

• Tombstone removal

Introduction

About The Last Pickle and Alain Rodriguez

About deletes in Cassandra

Deleted data in Cassandra does not just disappear; instead, a tombstone is added.

OK, so what’s the matter? Why this talk?

Tombstones are needed in Cassandra and are not an issue…

…until an SSTable or the result of a query looks like this…

Then we see the topic come up on the user mailing list or in other community tools.

So I thought I could share what I know about this topic.

thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

Tombstone issues

Tombstone issues: impacts

The read path: reading tombstones induces latencies, timeouts or exceptions.

The disk space: tombstones can fill up the disk, up to 100%.

I am facing one of these issues; is it caused by tombstones?

Tombstone issues: Read Path

grep -i -e "ERROR" -e "WARN" /var/log/cassandra/system.log

WARN [SharedPool-Worker-7] 2016-07-16 16:31:09,048 SliceQueryFilter.java:319 - Read 276 live and 1104 tombstone cells in mykeyspace.mytable for key: ItV9kZC8mFNiSvYM8AwufBU8tTtJkW5dUH5MNcq1H18 (see tombstone_warn_threshold). 500 columns were requested, slices=[-]

ERROR [ReadStage:290729] 2016-07-16 17:00:18,708 SliceQueryFilter.java (line 206) Scanned over 100000 tombstones in mykeyspace.mytable; query aborted (see tombstone_failure_threshold)

ERROR [ReadStage:290729] 2016-04-22 17:00:18,709 CassandraDaemon.java (line 258) Exception in thread Thread[ReadStage:290729,5,main]

java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
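The WARN line above can be picked apart mechanically, for instance to feed an alert. This is a minimal sketch, assuming the message keeps the exact "Read N live and M tombstone cells in keyspace.table" wording shown here (other Cassandra versions phrase it differently). `parse_tombstone_warning` is a hypothetical helper, not part of any Cassandra tool.

```python
import re

# Matches the wording of the WARN line shown above; other Cassandra
# versions phrase this message differently (assumption).
TOMBSTONE_WARN = re.compile(
    r"Read (?P<live>\d+) live and (?P<tombstones>\d+) tombstone cells "
    r"in (?P<table>[\w.]+)"
)

def parse_tombstone_warning(line):
    """Return (table, live, tombstones), or None if the line does not match."""
    match = TOMBSTONE_WARN.search(line)
    if match is None:
        return None
    return (
        match.group("table"),
        int(match.group("live")),
        int(match.group("tombstones")),
    )

# The WARN line from the slide, used as sample input.
sample = (
    "WARN [SharedPool-Worker-7] 2016-07-16 16:31:09,048 "
    "SliceQueryFilter.java:319 - Read 276 live and 1104 tombstone cells in "
    "mykeyspace.mytable for key: ItV9kZC8mFNiSvYM8AwufBU8tTtJkW5dUH5MNcq1H18 "
    "(see tombstone_warn_threshold). 500 columns were requested, slices=[-]"
)
```

Calling `parse_tombstone_warning(sample)` yields the table name with the live and tombstone counts, which is enough to compute a tombstone-to-live ratio per table.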

Tombstone issues: Read Path

tombstoneScannedHistogram metric

Available through nodetool cfstats, JMX, or a plugged-in monitoring tool (commercial or free) such as Datadog, Grafana, SPM, OpsCenter…

Tombstone issues: Disk space

The DroppableTombstoneRatio metric provides interesting information.

Available through the sstablemetadata tool, JMX and plugged-in monitoring tools such as Datadog, Grafana, SPM, OpsCenter, etc.

It is possible to write a script that checks the ratio for the biggest SSTables, for example.
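Such a script could look like the following. This is only a sketch: it assumes sstablemetadata prints a line of the form "Estimated droppable tombstones: <ratio>" (check the output of your version), and the helper names `droppable_ratio` and `rank_by_size` are made up for illustration.

```python
import re

# Assumed output format of sstablemetadata; verify against your version.
RATIO_LINE = re.compile(
    r"^Estimated droppable tombstones:\s*([0-9.eE+-]+)", re.MULTILINE
)

def droppable_ratio(metadata_text):
    """Extract the droppable tombstone ratio from sstablemetadata output text."""
    match = RATIO_LINE.search(metadata_text)
    return float(match.group(1)) if match else None

def rank_by_size(sstables, top=3):
    """sstables: list of (path, size_bytes, metadata_text) tuples.

    Returns (path, size_bytes, ratio) for the `top` biggest SSTables,
    biggest first.
    """
    biggest = sorted(sstables, key=lambda s: s[1], reverse=True)[:top]
    return [(path, size, droppable_ratio(text)) for path, size, text in biggest]
```

In practice the metadata text for each file would come from running the tool, for example with `subprocess.run(["sstablemetadata", path], capture_output=True, text=True).stdout`.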

Why tombstones? I want to remove data!

Why Tombstones: Cassandra write path

Diagram: write path inside a Cassandra node. A client write lands in the memtable (memory) and the commit log (disk). Memtables are flushed to SSTables on disk, which are immutable. Client reads consult these structures.

Why Tombstones: Distributed system

Cassandra is a distributed system

Distributed deletes are tricky!

Why Tombstones: Cassandra consistency & availability

Cassandra cluster: 4 nodes, RF = 3, Write CL = Quorum = 2, Read CL = Quorum = 2.

Strong consistency:
• The client writes “A”; none of the three replicas holds it yet.
• Two replicas store A and acknowledge; the third has not received it yet.
• The client reads at Quorum and gets “A”.

High availability:
• One of the three replicas is down.
• The two remaining replicas still acknowledge the client write “A” and serve the client read “A”.
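The numbers on these slides follow from simple arithmetic, sketched below. The function names are illustrative; the rule encoded is the usual one that a read is guaranteed to see the latest acknowledged write when read replicas + write replicas > RF.

```python
def quorum(replication_factor):
    """Quorum is a majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def overlaps(write_cl, read_cl, replication_factor):
    """True when every read set shares at least one replica with every write set."""
    return write_cl + read_cl > replication_factor

def tolerated_down(cl, replication_factor):
    """Replicas that may be down while the consistency level is still reachable."""
    return replication_factor - cl
```

With RF = 3, `quorum(3)` is 2, quorum writes and quorum reads overlap, and one replica can be down without failing either operation, which matches the "Down" node in the diagram.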

Why Tombstones: Distributed deletes

Same cluster: 4 nodes, RF = 3, Write CL = Quorum = 2, Read CL = Quorum = 2.

WITHOUT tombstones:
• All three replicas hold A.
• The client deletes “A”; two replicas remove it and acknowledge, the third still holds A.
• The client reads at Quorum and gets “A” (wrong) instead of “empty” (correct): a replica that holds nothing has no newer write to override the surviving copy.

WITH tombstones (A* = tombstone on A):
• All three replicas hold A.
• The client deletes “A”; two replicas write A* and acknowledge, the third still holds A.
• The client reads at Quorum and gets “A*”, meaning “empty” (correct).
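The two scenarios can be replayed with a toy model of read reconciliation. It assumes a simplified last-write-wins rule (the cell with the highest timestamp wins) and is not Cassandra's actual read path; `Cell` and `reconcile` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence

class Cell(NamedTuple):
    value: Optional[str]      # None for a tombstone
    timestamp: int
    tombstone: bool = False

def reconcile(responses: Sequence[Optional[Cell]]) -> Optional[str]:
    """Merge replica responses: the newest cell wins, a tombstone reads as empty."""
    cells = [c for c in responses if c is not None]
    if not cells:
        return None
    newest = max(cells, key=lambda c: c.timestamp)
    return None if newest.tombstone else newest.value

a = Cell("A", timestamp=1)
a_star = Cell(None, timestamp=2, tombstone=True)

# WITHOUT tombstones: one replica removed A outright, one never got the delete.
without_tombstones = reconcile([None, a])    # the stale "A" comes back

# WITH tombstones: the tombstone is newer than A, so it wins.
with_tombstones = reconcile([a_star, a])     # reads as empty
```

The point of the model is that "nothing" carries no timestamp, so it cannot win against old data, whereas a tombstone is itself a write and can.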

Cool story, but I really want to remove the data!

Tombstone removal!

When are tombstones removed?

When should tombstones be removed?
• Once the tombstone is fully replicated
• When the deleted data has been removed

When are tombstones actually removed?
• After gc_grace_seconds
• During compactions

…but only IF all the deleted data and the tombstone itself are involved in the same compaction.
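These conditions can be written down as a small predicate. This is a simplified sketch of the rule on the slide, not Cassandra's implementation; the parameter names are assumptions, and the overlap check is reduced to a single boolean.

```python
def can_drop_tombstone(deletion_time, gc_grace_seconds, now,
                       shadowed_data_outside_compaction):
    """Decide whether a compaction may drop a tombstone.

    deletion_time and now are epoch seconds.
    shadowed_data_outside_compaction is True when an SSTable that is NOT
    part of this compaction may still hold data covered by the tombstone.
    """
    past_grace = now >= deletion_time + gc_grace_seconds
    return past_grace and not shadowed_data_outside_compaction

TEN_DAYS = 864000  # Cassandra's default gc_grace_seconds, in seconds
```

A tombstone written at time 0 with the default grace period is therefore kept for ten days at the very least, and longer whenever the data it shadows sits in an SSTable the compaction does not touch.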

How tombstones are removed: Compaction!

Diagram: the same Cassandra node write path (client write, memtable in memory, commit log and immutable SSTables on disk, flush, client read). The four SSTables on disk are compacted ("Compacting 4 SSTables") into a single SSTable.

Implications in the real world

• No compaction = no eviction
• + TTLs or deletes, tombstones stack up (up to 100%)

• Overlapping SSTables = no eviction
• Fragmented data = eviction unlikely
• LCS: tombstone level ≠ data level = no eviction

• TTL << gc_grace_seconds = high % of useless data

Some tuning !

Good news:

Cassandra community and Committers are Awesome!

Issue: No compaction = No eviction

CASSANDRA-3442: tombstone_threshold (C* 1.2.b1)

Compaction option, default: tombstone_threshold = 0.2 (ratio: 20% has been deleted)

A single-SSTable compaction is triggered based on an estimate! Low risk: the worst case is a no-op.

Issue: Tombstone compaction loop!

CASSANDRA-4022: Check for key overlaps (C* 1.2.b1)

Internals improvement, not an option:

The estimate of droppable tombstones is improved. It now considers key overlap with other SSTables.

Issue: Tombstone compaction loop!

CASSANDRA-4781: tombstone_compaction_interval (C* 1.2.b2)

Compaction option, default: tombstone_compaction_interval = 86400 (in seconds = 1 day)

Definitely prevents loops.
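Taken together, tombstone_threshold and tombstone_compaction_interval gate single-SSTable tombstone compactions roughly as sketched here. The defaults are the ones quoted on the slides; the function itself is an illustrative simplification that ignores the key-overlap check from CASSANDRA-4022.

```python
def should_try_tombstone_compaction(droppable_ratio, sstable_age_seconds,
                                    tombstone_threshold=0.2,
                                    tombstone_compaction_interval=86400):
    """True when one SSTable is a candidate for a tombstone compaction."""
    # The interval keeps a freshly written SSTable from being recompacted
    # again and again, which is what breaks the compaction loop.
    old_enough = sstable_age_seconds >= tombstone_compaction_interval
    # The threshold requires enough estimated droppable tombstones.
    dirty_enough = droppable_ratio > tombstone_threshold
    return old_enough and dirty_enough
```

An SSTable with a 25% droppable ratio that is one hour old is skipped under the defaults, while the same SSTable a day later becomes a candidate.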

Issue: Compacting to remove tombstones is expensive

CASSANDRA-5228: Expired SSTables (C* 2.0.b1)

Internals improvement, not an option.

Effective with time series, DTCS / TWCS and TTLs!

Issue: Tombstone compactions not triggering

CASSANDRA-6563: unchecked_tombstone_compaction (C* 2.0.9)

Compaction option, default: unchecked_tombstone_compaction = false

The overlap check from CASSANDRA-4022 becomes optional.

Issue: Overlapping preventing efficient tombstone compactions

CASSANDRA-7019: provide_overlapping_tombstones (C* 3.10)

Compaction option, default: provide_overlapping_tombstones = NONE (CELL / ROW / NONE)

Risky:
• Not yet released, so not really tested
• Heavier tombstone compactions
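These compaction options are set per table, inside the table's compaction map. The sketch below only assembles the CQL text. The table name reuses mykeyspace.mytable from the log example, SizeTieredCompactionStrategy is picked arbitrarily, and the values are placeholders rather than recommendations; `alter_compaction_cql` is a hypothetical helper.

```python
def alter_compaction_cql(table, strategy, **options):
    """Build an ALTER TABLE statement that sets compaction sub-options."""
    entries = {"class": strategy}
    entries.update({name: str(value) for name, value in options.items()})
    body = ", ".join(f"'{key}': '{value}'" for key, value in entries.items())
    return f"ALTER TABLE {table} WITH compaction = {{{body}}};"

statement = alter_compaction_cql(
    "mykeyspace.mytable",
    "SizeTieredCompactionStrategy",
    tombstone_threshold=0.2,
    tombstone_compaction_interval=86400,
    unchecked_tombstone_compaction="true",
)
```

The resulting string can be pasted into cqlsh or sent through a driver. Note that altering the compaction map replaces it as a whole, so the strategy class has to be restated each time.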

Some tuning - Tombstone distribution!

Tombstones not replicated (same cluster: 4 nodes, RF = 3, Write CL = Quorum = 2, Read CL = Quorum = 2, A* = tombstone on A):
• The client deletes “A”; two replicas write A* and acknowledge, the third still holds A. A client read returns “A*”, meaning “empty” (correct).
• The tombstones A* are then removed from the two replicas that had them, while the third replica still holds A. A client read now returns “A” (wrong).

Case where a node fails + no repair = case without tombstones = zombie data!

CASSANDRA-6434 (C* 3.0.b1): only_purge_repaired_tombstones (default: false)

With this option enabled and the tombstone not yet replicated to the third replica, A* is not removed. A client read still returns “A*”, meaning “empty” (correct).

Limitation: repair failing or no repair = permanent tombstones
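The zombie scenario and the effect of only_purge_repaired_tombstones can be replayed with a toy three-replica model. Everything here is illustrative: `purge` stands in for a compaction running after gc_grace_seconds, and the `repaired` flag for the repair status of the SSTable holding the tombstone.

```python
def purge(replica, only_purge_repaired):
    """Drop a tombstone that is past gc_grace_seconds, as a compaction would."""
    is_tombstone = replica["cell"] == "A*"
    if is_tombstone and (replica["repaired"] or not only_purge_repaired):
        return {"cell": None, "repaired": replica["repaired"]}
    return replica

def read(replicas):
    """Tombstone beats data; data beats nothing."""
    cells = [r["cell"] for r in replicas]
    if "A*" in cells:
        return None              # "A*" means empty
    return "A" if "A" in cells else None

def after_gc_grace(only_purge_repaired):
    # The delete reached two replicas; the third missed it and no repair ran.
    replicas = [
        {"cell": "A*", "repaired": False},
        {"cell": "A*", "repaired": False},
        {"cell": "A", "repaired": False},
    ]
    replicas = [purge(r, only_purge_repaired) for r in replicas]
    # Quorum read hitting one purged replica and the one that missed the delete.
    return read([replicas[0], replicas[2]])
```

With the option off, `after_gc_grace(False)` brings "A" back from the dead; with it on, `after_gc_grace(True)` still reads as empty, at the price of keeping the tombstone until a repair succeeds.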

Conclusion

Things we know about tombstones

• Tombstones are due to deletes and TTLs
• Tombstones fit with the Cassandra write path
• Tombstones ensure consistency

• Reading tombstones is expensive and can produce failures
• Tombstones take space on disk and might be tricky to remove
• Tombstones need to be distributed before being removed

Takeaways

• Model data and workflow to avoid reading many tombstones

• Deleting data = repair the table within gc_grace_seconds

• Monitor tombstones, keep control! (Set some alerts ?)

• Use compaction options to tackle problems, there is always a way.

• Is there no way? Ask, or create a Jira and keep improving Cassandra!

Thank you! Questions?

thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html