Cassandra SF Meetup - CQL Performance With Apache Cassandra 3.X

SF CASSANDRA USERS MARCH 2016

CQL PERFORMANCE WITH APACHE CASSANDRA 3.0

Aaron Morton @aaronmorton

CEO

Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License

About The Last Pickle.

Work with clients to deliver and improve Apache Cassandra based solutions.

Apache Cassandra Committer and DataStax MVPs.

Based in New Zealand, Australia, France & USA.

How We Got Here

Storage Engine 3.0

Write Path

Read Path

How We Got Here

Way back in 2011…

2011

Blog: Cassandra Query Plans

http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html

2012

Talk: Technical Deep Dive - Query Performance

https://www.youtube.com/watch?v=gomOKhMV0zc

Explain Read & Write performance in 45 minutes.

Skip Forward to 2016

Blog: Introduction To The Apache Cassandra 3.x Storage Engine

http://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html

“Why don’t I do another talk about Cassandra performance?”

It was a busy 4 years…

CQL 3, Collection Types, UDTs, UDFs, UDAs, Materialised Views, Triggers, SASI,…

Explain Read & Write performance in 45 minutes.

So Let’s Avoid

CQL 3, Collection Types, UDTs, UDFs, UDAs, Materialised Views, Triggers, SASI,…

High Level Storage Engine 3.0

Storage Engine 3.0 Files

Data.db, Index.db, Filter.db, CompressionInfo.db, Statistics.db, Digest.crc32, CRC.db, Summary.db, TOC.txt

CQL Recap

    create table my_table (
        partition_1 text,
        cluster_1 text,
        foo text,
        bar text,
        baz text,
        PRIMARY KEY (partition_1, cluster_1)
    );

WARNING: FAKE DATA AHEAD

CQL With Thrift Pre 3.0

    [default@dev] list my_table;
    -------------------
    RowKey: part_a
    => (column=clust_a:, value=, timestamp=1357…739000)
    => (column=clust_a:foo, value=some foo, timestamp=1357…739000)
    => (column=clust_a:bar, value=and bar, timestamp=1357…739000)
    => (column=clust_a:baz, value=no baz, timestamp=1357…739000)
    => (column=clust_b:, value=, timestamp=1357…739000)
    => (column=clust_b:foo, value=no foo, timestamp=1357…739000)
    => (column=clust_b:bar, value=no bar, timestamp=1357…739000)
    => (column=clust_b:baz, value=lots baz, timestamp=1357…739000)

CQL Pre 3.0

Clustering Keys Repeated

Column Names Repeated

Timestamps Repeated

Fixed Width Encoding

No Knowledge Of Row Contents

Storage Engine 3.0 Improvements

Delta Encoding

Variable Int Encoding

Clustering Written Once

Aggregated Metadata

Cell Presence

SerializationHeader

For each SSTable*.

Stored in each SSTable.

Held in memory.

SerializationHeader

    public class SerializationHeader
    {
        private final AbstractType<?> keyType;
        private final List<AbstractType<?>> clusteringTypes;

        private final PartitionColumns columns;
        private final EncodingStats stats;
        …
    }

EncodingStats

Collected on the fly by the Memtable.

EncodingStats

    public class EncodingStats
    {
        public final long minTimestamp;
        public final int minLocalDeletionTime;
        public final int minTTL;
        …
    }

SerializationHeader

    public class SerializationHeader
    {
        public void writeTimestamp(long timestamp, DataOutputPlus out) throws IOException
        {
            // Write the timestamp as a delta from the minimum timestamp in the stats.
            out.writeUnsignedVInt(timestamp - stats.minTimestamp);
        }
        …
    }

VIntCoding

    public class VIntCoding
    {
        public static void writeUnsignedVInt(long value, DataOutput output) throws IOException
        {
            int size = VIntCoding.computeUnsignedVIntSize(value);
            if (size == 1)
            {
                // Values that fit in a single byte are written directly.
                output.write((int)value);
                return;
            }

            output.write(VIntCoding.encodeVInt(value, size), 0, size);
        }
    }
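To see why delta encoding and variable int encoding work well together, here is a minimal sketch that only computes encoded sizes. It is not the Cassandra implementation: `VIntSizeSketch` and its methods are hypothetical names, the timestamps are made up, and the size rule assumes 7 payload bits per byte with a 9-byte maximum.

```java
final class VIntSizeSketch {
    // Bytes needed for an unsigned vint.
    // Assumption: each byte carries 7 payload bits, capped at 9 bytes for a full 64-bit value.
    static int unsignedVIntSize(long value) {
        int bits = 64 - Long.numberOfLeadingZeros(value | 1); // significant bits, at least 1
        return Math.min(9, (bits + 6) / 7);
    }

    // Size of a timestamp written as a delta from the minimum timestamp in the SSTable.
    static int deltaEncodedSize(long timestamp, long minTimestamp) {
        return unsignedVIntSize(timestamp - minTimestamp);
    }

    public static void main(String[] args) {
        long minTimestamp = 1_457_000_000_000_000L;  // hypothetical microsecond timestamp
        long timestamp = minTimestamp + 2_500_000L;  // written 2.5 seconds later

        System.out.println("fixed width long: " + Long.BYTES + " bytes");
        System.out.println("raw vint:         " + unsignedVIntSize(timestamp) + " bytes");
        System.out.println("delta vint:       " + deltaEncodedSize(timestamp, minTimestamp) + " bytes");
    }
}
```

A vint of the raw timestamp saves little because the value is large. The delta is small, so the vint is short.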

Storage Engine 3.0 Data.db

Storage Engine 3.0 Partition Header

Partition Key

Partition Deletion Information

Storage Engine 3.0 Row

Clustering Information

Row Level Liveness

Row Level Deletion

Column Presence

Columns

Storage Engine 3.0 Clustering Block

Clustering Cell Presence

Clustering Cells

Aggregated Cell Metadata

Only store Cell Timestamp, TTL, and Local Deletion Time if different to the Row.

Aggregated Cell Metadata

Simple Cell

    Component                                      Byte Size
    Flags                                          1
    Optional Cell Timestamp (delta)                varint 1…n
    Optional Cell Local Deletion Time (delta)      varint 1…n
    Optional Cell TTL (delta)                      varint 1…n
    Optional Cell Value                            See Below

Fixed Width Cell Value

    Component                                      Byte Size
    Value                                          1…n

Variable Width Cell Value

    Component                                      Byte Size
    Value Length                                   varint 1…n
    Value                                          1…n
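The layout above can be sketched as a flags byte followed by only those metadata fields that differ from the Row. This is a simplified illustration, not the Cassandra serializer: the class, the flag constants and their values are hypothetical, Local Deletion Time is left out, and every cell is assumed to carry a TTL field.

```java
final class CellMetadataSketch {
    // Hypothetical flag bits: set when the cell reuses the value stored at the row level.
    static final int USE_ROW_TIMESTAMP = 0x08;
    static final int USE_ROW_TTL = 0x10;

    // Assumption: 7 payload bits per vint byte, capped at 9 bytes.
    static int unsignedVIntSize(long value) {
        int bits = 64 - Long.numberOfLeadingZeros(value | 1);
        return Math.min(9, (bits + 6) / 7);
    }

    static int flags(long cellTimestamp, int cellTtl, long rowTimestamp, int rowTtl) {
        int flags = 0;
        if (cellTimestamp == rowTimestamp) flags |= USE_ROW_TIMESTAMP;
        if (cellTtl == rowTtl) flags |= USE_ROW_TTL;
        return flags;
    }

    // Bytes of metadata written for one cell, excluding its value.
    static int metadataSize(long cellTimestamp, int cellTtl,
                            long rowTimestamp, int rowTtl,
                            long minTimestamp, int minTtl) {
        int flags = flags(cellTimestamp, cellTtl, rowTimestamp, rowTtl);
        int size = 1; // the flags byte is always written
        if ((flags & USE_ROW_TIMESTAMP) == 0)
            size += unsignedVIntSize(cellTimestamp - minTimestamp); // delta from the SSTable minimum
        if ((flags & USE_ROW_TTL) == 0)
            size += unsignedVIntSize(cellTtl - minTtl);
        return size;
    }

    public static void main(String[] args) {
        // Cell written together with its row: only the flags byte is needed.
        System.out.println(metadataSize(1000L, 0, 1000L, 0, 0L, 0));
        // Cell updated later than the rest of the row: flags byte plus a timestamp delta.
        System.out.println(metadataSize(2000L, 0, 1000L, 0, 1000L, 0));
    }
}
```

In the common case where a whole row is written in one statement, every cell shares the row timestamp and the per-cell metadata is just the flags byte.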

Apache Cassandra 3.0 Storage Engine

Cell Presence

The SSTable stores the list of Cells present in this SSTable.

Each Row stores a bitmap of the Cells present in this Row, with reference to the SSTable list.
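The idea can be sketched as one bit per column in the SSTable list. This is an illustration only, assuming at most 64 columns; `ColumnPresenceSketch` is a hypothetical name and the real on-disk encoding may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColumnPresenceSketch {
    // Bit i is set when the i-th column of the SSTable list has a cell in the row.
    static long encode(List<String> sstableColumns, List<String> rowColumns) {
        if (sstableColumns.size() > 64)
            throw new IllegalArgumentException("sketch supports at most 64 columns");
        long bitmap = 0L;
        for (int i = 0; i < sstableColumns.size(); i++) {
            if (rowColumns.contains(sstableColumns.get(i)))
                bitmap |= 1L << i;
        }
        return bitmap;
    }

    // Recover the column names for a row from its bitmap and the SSTable list.
    static List<String> decode(List<String> sstableColumns, long bitmap) {
        List<String> present = new ArrayList<>();
        for (int i = 0; i < sstableColumns.size(); i++) {
            if ((bitmap & (1L << i)) != 0)
                present.add(sstableColumns.get(i));
        }
        return present;
    }

    public static void main(String[] args) {
        List<String> sstableColumns = Arrays.asList("bar", "baz", "foo");
        long bitmap = encode(sstableColumns, Arrays.asList("foo", "bar")); // hypothetical row without baz
        System.out.println(Long.toBinaryString(bitmap) + " -> " + decode(sstableColumns, bitmap));
    }
}
```

Column names are written once per SSTable, and each row pays only for a small bitmap instead of repeating the names.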

Storage Engine 3.0 Row

Remember Where We Came From

    [default@dev] list my_table;
    -------------------
    RowKey: part_a
    => (column=clust_a:, value=, timestamp=1357…739000)
    => (column=clust_a:foo, value=some foo, timestamp=1357…739000)
    => (column=clust_a:bar, value=and bar, timestamp=1357…739000)
    => (column=clust_a:baz, value=no baz, timestamp=1357…739000)
    => (column=clust_b:, value=, timestamp=1357…739000)
    => (column=clust_b:foo, value=no foo, timestamp=1357…739000)
    => (column=clust_b:bar, value=no bar, timestamp=1357…739000)
    => (column=clust_b:baz, value=lots baz, timestamp=1357…739000)

Write Path

Commit Log

Merge Into Memtable

Commit Log

Allocate space in the current commit log segment.

Allocate Segment

o.a.c.m.CommitLog.WaitingOnSegmentAllocation.95thPercentile

Merge Into Memtable

Find the Partition.

Loop trying to update the Rows in it using CAS.

Merge Into Memtable

If more than 10MB of allocations are wasted, move to pessimistic locking on the Partition object.
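The optimistic loop with a pessimistic fallback can be sketched as below. This is a simplified model rather than the Memtable code: `PartitionSketch`, its fields and the per-update `allocationBytes` estimate are hypothetical, and only the 10MB threshold comes from the slide.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

final class PartitionSketch<T> {
    static final long WASTE_LIMIT_BYTES = 10L * 1024 * 1024; // 10MB, as on the slide

    private final AtomicReference<T> state;
    private final AtomicLong wastedBytes = new AtomicLong();
    private final Object lock = new Object();
    private volatile boolean pessimistic = false;

    PartitionSketch(T initial) { state = new AtomicReference<>(initial); }

    T get() { return state.get(); }

    boolean isPessimistic() { return pessimistic; }

    // merge must build a new value from the current one without mutating it.
    T update(UnaryOperator<T> merge, long allocationBytes) {
        if (pessimistic) {
            // Contended partition: writers queue on a lock instead of racing each other.
            synchronized (lock) {
                return casLoop(merge, allocationBytes);
            }
        }
        return casLoop(merge, allocationBytes);
    }

    private T casLoop(UnaryOperator<T> merge, long allocationBytes) {
        while (true) {
            T current = state.get();
            T merged = merge.apply(current);
            if (state.compareAndSet(current, merged))
                return merged;
            // Lost the race: the memory allocated to build `merged` is wasted.
            if (wastedBytes.addAndGet(allocationBytes) > WASTE_LIMIT_BYTES)
                pessimistic = true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PartitionSketch<Integer> partition = new PartitionSketch<>(0);
        Thread[] writers = new Thread[4];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new Thread(() -> {
                for (int n = 0; n < 1000; n++) partition.update(v -> v + 1, 64L);
            });
            writers[i].start();
        }
        for (Thread writer : writers) writer.join();
        System.out.println("value=" + partition.get() + " pessimistic=" + partition.isPessimistic());
    }
}
```

The CAS loop is still used under the lock because writers that started before the switch may still be racing. The lock only stops new writers from adding to the waste.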

Read Paths

Ignoring Index Read paths.

Read Commands

PartitionRangeReadCommand

SinglePartitionReadCommand

AbstractClusteringIndexFilter

ClusteringIndexNamesFilter (When we know the column names.)

ClusteringIndexSliceFilter (When we do not know the column names.)

ClusteringIndexNamesFilter

When we know what Columns to select, we know when the search is over.

ClusteringIndexNamesFilter

1. Get Partition From Memtables.

2. Filter named columns into a temporary result.

3. Select SSTables that may contain Partition Key.

4. Order in descending timestamp order.

5. Read from SSTables in order.

Names Filter Short Circuits

If the result has a Partition Deletion newer than the max timestamp of the next SSTable, stop the search.

If all Columns have been read and the max timestamp of the next SSTable is less than the min timestamp of the selected Columns, stop the search.

Note: the list of Columns remaining to select is pruned after every SSTable is read, based on max timestamp.

If the search clustering value is not within the clustering range of the SSTable, skip the SSTable.

If an SSTable Cell is not in the search set, skip reading its value.
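Two of these short circuits can be sketched over an in-memory model of a single row. This is an illustration under stated assumptions, not the read path code: `NamesFilterSketch`, its `SSTable` type and `read` method are hypothetical, the Partition Deletion is passed in as a single known timestamp, and the clustering range check is omitted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NamesFilterSketch {
    static final long NO_DELETION = Long.MIN_VALUE;

    // Hypothetical stand-in for the cells one SSTable holds for the queried row.
    static final class SSTable {
        final long maxTimestamp;
        final Map<String, Long> cellTimestamps = new HashMap<>();

        SSTable(long maxTimestamp) { this.maxTimestamp = maxTimestamp; }

        SSTable with(String column, long timestamp) {
            cellTimestamps.put(column, timestamp);
            return this;
        }
    }

    // Fills result with the newest timestamp found per wanted column.
    // Returns how many SSTables were read before the search stopped.
    static int read(List<SSTable> sstables, Set<String> wanted,
                    long partitionDeletion, Map<String, Long> result) {
        List<SSTable> ordered = new ArrayList<>(sstables);
        ordered.sort(Comparator.comparingLong((SSTable s) -> s.maxTimestamp).reversed());

        int sstablesRead = 0;
        for (SSTable sstable : ordered) {
            // Everything in this and all older SSTables is shadowed by the deletion.
            if (partitionDeletion > sstable.maxTimestamp) break;

            // All columns found, and nothing in older SSTables can be newer.
            boolean haveAll = !wanted.isEmpty() && result.keySet().containsAll(wanted);
            if (haveAll && sstable.maxTimestamp < Collections.min(result.values())) break;

            sstablesRead++;
            for (Map.Entry<String, Long> cell : sstable.cellTimestamps.entrySet()) {
                if (!wanted.contains(cell.getKey())) continue; // not in the search set
                result.merge(cell.getKey(), cell.getValue(), Long::max);
            }
        }
        return sstablesRead;
    }
}
```

With three SSTables where the newest already holds every wanted column, `read` returns 1 and the older SSTables are never touched.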

ClusteringIndexSliceFilter

When we do not know which columns to select, the search ends when it is exhausted.

ClusteringIndexSliceFilter

Used with:

Distinct.

Not all clustering columns restricted.

ClusteringIndexSliceFilter

1. Get Partition From Memtables.

2. Create Iterators for Partitions.

3. Select SSTables that may contain Partition Key.

4. Order in reverse max timestamp order.

5. Create Iterators for SSTables in order.

Slice Filter Short Circuits

If the SSTable max timestamp is before the max seen Partition Deletion timestamp, stop the search.

If the search clustering value is not within the clustering range of the SSTable, skip the SSTable.
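A rough sketch of how these two checks decide which SSTables get an iterator. All names here are hypothetical, clustering values are modelled as plain strings, and the Partition Deletion of each visited SSTable is assumed to be known even when its rows are skipped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SliceFilterSketch {
    static final long NO_DELETION = Long.MIN_VALUE;

    // Hypothetical per-SSTable metadata for one partition.
    static final class SSTable {
        final String name;
        final long maxTimestamp;
        final long partitionDeletion;
        final String minClustering;
        final String maxClustering;

        SSTable(String name, long maxTimestamp, long partitionDeletion,
                String minClustering, String maxClustering) {
            this.name = name;
            this.maxTimestamp = maxTimestamp;
            this.partitionDeletion = partitionDeletion;
            this.minClustering = minClustering;
            this.maxClustering = maxClustering;
        }

        boolean overlaps(String sliceStart, String sliceEnd) {
            return minClustering.compareTo(sliceEnd) <= 0
                && maxClustering.compareTo(sliceStart) >= 0;
        }
    }

    // Returns the names of the SSTables that would be iterated for the slice.
    static List<String> select(List<SSTable> sstables, String sliceStart, String sliceEnd) {
        List<SSTable> ordered = new ArrayList<>(sstables);
        ordered.sort(Comparator.comparingLong((SSTable s) -> s.maxTimestamp).reversed());

        List<String> selected = new ArrayList<>();
        long maxSeenDeletion = NO_DELETION;
        for (SSTable sstable : ordered) {
            // Older than a deletion already seen: nothing here or later can be live.
            if (sstable.maxTimestamp < maxSeenDeletion) break;

            // Assumption: the deletion is tracked even if the rows are skipped below.
            maxSeenDeletion = Math.max(maxSeenDeletion, sstable.partitionDeletion);

            // The slice does not intersect this SSTable's clustering range.
            if (!sstable.overlaps(sliceStart, sliceEnd)) continue;

            selected.add(sstable.name);
        }
        return selected;
    }
}
```

Unlike the names filter, there is no point at which all columns have been found, so only these checks on SSTable metadata can shorten the search.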

So…

3.x is awesome. Start using it as soon as possible.

Thanks.

Aaron Morton @aaronmorton

Co-Founder & Principal Consultant

www.thelastpickle.com