Fears, misconceptions, and accepted anti patterns of a first time cassandra adopter

Fears, Misconceptions, and Accepted Anti-patterns of a First-Time Cassandra Adopter

Ben Christenson

Who am I?

● Full Stack Developer / Architect at Kinetic Data

● MSP Cassandra Meetups since August 2014

● I like cool technology, whisky, and Brazilian jiu-jitsu.

Why am I presenting?

● Cassandra is a great strategy even if you aren’t looking for infinite OPS

○ Lots of articles for newbies and experts, not a lot of content on non-extreme use

● Give back to the meetup

○ I enjoy hearing about real implementations

○ Meetup is one of the reasons we chose Cassandra

● A little symbiotic selfishness

○ There may be better patterns that I don’t know about

○ Developing a more technical version of the presentation with sample code, metrics, etc

About Kinetic Data

● About 50 employees

● Main office in St. Paul, secondary office in Sydney, Australia, satellite offices throughout the US

● Develop software to improve service experience

Cassandra Adoption

● What is Kinetic Request?

● Why Cassandra?

What is Kinetic Request?

What is Kinetic Request?

Workflow automation through Kinetic Task

What is Kinetic Request?

● Originally developed on the BMC Action Request System (ARS)

● Evolved into a Java webapp

● Planning for ARS decoupling for about 5 years

● January 2015 - Started Request CE development

● May 2015 - Demoed Request CE prototype

● March 2016 - Released Request CE v1.0.0

Why Cassandra?

● Multi-datacenter replication

○ Significantly improved performance for global customers

● Durability

● Scalability

○ Easier to start at current scale

○ Scale out without migrating

○ Scale size and throughput

● Community

Fears and Misconceptions

● Stack Overflow-itis

● Kinetic Request is a deployed solution

● Cassandra is for write-heavy workloads

● Cassandra is for time series data

● Cassandra requires Java and Linux experts

Stack Overflow-itis

Fears

● ALLOW FILTERING

○ Don’t ever use that!

● Secondary indexes

○ Don’t use those.

● Collections

○ Probably shouldn’t use those either…

● Counters

○ Don’t you want to use something that works?

● Tombstones, Tombstones?!, TOMBSTONES!

Reality

● Thank you MSP Cassandra Meetup for being the cure!

○ Everything was included for a reason

○ “You probably don’t want to use Xyz for that, but here is when you would.”

Kinetic Request is a deployed solution

Fear

● Very hard to find anyone using Cassandra for customer-managed solutions

Reality

● Many of our customers already pay for Cassandra support

● Many of our customers understand the benefits

● As a deployed solution, our data usage and schemas don’t change frequently, and potential issues are (hopefully) caught before reaching the customer

● Possible future talk?

Cassandra is for write-heavy workloads

Misconception

● Cassandra is only for write-heavy workloads

Reality

● Cassandra is extremely good at write-heavy workloads

● Cassandra can be implemented to be good at read-heavy workloads

● Even with heavy delete-and-insert updates, reads are still outperforming previous versions of Kinetic Request

Cassandra is for time series data

Misconception

● Cassandra is only for time series data

Reality

● Cassandra is extremely good at time series data

● Cassandra is extremely good at replicating all data

● Just because Cassandra can handle extreme operations per second doesn’t mean it isn’t suited for lower OPS usage (and you can get away with a lot more)

Cassandra requires Java and Linux experts

Misconception

● We were going to need to become Java and Linux experts to use Cassandra

Reality

● We needed to have a computer to use Cassandra

● We needed to be willing to learn more about Java, Linux commands, and Cassandra internals as we went

Accepted Anti-patterns

● Atomicity and Read-Before-Write

● Distributed Joins

● Lookup Tables

● Delete-And-Insert Updates

● Queues

Atomicity and Read before Write

● Read before write is often described as an anti-pattern

○ Leads to either potential inconsistency or Compare and Set (CAS) / Lightweight Transaction (LWT) operations

○ Event sourcing may be an alternative

● There isn’t always an alternative

● Kinetic Request uses LWTs for “Optimistic Locking” (and uniqueness)

● Even though LWTs are an order of magnitude slower than regular writes, they are more than fast enough at our scale

○ < 10ms with Replication Factor 3

○ Order of magnitude faster than what is necessary for us

Atomicity and Read before Write

Sample Schema

CREATE TABLE IF NOT EXISTS widgets (
  name text,
  tenant_id timeuuid,
  value text,
  version_id timeuuid,
  PRIMARY KEY ((tenant_id), name)
) WITH CLUSTERING ORDER BY (name ASC);

Uniqueness

INSERT INTO widgets (name, tenant_id, value, version_id)
VALUES (:name, :tenant_id, :value, :version_id)
IF NOT EXISTS;

Optimistic Locking

// Assumption: each update also writes a fresh version_id (:new_version_id)
// so that writers still holding the old version are rejected.
UPDATE widgets
SET value = :value, version_id = :new_version_id
WHERE tenant_id = :tenant_id AND name = :name
IF version_id = :version_id;
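The uniqueness and optimistic-locking statements above are both conditional (compare-and-set) writes. As a rough illustration of their semantics only, the Python sketch below uses an in-memory dict in place of the widgets table. `WidgetStore`, `insert_if_not_exists`, and `update_if_version` are hypothetical names, not part of Kinetic Request or any Cassandra driver, and the sketch assumes every successful update writes a fresh version.

```python
import uuid


class WidgetStore:
    """In-memory stand-in for the widgets table (assumption: not a real driver)."""

    def __init__(self):
        # Keyed by (tenant_id, name), mirroring PRIMARY KEY ((tenant_id), name).
        self.rows = {}

    def insert_if_not_exists(self, tenant_id, name, value):
        """Models INSERT ... IF NOT EXISTS; returns (applied, current version_id)."""
        key = (tenant_id, name)
        if key in self.rows:
            return False, self.rows[key]["version_id"]
        version_id = uuid.uuid1()  # uuid1 is time-based, like a CQL timeuuid
        self.rows[key] = {"value": value, "version_id": version_id}
        return True, version_id

    def update_if_version(self, tenant_id, name, value, expected_version):
        """Models UPDATE ... IF version_id = ...; returns (applied, current version_id)."""
        row = self.rows.get((tenant_id, name))
        if row is None:
            return False, None
        if row["version_id"] != expected_version:
            return False, row["version_id"]  # someone else updated the row first
        row["value"] = value
        row["version_id"] = uuid.uuid1()  # new version invalidates stale writers
        return True, row["version_id"]


store = WidgetStore()
tenant = uuid.uuid1()
created, v1 = store.insert_if_not_exists(tenant, "gear", "small")
duplicate, _ = store.insert_if_not_exists(tenant, "gear", "other")
first, v2 = store.update_if_version(tenant, "gear", "medium", v1)
stale, _ = store.update_if_version(tenant, "gear", "large", v1)  # v1 is outdated
```

The second insert and the second update are both rejected: the name is already taken, and the stale writer still holds the version it originally read.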

Distributed Joins

● Cassandra doesn’t support joins, but you can do them in memory

○ Requires multiple sequential reads and/or multi-partition queries

○ Embedded or denormalized content is sometimes an alternative

● We use a distributed join between Submissions and Forms

○ Allows us to rename the form and maintain the link

○ Acceptable because forms are finite enough to keep in memory

● Christopher Batey has a great blog post on this: http://christopher-batey.blogspot.com/2015/02/cassandra-anti-pattern-distributed.html

Distributed Joins

Submissions Schema

CREATE TABLE IF NOT EXISTS submissions (
  form_id timeuuid,
  id timeuuid,
  tenant_id timeuuid,
  ...
  PRIMARY KEY ((tenant_id), id)
) WITH CLUSTERING ORDER BY (id ASC);

Forms Schema

CREATE TABLE IF NOT EXISTS forms (
  name text,
  tenant_id timeuuid,
  ...
  PRIMARY KEY ((tenant_id), name)
) WITH CLUSTERING ORDER BY (name ASC);

SELECT * FROM submissions WHERE tenant_id = :tenant_id AND id = :id;
SELECT * FROM forms WHERE tenant_id = :tenant_id AND id = :submission_form_id;
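The in-memory join between submissions and forms can be sketched roughly as follows. This is a minimal Python illustration, assuming the forms for a tenant have already been loaded into a dict keyed by form id; the function name and the sample rows are hypothetical and only mimic the shape of the two SELECT results above.

```python
def join_submissions_with_forms(submissions, forms_by_id):
    """Attach each submission's form by looking it up in an in-memory dict."""
    joined = []
    for submission in submissions:
        form = forms_by_id.get(submission["form_id"])  # None if the form is missing
        joined.append({**submission, "form": form})
    return joined


# Hypothetical rows, shaped like results of the two SELECT statements above.
forms_by_id = {"f1": {"id": "f1", "name": "Laptop Request"}}
submissions = [
    {"id": "s1", "form_id": "f1"},
    {"id": "s2", "form_id": "f1"},
]
joined = join_submissions_with_forms(submissions, forms_by_id)

# Renaming the form changes one row; every submission still resolves through form_id.
forms_by_id["f1"]["name"] = "Hardware Request"
```

Because submissions store only `form_id`, a rename touches a single form row rather than every submission, which is the trade-off the slide describes.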

Lookup Tables

● Lookup Tables are another form of Distributed Join

○ Table contains only data necessary for the query and an id used to lookup from the source of truth

○ Requires a “multi-get” to retrieve actual records

○ Often considered an anti-pattern for similar reasons as distributed joins

● We use lookup tables for Webhooks and Submissions

○ Duplicating data would lead to storage requirements orders of magnitude higher

Lookup Tables

CREATE TABLE IF NOT EXISTS webhooks (
  id timeuuid,
  scheduled_at timestamp,
  tenant_id timeuuid,
  ...
  PRIMARY KEY ((tenant_id), id)
) WITH CLUSTERING ORDER BY (id ASC);

CREATE TABLE IF NOT EXISTS webhooks_index (
  bucket text,
  id timeuuid,
  index_type text, // Tenant, Webhook, Parent
  index_key text,
  scheduled_at timestamp,
  tenant_id timeuuid,
  PRIMARY KEY ((tenant_id, bucket, index_type, index_key), scheduled_at, id)
) WITH CLUSTERING ORDER BY (scheduled_at DESC, id DESC) ...;
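The “multi-get” against a lookup table might look something like the Python sketch below. It is an illustration under assumptions: two dicts stand in for the webhooks and webhooks_index tables, the bucket value `"2016-03"` and all ids are made up, and a thread pool stands in for asynchronous driver queries.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the two tables above.
WEBHOOKS = {  # source of truth, keyed by id
    "w1": {"id": "w1", "url": "https://example.com/a"},
    "w2": {"id": "w2", "url": "https://example.com/b"},
    "w3": {"id": "w3", "url": "https://example.com/c"},
}
KEY = ("tenant-1", "2016-03", "Tenant", "tenant-1")  # (tenant_id, bucket, index_type, index_key)
WEBHOOKS_INDEX = {
    # (scheduled_at, id) rows, newest first, as in the clustering order above.
    KEY: [(30, "w3"), (20, "w2"), (10, "w1"), (5, "w0")],
}


def fetch_webhook(webhook_id):
    """Stands in for a single-row SELECT against the webhooks table."""
    return WEBHOOKS.get(webhook_id)


def multi_get(partition_key, limit):
    """Read ids from one index partition, then fetch each record concurrently."""
    rows = WEBHOOKS_INDEX.get(partition_key, [])[:limit]
    ids = [webhook_id for _, webhook_id in rows]
    with ThreadPoolExecutor(max_workers=4) as pool:
        records = list(pool.map(fetch_webhook, ids))  # map preserves input order
    return [record for record in records if record is not None]  # skip missing records


latest_two = multi_get(KEY, 2)
everything = multi_get(KEY, 10)
```

The index row for `w0` points at a record that no longer exists, so it is silently dropped. The index stores only ids and sort keys, which is why duplicating full records into it is avoided.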

Delete-And-Insert Updates

● Fundamental problem:

○ Cassandra retrieves by primary key

○ Users want to search by values that change

○ Updating a primary key is done as a DELETE and INSERT (which leads to tombstones)

○ Want to minimize environmental complexity

● No simple solution for us

○ Try to minimize number of deletes for a given query path

○ Try to optimize for tombstones

Delete-And-Insert Updates

● The biggest source of our delete-and-insert usage is supporting ad-hoc querying of submissions

● Example Ad-hoc query:

values[Foo] IN ("Bar", "Baz") AND ( values[Requested By]="ben.christenson" OR values[Requested For]="ben.christenson")

● Our solution is similar to the C* Summit presentation on multi-criteria queries

http://fr.slideshare.net/ippontech/multi-criteria-queries-on-a-cassandra-application

Delete-And-Insert Updates

Writing

● Read record from Cassandra (including version_id)

● An “Indexer” class generates index sets from original and updated model

● Optimistically update the source of truth record

● Asynchronously create/delete necessary index records
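The index-set step in the write path above can be approximated with plain set differences. The Python sketch below is not the actual “Indexer” class; `index_entries` and `index_changes` are hypothetical helpers, and it assumes each index entry is simply a (key, value) pair taken from the submission's values.

```python
def index_entries(submission):
    """Derive the set of (key, value) index entries for one submission."""
    return set(submission["values"].items())


def index_changes(original, updated):
    """Return (to_delete, to_create); entries present in both are left untouched."""
    before = index_entries(original) if original else set()
    after = index_entries(updated) if updated else set()
    return before - after, after - before


original = {"values": {"Status": "Open", "Requested By": "ben.christenson"}}
updated = {"values": {"Status": "Closed", "Requested By": "ben.christenson"}}
to_delete, to_create = index_changes(original, updated)
```

Only the changed `Status` entry produces a delete and an insert; the unchanged `Requested By` entry generates no tombstone.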

Reading

● Each criterion is a separate async query

● An in-memory evaluator aggregates the lookup table IDs

● The submissions associated with the resulting IDs are each retrieved asynchronously

● The in-memory evaluator “double checks” that the submissions match the query and may re-execute another search to fill in gaps for submissions that have been updated since the initial index queries (very rare)
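The read path above can be sketched as a small in-memory evaluator. This Python example is an illustration under assumptions: queries are nested tuples, `index` is a dict standing in for the per-criterion index queries, and stale matches are simply dropped instead of triggering a follow-up search. All function names and sample data are hypothetical.

```python
def lookup_ids(index, key, value):
    """Stands in for one index query: ids of submissions where values[key] == value."""
    return set(index.get((key, value), ()))


def candidate_ids(query, index):
    """Combine per-criterion id sets: AND intersects; OR and IN union."""
    op = query[0]
    if op == "=":
        return lookup_ids(index, query[1], query[2])
    if op == "IN":
        return set().union(*(lookup_ids(index, query[1], v) for v in query[2]))
    parts = [candidate_ids(child, index) for child in query[1]]
    if op == "AND":
        return set.intersection(*parts)
    if op == "OR":
        return set.union(*parts)
    raise ValueError(f"unsupported operator: {op}")


def matches(query, values):
    """Re-evaluate the query against a fetched record (the "double check")."""
    op = query[0]
    if op == "=":
        return values.get(query[1]) == query[2]
    if op == "IN":
        return values.get(query[1]) in query[2]
    results = [matches(child, values) for child in query[1]]
    return all(results) if op == "AND" else any(results)


def search(query, index, submissions):
    """Aggregate ids from the index, then keep only records that still match."""
    ids = candidate_ids(query, index)
    return sorted(i for i in ids if i in submissions and matches(query, submissions[i]))


query = ("AND", [
    ("IN", "Foo", ["Bar", "Baz"]),
    ("OR", [("=", "Requested By", "ben.christenson"),
            ("=", "Requested For", "ben.christenson")]),
])
index = {
    ("Foo", "Bar"): {"s1", "s3"},
    ("Foo", "Baz"): {"s2"},  # stale: s2 has been updated since it was indexed
    ("Requested By", "ben.christenson"): {"s1"},
    ("Requested For", "ben.christenson"): {"s2", "s4"},
}
submissions = {
    "s1": {"Foo": "Bar", "Requested By": "ben.christenson"},
    "s2": {"Foo": "Qux", "Requested For": "ben.christenson"},
    "s3": {"Foo": "Bar", "Requested By": "someone.else"},
    "s4": {"Foo": "Nope", "Requested For": "ben.christenson"},
}
result = search(query, index, submissions)
```

Here the index alone yields `s1` and `s2`, and the double check removes `s2` because its current values no longer satisfy the query.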

Delete-And-Insert Updates

CREATE TABLE IF NOT EXISTS submissions_index (
  tenant_id timeuuid,
  timeline text,
  bucket text, // '' for active or 'YYYY-mm'
  key text,
  value text,
  timestamp timestamp,
  submission_id timeuuid,
  PRIMARY KEY ((tenant_id, timeline, bucket, key), value, timestamp, submission_id)
) WITH CLUSTERING ORDER BY (value DESC, timestamp DESC, submission_id DESC)
AND COMPACTION = {
  'sstable_size_in_mb': '256',
  'tombstone_threshold': '0.05',
  'unchecked_tombstone_compaction': 'true',
  'tombstone_compaction_interval': '3600',
  'class': 'LeveledCompactionStrategy'
};

Delete-And-Insert Updates

● Even with Delete-And-Insert and lookup tables, performance is acceptable

○ Supports queries that were previously impossible

○ Extremely complicated search queries still return in < 150ms

● Does have some caveats

○ Only supports AND, OR, IN, and =

(would like to support !=, starts with, ends with, etc.)

○ Whenever an AND is used, at least one of the criteria must return fewer than 1000 matches

○ In order to support pagination, sort orders must be indexed independently

(combination of date and uuid; we index multiple date properties)

Queues

● Queues are one of the most commonly cited anti-patterns

● Problem comes down to tombstones again

○ Can be improved by truncating, knowing where live data begins, or complicated rotations

○ Can be improved by including additional technologies (real message queue)

Queues

● One of the queue-like structures used by Kinetic Request is for Webhooks

○ Can fail to connect and should be automatically retried (put back on queue)

○ Happen often and have a very specific query path so tombstones are worrisome

● In this case, the queue “event” can be processed initially by the server in memory

○ Write directly to the source of truth / historical index and avoid tombstones for normal executions

○ Only if the initial webhook fails is it added to the queue index
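The “process in memory first, queue only on failure” idea above can be sketched in a few lines of Python. This is a simplified illustration: `deliver` and `flaky_send` are hypothetical names, and the two lists stand in for the historical index and the queue index rather than real Cassandra tables.

```python
def deliver(webhook, send, history, retry_queue):
    """Try the webhook immediately; only failures ever touch the retry queue."""
    try:
        send(webhook)
    except ConnectionError:
        retry_queue.append(webhook)  # stands in for an insert into the queue index
        return False
    history.append(webhook)  # stands in for the source of truth / historical index
    return True


def flaky_send(webhook):
    """Hypothetical sender that cannot connect to URLs ending in /down."""
    if webhook["url"].endswith("/down"):
        raise ConnectionError("could not connect")


history, retry_queue = [], []
ok = deliver({"id": "w1", "url": "https://example.com/up"}, flaky_send, history, retry_queue)
failed = deliver({"id": "w2", "url": "https://example.com/down"}, flaky_send, history, retry_queue)
```

Successful webhooks never create a queue row, so there is nothing to delete later and no tombstone on the normal path.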

Queues

● Other styles of queues can’t necessarily be processed by the event server

○ Scheduled for the future

○ Handled by

● For this case, we are experimenting with an in-memory distributed queue

○ Hazelcast or Ignite (which have many other coordination benefits)

○ Avoids hitting tombstones by using Cassandra as a persistence mechanism only queried at startup

Takeaways

● Cassandra has many benefits, even if you are not using it at extreme scales

● The barrier to entry is not as scary as it seems

● Play, play, play, test, test, test

● Find good resources

Questions?
