Apache Cassandra - part 2
Cassandra part 2
diegopacheco @diego_pacheco
Diego Pacheco
@diego_pacheco
❏ Cat's Father
❏ Principal Software Architect
❏ Agile Coach
❏ SOA Expert
❏ DevOps Practitioner
❏ Speaker
❏ Author
diegopacheco
http://diego-pacheco.blogspot.com.br/
https://goo.gl/eEqvzl
About me...
Agenda
❏ RE-CAP
❏ Cassandra Write Path
❏ Tombstones
❏ Compaction Strategies
❏ Row Cache
❏ Bloom Filter
❏ SASI Index
❏ Materialized Views
❏ Counter Families
❏ Anti-Patterns
❏ Cassandra running at UBER in MESOS use case
❏ Q&A
RE-CAP: Partition Strategy
Cassandra Write Path
❏ SSTable => Sorted String Table.
❏ Write to disk: data is merged and pre-sorted before it is written.
❏ SSTables are IMMUTABLE.
❏ Compaction happens:
    ❏ From time to time
    ❏ Prunes deleted data
    ❏ Has trade-offs
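The write path above can be sketched as a toy model. This is only an illustration under stated assumptions: the class `TinyStore` and its `flush_threshold` knob are hypothetical, the commit log is left out, and real Cassandra flushes and compacts under very different rules.

```python
from heapq import merge


class TinyStore:
    """Toy model: writes land in a memtable, flushes create immutable sorted SSTables."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical knob, not a Cassandra setting
        self.memtable = {}   # key -> (timestamp, value)
        self.sstables = []   # each SSTable is an immutable, sorted tuple of (key, timestamp, value)
        self.clock = 0       # stand-in for write timestamps

    def write(self, key, value):
        self.clock += 1
        self.memtable[key] = (self.clock, value)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sort once at flush time; the SSTable is never modified afterwards.
        rows = tuple(sorted((k, ts, v) for k, (ts, v) in self.memtable.items()))
        if rows:
            self.sstables.append(rows)
        self.memtable = {}

    def compact(self):
        # Merge the sorted SSTables and keep only the newest version of each key.
        if not self.sstables:
            return
        newest = {}
        for key, ts, value in merge(*self.sstables):
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, value)
        self.sstables = [tuple((k, ts, v) for k, (ts, v) in sorted(newest.items()))]
```

Because SSTables are never edited in place, an update simply lands in a newer SSTable, and compaction is the step that discards the older version.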
Tombstones
❏ Deleted data is MARKED as removed == Tombstone
❏ The data is physically removed later, during compaction
❏ That can take several days, depending on the configuration.
❏ Queries on a partition with lots of tombstones require lots of filtering, which can slow down Cassandra's performance.
❏ Collection operations can create tombstones, depending on what you do.
❏ There are compaction trade-offs.
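A minimal sketch of the tombstone idea, assuming a partition is just a dictionary of cells. The helper names and the `gc_grace` parameter are hypothetical; `gc_grace` loosely models the grace period a tombstone must outlive before compaction may drop it.

```python
TOMBSTONE = object()  # marker value standing in for "this cell was deleted"


def delete(partition, key, now):
    # A delete is a write: it records a tombstone instead of removing the cell.
    partition[key] = (now, TOMBSTONE)


def read_live(partition):
    # Reads still have to scan tombstones and filter them out.
    scanned = 0
    live = {}
    for key, (ts, value) in partition.items():
        scanned += 1
        if value is not TOMBSTONE:
            live[key] = value
    return live, scanned


def compact(partition, now, gc_grace):
    # Only tombstones older than the grace period are physically dropped.
    return {
        key: (ts, value)
        for key, (ts, value) in partition.items()
        if not (value is TOMBSTONE and now - ts >= gc_grace)
    }
```

The `scanned` count is what grows on tombstone-heavy partitions: the read returns few rows but still pays for every deleted cell it walks over.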
Compaction Strategies
❏ STCS (SizeTieredCompactionStrategy)
    ❏ Default
    ❏ Insert-Heavy
    ❏ General Workloads
❏ LCS (LeveledCompactionStrategy)
    ❏ Read Heavy
    ❏ More Updates than Inserts
❏ DTCS (DateTieredCompactionStrategy)
    ❏ Time Series
    ❏ Inserts out of order
    ❏ Updates for old data
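To make the size-tiered idea concrete, here is a rough sketch of how SSTables of similar size could be grouped into buckets, with a bucket becoming a compaction candidate once it holds enough tables. The function names are hypothetical, and the default values are assumptions chosen for illustration; the real strategy applies additional rules that are not modeled here.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes so each bucket holds tables close to the bucket's average size."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size: start a new one.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes, min_threshold=4):
    """Buckets with at least min_threshold SSTables are candidates for compaction."""
    return [b for b in size_tiered_buckets(sizes) if len(b) >= min_threshold]
```

For example, `compaction_candidates([10, 11, 12, 9, 100, 110, 1000])` selects only the four small tables, which is why this style suits insert-heavy workloads that keep producing many small SSTables.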
Cassandra ROW CACHE
❏ Buffers the FULL merged row in memory
❏ Can greatly increase read throughput
❏ Row Cache works with Key Cache
❏ Key Cache = where the partition is on DISK.
CREATE TABLE status (
    user text,
    status_id timeuuid,
    status text,
    PRIMARY KEY (user, status_id)
) WITH CLUSTERING ORDER BY (status_id DESC)
  AND caching = {'keys': 'ALL', 'rows_per_partition': '10'};
Cassandra Bloom Filter
❏ Bloom Filter: technique created in the 1970s to filter database matches.
❏ Space efficient
❏ Probabilistic data structure
❏ For each SSTable there is a Bloom Filter
❏ Used for index scans, not for range scans
❏ Stored OFF HEAP
❏ Tunable per TABLE
❏ Cassandra uses bloom filters to know whether an SSTable may contain the requested data or not.
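A minimal Bloom filter sketch shows why the structure is space efficient and probabilistic: it can answer "definitely not here" or "maybe here", never "definitely here". The class name and sizes are hypothetical, and SHA-256 is used only for convenience; this is not a claim about how Cassandra hashes keys.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))
```

With one filter per SSTable, a read can skip every SSTable whose filter answers `False`, and only pays a disk lookup for the "maybe" cases.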
Cassandra READ Path
SASI
❏ Secondary Index: an index on a column that is not the primary key.
❏ Lookup tables: bySomething
❏ Distributed Index
❏ LIKE-style search capabilities: %diego%
❏ Great when:
    ❏ Multi-field search
    ❏ You know the partition key
    ❏ Indexing static columns
❏ Issues:
    ❏ More than 1000 rows returned
    ❏ Searching in large partitions
    ❏ Aggressive read SLOs
    ❏ Search for analytics (use Spark/Flink)
    ❏ Ordering of search results is important
SASI
Samples
❏ SELECT * FROM users WHERE firstname LIKE 'Die%';
❏ SELECT * FROM users WHERE lastname LIKE '%ie%';
❏ SELECT * FROM users WHERE created_date > '2015-01-02' AND created_date < '2017-01-02';
Materialized Views
❏ Automated: the table is managed for you (denormalization)
❏ Copies of the data in different partitions / replicas
❏ Some write penalty, but acceptable performance
❏ Stores results in a table which can be indexed
❏ Updated ASYNC
❏ Great for:
    ❏ Caching
    ❏ Result sets
    ❏ Dashboards
SAMPLE
CREATE MATERIALIZED VIEW all_time_high AS
    SELECT user FROM scores
    WHERE game IS NOT NULL AND score IS NOT NULL
    PRIMARY KEY (game, score)
    WITH CLUSTERING ORDER BY (score DESC);
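What a materialized view automates can be sketched by hand: every write to the base table also maintains a second copy, keyed for a different query. The class below is a hypothetical illustration; the base key `(user, game)` is an assumption, and the view is updated synchronously here for simplicity, whereas the slides describe the real update as asynchronous.

```python
class Scores:
    def __init__(self):
        self.base = {}  # (user, game) -> score, the "base table"
        self.view = {}  # game -> list of (score, user), kept sorted by score descending

    def upsert(self, user, game, score):
        old = self.base.get((user, game))
        rows = self.view.setdefault(game, [])
        if old is not None:
            # The old view row must be removed, which is part of the write penalty.
            rows.remove((old, user))
        self.base[(user, game)] = score
        rows.append((score, user))
        rows.sort(reverse=True)

    def all_time_high(self, game, limit=1):
        # The view answers "top scores per game" without scanning the base table.
        return self.view.get(game, [])[:limit]
```

The read side becomes a cheap slice of pre-sorted rows, at the cost of extra work on every write.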
Cassandra Counter Family
❏ Static vs. dynamic column families
❏ Dynamic column families, a.k.a. wide rows
❏ Wide rows are good for: ordering, grouping and filtering.
❏ A wide row is not split across nodes.
❏ Counters internally:
    ❏ The value is calculated as the sum over all replicas
    ❏ Split into fragments called SHARDs.
    ❏ Logical clock, monotonically increasing
    ❏ 3-tuple = { NODE_COUNTER_ID, SHARD_LOGICAL_CLOCK, SHARD_VALUE }
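The shard tuple above suggests a simple merge rule, sketched here under assumptions: each node only increments its own shard, the shard with the higher logical clock wins when two copies meet, and the counter value is the sum of all shard values. The function names are hypothetical and this is not Cassandra's actual implementation.

```python
def increment(shards, node_id, delta):
    # A node bumps its own shard: clock goes up by one, value accumulates the delta.
    clock, value = shards.get(node_id, (0, 0))
    shards[node_id] = (clock + 1, value + delta)


def merge(a, b):
    # For each node, keep the shard with the highest logical clock.
    merged = dict(a)
    for node_id, (clock, value) in b.items():
        if node_id not in merged or clock > merged[node_id][0]:
            merged[node_id] = (clock, value)
    return merged


def total(shards):
    # The counter value is the sum of the shard values from all nodes.
    return sum(value for _, value in shards.values())
```

Because each shard carries its own clock, merging a stale copy is harmless: the older shard simply loses to the newer one.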
Anti-Patterns
❏ Using Cassandra as a queue or queue-like table
    ❏ Tombstones
    ❏ Lots of deleted columns (expiry) and slice queries don't play well together
    ❏ http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
❏ CQL Nulls
    ❏ Reading tombstones
    ❏ Writing NULL creates tombstones
❏ Intensive updates on the SAME column
    ❏ Sensor table (ID, VALUE)
    ❏ Physical limits
    ❏ Solution: timestamp as clustering key.
Cassandra at UBER using MESOS (2016 data)