Add a bit of ACID to Cassandra. My talk from Cassandra Summit EU 2014
Uploaded by odnoklassnikiru
ok.ru
* 45M daily, 80M monthly audience
* Top 4 social networking site
* Top 7 on total time on site in the world (*)
* ~ 500,000 HTTP reqs/sec
* > 400 Gbps out
* > 8000 bare-metal servers in 5 DCs, ~1ms ping
(*) comScore data for July 2014, desktops, users aged 15+
* Since 2010: 0.6-ok, 1.2, 2.0
* In 2014: 33 clusters, > 600 storage nodes, 330 TB
* Fastest: 1.5M ops (48 nodes)
* Largest: 130 TB (96 nodes)
Cassandra at
SQL Server 2005
* Consistent (ACID) OLTP data
* 200 servers, 50 TB of data
* Sharding
  - F(Entity_Id) -> Token -> SQL Server Node
  - F(Master_Id) === F(Detail_Id)
* Local node commit only
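The sharding rule above can be sketched roughly as follows. This is a minimal illustration, not the production code: the class name, the modulo-based token function, the token count, and the node names are all assumptions. The point it shows is that a detail record is routed by its master's id, so master and detail land on the same node and a transaction needs only a local commit.

```java
import java.util.List;

// Hypothetical sketch of F(Entity_Id) -> Token -> SQL Server Node.
// The modulo token function and all names here are assumptions.
final class ShardRouterSketch {
    private final List<String> nodes;   // one contiguous token range per node
    private final int tokenCount;

    ShardRouterSketch(List<String> nodes, int tokenCount) {
        this.nodes = List.copyOf(nodes);
        this.tokenCount = tokenCount;
    }

    // F(Entity_Id) -> Token. Detail records pass their master's id here,
    // so F(Master_Id) == F(Detail_Id) holds by construction.
    int token(long masterId) {
        return (int) Math.floorMod(masterId, (long) tokenCount);
    }

    // Token -> node: tokens are split into equal contiguous ranges.
    String nodeFor(long masterId) {
        int idx = token(masterId) * nodes.size() / tokenCount;
        return nodes.get(idx);
    }

    public static void main(String[] args) {
        ShardRouterSketch router =
            new ShardRouterSketch(List.of("sql-01", "sql-02", "sql-03", "sql-04"), 256);
        long albumId = 1001L;        // master entity
        long photoAlbumId = albumId; // a photo is routed by its album's id
        System.out.println(router.nodeFor(albumId).equals(router.nodeFor(photoAlbumId)));
    }
}
```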
Fast SQL Server 2005
* DB JOIN
* Foreign key constraints
* Stored Procs, Triggers
* Read uncommitted (noTx)
* Short lived transactions <100ms
* No massive UPDATEs, DELETEs
* Always query on indexed data
Usual SQL shortcomings
* Manual “scale out” with downtime
* Downtime on maintenance
* Write performance
* BSoD, swap outs, magic
* Expensive HA hardware (10x 1U server price)
* Fragile failover - ~ 10% failovers fail
* Downtime on DC failure or partition
Simple transaction in SQL Server
TX.start("Albums", id);
Album album = albums.lock(id);
Photo photo = photos.create(…);
if (photo.status == PUBLIC) {
    album.incPublicPhotosCount();
}
TX.commit();
* Read - modify - write
* Involves a few records, different tables
* Possibility of concurrent transactions on 1 key
Usual NoSQL problems
* Learning curve
* Sophisticated development
  - Often rewrite from scratch, data model and UI
  - Often with omission of functionality
* Distributed programming means
  - (A lot of) app-specific code around consistency, conflict resolution, retries and rollbacks
* Ad-hoc, fragile and buggy ACID implementation
We need a New Storage
* Fast to learn and develop
  - ACID
  - SQL
* Easy to operate and maintain
  - Read and modify on DC failure
  - Automatic scale out w/o downtime
  - Commodity hardware
* Fixable codebase (open source, Java)
TODO: SQL
* Scale out
* Availability - Cluster - Conflict resolution - SQL
* ACID
* SQL
* Cassandra 2 CQL
NoSQL ?
- OR -
Cassandra 2.0
* Implements out of the box
  - CQL
  - Automatic scale out
  - Good write perf
  - Quorums, speculative retry (see also CASSANDRA-6866)
  - Logged Batch
  - “Lightweight” transactions ?
* Read - modify - write
* Possibility of concurrent transactions on 1 key
* Involves a few records, different tables
* “3 phase commit” -> slow
Cassandra 2.0
* Implements out of the box
  - CQL
  - Automatic scale out
  - Good write perf (https://github.com/jbellis/YCSB)
  - Quorums, speculative retry (see also CASSANDRA-6866)
  - Logged Batch
  - “Lightweight” transactions
  - Secondary indexes ?
C*One
* ACID transactions
  - No SPoF, DC failure resistant
  - Across multiple tables and partitions
  - Commits and rollbacks
* First class indexes
  - No additional coding
  - Online build on existing data
C*One
[Diagram labels: clients; C* storage nodes; C*One update services; Cassandra gossip & messaging; schema; partitioner; “heartbeat”; cluster topology]
Clients
* > 800 clients (all Java)
* Fat client mode
* Client is its own coordinator
* Faster
* -1 point of failure -> more reliable
[Diagram labels: clients; C*One update services; paths marked “NoTx” and “In Tx”]
C*One Update Srvs
* Manages pessimistic locks
* Generates monotonic timestamp for cells
* Manages transactions
* Failure management
Lamport Timestamp http://en.wikipedia.org/wiki/Lamport_timestamps
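A minimal sketch of a monotonic timestamp generator in the spirit of the Lamport timestamps linked above. The class and method names are hypothetical, and the slides do not show how the update services actually combine wall-clock time with the logical counter; the rule used here (never go backwards, always move past any timestamp already seen) is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a clock whose values only grow, even if the
// wall clock stalls or jumps back. Not the talk's actual implementation.
final class MonotonicClockSketch {
    private final AtomicLong last = new AtomicLong();

    // Returns a timestamp strictly greater than any returned before,
    // and not less than the supplied wall-clock reading.
    long next(long wallClockMicros) {
        return last.updateAndGet(prev -> Math.max(prev + 1, wallClockMicros));
    }

    // Lamport rule: after seeing a timestamp from another node,
    // make sure every later local timestamp is greater than it.
    long observe(long remoteTimestamp) {
        return last.updateAndGet(prev -> Math.max(prev, remoteTimestamp));
    }

    public static void main(String[] args) {
        MonotonicClockSketch clock = new MonotonicClockSketch();
        long now = System.currentTimeMillis() * 1000;
        long t1 = clock.next(now);
        long t2 = clock.next(now); // same wall-clock reading, still increases
        System.out.println(t2 > t1);
    }
}
```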
Locks mgmt
[Diagram: C*One update services labeled 00, 10, 20, 30, 40, 50]
* Transaction group masters
* Simple in-memory locking
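“Simple in-memory locking” on a transaction group master could look roughly like the sketch below. The class, the string lock keys, and the re-entrant behaviour are assumptions made for illustration; the slides only state that locks are pessimistic and held in memory.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of a pessimistic in-memory lock table.
final class LockTableSketch {
    private final Map<String, Long> owners = new HashMap<>(); // lock key -> owning tx id

    // Grants the lock if it is free or already held by the same transaction.
    synchronized boolean tryLock(String key, long txId) {
        Long owner = owners.putIfAbsent(key, txId);
        return owner == null || owner == txId;
    }

    // Called on commit or rollback: drops every lock the transaction holds.
    synchronized int releaseAll(long txId) {
        int released = 0;
        for (Iterator<Map.Entry<String, Long>> it = owners.entrySet().iterator(); it.hasNext();) {
            if (it.next().getValue() == txId) {
                it.remove();
                released++;
            }
        }
        return released;
    }

    public static void main(String[] args) {
        LockTableSketch locks = new LockTableSketch();
        System.out.println(locks.tryLock("Albums:1", 1L)); // first transaction gets the lock
        System.out.println(locks.tryLock("Albums:1", 2L)); // concurrent transaction is refused
        locks.releaseAll(1L);
        System.out.println(locks.tryLock("Albums:1", 2L)); // free again after release
    }
}
```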
Failure detection
[Diagram: update services 00, 10, 20, 30, 40, 50 across DC-1, DC-2, DC-3; heartbeat quorum]
* Each-to-every heartbeat
* Quorum cluster view (I am dead if the quorum says so)
* 50ms tick
* G1 GC
* 200ms till failure detection
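The quorum view of failure detection can be illustrated with a small sketch. Only the 200ms timeout comes from the slides; the class, the method names, and the simple majority rule are assumptions, and the real protocol is not shown in the talk.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a node is declared dead only when a majority
// of the cluster has not heard its heartbeat within the timeout.
final class HeartbeatQuorumSketch {
    static final long TIMEOUT_MS = 200; // from the slide: 200ms till failure detection

    // observer -> (peer -> time the last heartbeat from peer arrived, ms)
    private final Map<String, Map<String, Long>> lastSeen = new HashMap<>();
    private final int clusterSize;

    HeartbeatQuorumSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    void heartbeat(String observer, String peer, long nowMs) {
        lastSeen.computeIfAbsent(observer, k -> new HashMap<>()).put(peer, nowMs);
    }

    boolean suspects(String observer, String peer, long nowMs) {
        Map<String, Long> seenBy = lastSeen.get(observer);
        Long seen = seenBy == null ? null : seenBy.get(peer);
        return seen == null || nowMs - seen > TIMEOUT_MS;
    }

    // The peer's own opinion is ignored: "I am dead if the quorum says so".
    boolean isDead(String peer, Iterable<String> observers, long nowMs) {
        int votes = 0;
        for (String o : observers) {
            if (!o.equals(peer) && suspects(o, peer, nowMs)) votes++;
        }
        return votes >= clusterSize / 2 + 1;
    }

    public static void main(String[] args) {
        HeartbeatQuorumSketch fd = new HeartbeatQuorumSketch(5);
        List<String> nodes = List.of("00", "10", "20", "30", "40");
        for (String o : nodes) fd.heartbeat(o, "40", 0); // everyone heard from 40 at t=0
        fd.heartbeat("30", "40", 250);                    // only 30 hears from it again
        System.out.println(fd.isDead("40", nodes, 300));
    }
}
```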
Failure management
[Diagram: clients (> 800) send “start Tx”; nodes labeled 50, 50’, 50”]
* Master election protocol
* Speculative transaction start
Unborn transactions
* Transaction start requests queue
  - Kept in the substitute’s memory
  - Thrown away after timeout
* On range master failure
  - The queue is processed
  - “Started” replies are sent to clients (declines if already opened)
Tx start
1. StartTx
2. Lock
3. Read
4. Cache
[Diagram: clients; RAM holding the locks table and transaction state; row id=1, a=1, b=1]
Tx write
1. UPDATE
2. File
[Diagram: clients; RAM holding the locks table and transaction state; rows id=1, a=1, b=1 and id=1, a=2, b=2]
Tx read
1. Read
2. Read ?
3. resolve()
[Diagram: clients; locks table and transaction state; rows id=1, a=1, b=1 and id=1, a=2, b=2]
Tx commit
1. Commit
2.
3.
4. Ack
[Diagram: clients; RAM holding the locks table and transaction state; row id=1, a=2, b=2; LOGGED BATCH; steps 2 and 3 carry no caption]
Tx rollback
1. Rollback
[Diagram: clients; RAM holding the locks table and transaction state; row id=1, a=2, b=2]
ACID
* Atomicity - logged batch or nothing
* Consistency - application, rollback
* Isolation - Locks - Read Committed
* Durability - quorum reads and writes to Cassandra
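The transaction flow (writes buffered in RAM, reads resolved against the buffered state, commit as one batch, rollback by discarding) can be sketched as below. This is an illustration under assumptions: a plain in-memory map stands in for Cassandra, all names are hypothetical, and locking and timestamps are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-transaction state held in the update service's RAM.
final class TxStateSketch {
    private final Map<String, Map<String, Object>> storage;  // stands in for Cassandra
    private final Map<String, Map<String, Object>> pending = new HashMap<>(); // uncommitted writes

    TxStateSketch(Map<String, Map<String, Object>> storage) {
        this.storage = storage;
    }

    // Tx write: the update stays in RAM, storage is untouched.
    void update(String rowKey, String column, Object value) {
        pending.computeIfAbsent(rowKey, k -> new HashMap<>()).put(column, value);
    }

    // Tx read with resolve(): committed row overlaid with this transaction's writes.
    Map<String, Object> read(String rowKey) {
        Map<String, Object> row = new HashMap<>(storage.getOrDefault(rowKey, Map.of()));
        row.putAll(pending.getOrDefault(rowKey, Map.of()));
        return row;
    }

    // Tx commit: apply every buffered write together (the talk uses a logged batch).
    void commit() {
        pending.forEach((k, cols) -> storage.computeIfAbsent(k, x -> new HashMap<>()).putAll(cols));
        pending.clear();
    }

    // Tx rollback: nothing reached storage, so dropping the RAM state is enough.
    void rollback() {
        pending.clear();
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> db = new HashMap<>();
        db.put("1", new HashMap<>(Map.of("a", 1, "b", 1)));
        TxStateSketch tx = new TxStateSketch(db);
        tx.update("1", "a", 2);
        tx.update("1", "b", 2);
        System.out.println(tx.read("1")); // the transaction sees its own writes
        System.out.println(db.get("1"));  // storage still holds the committed row
        tx.commit();
        System.out.println(db.get("1"));
    }
}
```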
Indexes in Cassandra 2
* CREATE INDEX (owner, modified) ?
  - No composite index support
  - High cardinality
  - Doesn’t scale (synchronous full cluster scan on read)
  - Max 100K tombstones per index
CREATE TABLE photos ( id bigint primary key, owner bigint, modified timestamp, … )
SELECT * WHERE owner=? AND modified>?
Primary key: id
id   owner   modified    caption       access   …
1    111     9.10.2014   “kitty cat”   PUB      …

INDEX i1 ON photos (owner, modified) VALUES (caption, access, …);

Primary key: partition key + clustering key
owner   modified    id   caption       access   …
111     9.10.2014   1    “kitty cat”   PUB      …

SELECT * WHERE owner=? AND modified>?
SELECT * FROM i1_photo WHERE owner=? AND modified>?
Global Indexes in C*One
[Diagram: clients send UPDATE; transaction state in RAM holds id=1, a=1, b=1 and id=1, a=2, b=2; schema; step 2, idxwrites(), yields idx: a=2, b=2, id=1 for the index]
* Indexes “a la SQL”
  - Consistent
  - On more than 1 column
  - Scalable and fast
  - Built into CQL
  - No additional coding required
  - Very little penalty (+1 write)
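A rough sketch of what an idxwrites() step might compute: given the row before and after an UPDATE, derive the entry to write into the index table. Everything here is an assumption for illustration, including the names, the string form of the index key, and the removal of the stale entry; the slides show only the new entry (idx: a=2, b=2, id=1).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of deriving index mutations from a row update.
final class IndexWriteSketch {
    // Index key = indexed columns followed by the base table's primary key.
    static String indexKey(Map<String, Object> row, List<String> indexedColumns, String pkColumn) {
        StringBuilder sb = new StringBuilder();
        for (String c : indexedColumns) {
            sb.append(c).append('=').append(row.get(c)).append(',');
        }
        return sb.append(pkColumn).append('=').append(row.get(pkColumn)).toString();
    }

    // Emits nothing when the indexed columns did not change.
    static List<String> indexWrites(Map<String, Object> oldRow, Map<String, Object> newRow,
                                    List<String> indexedColumns, String pkColumn) {
        List<String> ops = new ArrayList<>();
        String oldKey = indexKey(oldRow, indexedColumns, pkColumn);
        String newKey = indexKey(newRow, indexedColumns, pkColumn);
        if (!oldKey.equals(newKey)) {
            ops.add("DELETE " + oldKey); // assumption: the stale entry is removed
            ops.add("INSERT " + newKey);
        }
        return ops;
    }

    public static void main(String[] args) {
        Map<String, Object> before = Map.of("id", 1, "a", 1, "b", 1);
        Map<String, Object> after = Map.of("id", 1, "a", 2, "b", 2);
        System.out.println(indexWrites(before, after, List.of("a", "b"), "id"));
    }
}
```

Because the index mutations are computed inside the transaction, they can travel in the same batch as the base-table write, which is one way to keep the index consistent with the data.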
ACID
Production: Photos
* 11 billion photos
* 80k reads/sec, 2k-8k tx/sec
* SQL
  - RF=1 (+1 on RAID 10, +3 in backups)
  - 32 MS SQL + 16 standby + 10 backup = 58 servers
  - Load = 100%
* C*One
  - RF=3 (in each DC)
  - 63 C* + 6 upd = 69 servers, 1/3 price
  - Load = 30%
* Tx failures 8500 /day -> 85/day
* Avg Tx timespan: <40ms
* Commit latency avg: <2ms
* Read, write, avg <2ms, 99% ~ 3ms
Photos: numbers
C*
* 22 patches to issues.apache.org - range tombstone and query fixes, optimizations, etc.
* Commit log on the fly compression (CASSANDRA-7994)
* Reliable always retry policy (CASSANDRA-6866)
* Night of the Living Dead (CASSANDRA-7872)
Oleg Anastasyev, ok.ru/oa, @m0nstermind
slideshare.net/m0nstermind
http://v.ok.ru
THANK YOU!