C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark
#CASSANDRAEU CASSANDRASUMMITEU
Richard Low | @richardalow
Mixing Batch and Real-time: Cassandra with Shark

About me
* Analytics tech lead at SwiftKey
* Cassandra freelancer
* Previous: lead Cassandra and Analytics dev at Acunu

Outline
* Batch analytics on real-time databases
* Current solutions
* Spark and Shark
* My solution
* Performance results
* Summary & future work

Batch analytics on real-time databases

Batch and real-time analytics
* Wherever there is data there are unforeseeable queries
* Real-time databases are optimized for real-time queries
* Large queries may not be possible
* Or will impact your real-time SLA

Example
* User accounts database
* Read-heavy
* Must be low latency
* Other tables on same database
* Some are write heavy
* A good fit for Cassandra!

Example data model

    CREATE TABLE user_accounts (
      userid uuid PRIMARY KEY,
      username text,
      email text,
      password text,
      last_visited timestamp,
      country text
    );

Example data model

    SELECT * FROM user_accounts LIMIT 2;

     userid   | country | email             | last_visited        | password | username
    ----------+---------+-------------------+---------------------+----------+----------
     a03dcf03 | UK      | [email protected] | 2013-10-07 09:07:36 | td7rjxwp | rlow
     b3f1871e | FR      | [email protected] | 2013-08-17 13:07:36 | moh7eksn | jean88

Marketing walks in

Ad-hoc query
“Please can you find all users from Brazil who haven’t logged in since July and have an email @yahoo.com. I need the answer by Monday.”

Ad-hoc query observations
* We have 500k users from Brazil
  * 60MB of raw data
* No way to extract by country from data model
* It’s on unchanging data (mostly: some of the people who haven’t visited for a while may suddenly come back)
* Can take hours, not days
* No expectation this query will need rerunning

Why?
* Underrepresented use case in plethora of tools
* Seen days of dev time wasted
* Want to see what can be done

Current solutions

Options
* Run Hive query on top of Cassandra
  * Will compete with Cassandra for
    * I/O
    * Memory
    * CPU
    * Network
  * Will cause extra GC pressure on Cassandra
  * Could flush filesystem cache

Options
* Write ETL script and load into another DB
  * All custom code
  * Single threaded
  * Unreliable
  * Will still flush cache on Cassandra nodes

Options
* Clone the cluster
  * Worst possible network load
  * Manual import each time
  * No incremental update
  * Need duplicate hardware

Options
* Add ‘batch analytics’ DC and run Hive there
  * Initial copy slow and affects real-time performance
  * Need duplicate hardware
  * Will drop writes when really busy

Spark and Shark

Spark
* Developed by AMPLab
* Distributed computation, like Hadoop
* Designed for iterative algorithms
* Much faster for queries with working sets that fit in RAM
* Reliability from storing lineage rather than intermediate results
* Runs on Mesos or YARN
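
The "lineage rather than intermediate results" point can be illustrated with a toy sketch. This is not Spark's API: the `LineageDataset` class and its methods are hypothetical names, and the example only assumes that each dataset remembers its source and the ordered transformations applied to it, so a lost result can be rebuilt by replaying them.

```python
from typing import Callable, List


class LineageDataset:
    """Toy dataset that records how to rebuild itself instead of storing results."""

    def __init__(self, source: Callable[[], List[int]], steps=None):
        self.source = source        # how to (re)read the base data
        self.steps = steps or []    # ordered transformations: the lineage

    def map(self, fn):
        # Return a new dataset; nothing is computed yet.
        return LineageDataset(self.source, self.steps + [("map", fn)])

    def filter(self, pred):
        return LineageDataset(self.source, self.steps + [("filter", pred)])

    def compute(self):
        # Replay the lineage from the source every time it is called.
        data = self.source()
        for kind, fn in self.steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


base = LineageDataset(lambda: list(range(10)))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.compute()
# Pretend the computed result was lost with a failed worker: no intermediate
# result was stored, so recovery is just replaying the recorded steps.
recovered = evens_squared.compute()
```

The trade-off in this sketch is recomputation cost on failure in exchange for never writing intermediate results out during normal operation.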

Spark is used by
Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Shark
* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables

    CREATE TABLE user_accounts_cached AS
      SELECT * FROM user_accounts WHERE country = 'BR';

Shark on Cassandra

Shark on Cassandra
* CqlStorageHandler
* Can use existing hive-cassandra storage handler
* Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13
* But suffers from same problems as Hive+Hadoop on Cassandra

Shark on Cassandra direct
* SSTableStorageHandler
* Run Spark workers on the Cassandra nodes
* Read directly from SSTables in separate JVM
* Limit CPU and memory through Spark/Mesos/YARN
* Limit I/O by rate limiting raw disk access
* Skip filesystem cache
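
One common way to rate limit raw disk reads is a token bucket. The sketch below is an illustration under that assumption, not the SSTableStorageHandler's actual mechanism, which the slides do not describe. `TokenBucket` and `read_throttled` are hypothetical names, and bypassing the filesystem cache is not shown.

```python
import time


class TokenBucket:
    """Allow `rate` bytes per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)   # start with a full bucket
        self.clock = clock              # injectable for testing
        self.sleep = sleep
        self.last = clock()

    def acquire(self, n):
        """Block until n bytes may be read; n must not exceed capacity."""
        if n > self.capacity:
            raise ValueError("request larger than bucket capacity")
        while True:
            now = self.clock()
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            # Wait roughly long enough for the missing tokens to arrive.
            self.sleep((n - self.tokens) / self.rate)


def read_throttled(f, bucket, chunk_size=64 * 1024):
    """Yield chunks from a binary file object, paying the bucket before each read."""
    while True:
        bucket.acquire(chunk_size)
        chunk = f.read(chunk_size)
        if not chunk:
            return
        yield chunk
```

Lowering `rate` trades batch throughput for less interference with real-time reads, which is the knob the "Limit I/O" bullet refers to.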

Cassandra on Spark: through CQL interface
[Diagram; surviving labels: Cassandra JVM, Spark worker JVM, Deserialize / Merge / Serialize, Deserialize / Process, Remote client, FS Cache, SSTables. Annotation: Latency spikes!]

Cassandra on Spark: SSTables direct
[Diagram; surviving labels: Cassandra JVM, Spark worker JVM, Deserialize / Merge / Serialize, Deserialize / Process, Remote client, SSTables, FS Cache. Annotation: Constant latency]

Disadvantages
* Equivalent to CL.ONE
* Always runs task local with the data
* Doesn’t read data in memtables
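
To see why skipping memtables matters, here is a minimal sketch of last-write-wins merging of one row across several sources. It is a simplification: tombstones and timestamp ties are ignored, and the row contents and timestamps are made up for illustration. `merge_rows` is a hypothetical helper, not a Cassandra function.

```python
def merge_rows(*versions):
    """Merge versions of one row column by column, keeping the newest timestamp."""
    merged = {}
    for version in versions:
        for column, (value, ts) in version.items():
            if column not in merged or ts > merged[column][1]:
                merged[column] = (value, ts)
    return merged


# Each source maps column name -> (value, write timestamp). Illustrative data only.
sstable_1 = {"country": ("UK", 100), "username": ("rlow", 100)}
sstable_2 = {"country": ("BR", 200)}
memtable = {"country": ("FR", 300)}   # written recently, not yet flushed to disk

# A reader that only sees SSTables on local disk misses the unflushed write.
direct_view = merge_rows(sstable_1, sstable_2)
# A read through Cassandra also consults the memtable.
cassandra_view = merge_rows(sstable_1, sstable_2, memtable)
```

Here `direct_view["country"]` holds the older value, which is acceptable for the ad-hoc query in the talk because the data is mostly unchanging.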

Performance results

Testing
* 4 node Cassandra cluster on m1.large
  * 2 cores, 7.5 GB RAM, 2 ephemeral disks
* 1 Spark master
* Spark running on Cassandra nodes
  * Limited to 1 core, 1 GB RAM
* Compare CqlStorageHandler with SSTableStorageHandler

Setup
* Cassandra 1.2.10
* 3 GB heap
* 256 tokens per node
* RF 3
* Preloaded 100M randomly generated records
  * Each node started with 9GB of data
* No optimization or tuning
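
As a rough consistency check on these figures, the per-record size implied by the setup can be compared with the one implied by the earlier "500k users, 60MB" observation. The sketch assumes replicas are spread evenly over the four nodes, that "GB" and "MB" are decimal units, and that on-disk size is close to raw size; none of that is stated on the slides.

```python
records = 100_000_000
replication_factor = 3
nodes = 4
data_per_node_bytes = 9e9          # "9GB", read as decimal gigabytes (assumption)

# With RF 3 over 4 nodes, each node holds 3/4 of all records (even spread assumed).
replicas_per_node = records * replication_factor / nodes
bytes_per_record = data_per_node_bytes / replicas_per_node

# Figures from the ad-hoc query observations slide.
brazil_users = 500_000
brazil_raw_bytes = 60e6            # "60MB", read as decimal megabytes (assumption)
bytes_per_brazil_record = brazil_raw_bytes / brazil_users

print(f"{replicas_per_node:,.0f} replicas per node")
print(f"{bytes_per_record:.0f} bytes per record on disk")
print(f"{bytes_per_brazil_record:.0f} bytes per record in the Brazil estimate")
```

Under these assumptions both estimates come out at about 120 bytes per record, which suggests the two slides describe the same data set.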

Tools
* codahale Metrics
* Ganglia
* Load generator using DataStax Java driver
* Google spreadsheet

Result 1
* No Cassandra load
* Run caching query:

    CREATE TABLE user_accounts_cached AS
      SELECT * FROM user_accounts WHERE country = 'BR';

* Takes 33 mins through CQL
* Takes 13 mins through SSTables
  * 130k records/s
* => SSTables is 2.5x faster
* Even better since CQL has access to both cores
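
The throughput and speedup figures follow from the run times, assuming the caching query scans all 100M preloaded records (the record count comes from the Setup slide; that the whole table is scanned is an assumption).

```python
records = 100_000_000      # preloaded records, from the Setup slide
cql_minutes = 33
sstable_minutes = 13

sstable_rate = records / (sstable_minutes * 60)   # records per second
cql_rate = records / (cql_minutes * 60)
speedup = cql_minutes / sstable_minutes

print(f"SSTables: {sstable_rate:,.0f} records/s")   # about 128k, quoted as 130k
print(f"CQL:      {cql_rate:,.0f} records/s")
print(f"Speedup:  {speedup:.2f}x")                  # about 2.5x
```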

Using cached results
* Now have results cached, can run super fast queries
* No I/O or extra memory
* Bounded number of cores
* Took 18 seconds

    SELECT count(*) FROM user_accounts_cached
    WHERE unix_timestamp(last_visited) < unix_timestamp('2013-08-01 00:00:00')
      AND email LIKE '%@c9%';

Result 2
* Add read load
  * Read-modify-write of accounts info
  * 200 ops/s
  * Measure latency
* Slow down SSTable loader to same rate as CQL

[Chart; surviving labels: 95%ile base, mean base]

Analysis
* Average latency 17% lower
  * Probably due to less CPU used by query
* Max 95th %ile latency 33% lower and much more predictable
  * Possibly due to less GC pressure
* Still have a latency increase over base
  * Probably due to I/O use
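
For reference, this is one way the mean, the 95th percentile, and a "percent lower" figure can be computed from latency samples. The slides do not say how the metrics tooling derives its percentiles; the nearest-rank definition below is an assumption, and the sample values are synthetic, not measurements from the talk.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def relative_reduction(baseline, candidate):
    """Fraction by which candidate is lower than baseline (0.17 means 17% lower)."""
    return (baseline - candidate) / baseline


# Synthetic latencies in milliseconds, for illustration only.
samples = list(range(1, 101))
mean_latency = statistics.mean(samples)
p95_latency = percentile(samples, 95)

# Hypothetical example: a 95th percentile of 2.0 ms against 3.0 ms is about 33% lower.
example_reduction = relative_reduction(3.0, 2.0)
```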

Result 3
* Keep read workload
* Measure same latency
* Add insert workload
  * Insert into separate table
  * 2500 ops/s

[Charts; surviving panel labels: CQL loader, SSTable loader]

Analysis
* Lots of latency, but there is anyway

Performance wrap up
* 2.5x faster with less CPU
  => uses fewer resources to do the same thing
* Lower, more predictable latencies when at same speed
  => controlled resource usage lowers latency impact
* Could limit further to make impact unnoticeable

Summary
* Discussed analytics use case not well served by current tools
* Spark, Shark
* SSTableStorageHandler
* Performance results

Future
* Needs a name
* Github
* Speak to me if you want to use it
* Speak to me if you want to contribute

Thank you!
Richard Low | @richardalow