TupleJump: Breakthrough OLAP performance on Cassandra and Spark


Transcript of TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Page 1: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Breakthrough OLAP Performance with Cassandra and Spark

Evan Chan

September 2015

Page 2: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Who am I?

- Distinguished Engineer, Tuplejump
- @evanfchan
- http://velvia.github.io
- User and contributor to Spark since 0.9, Cassandra since 0.6
- Co-creator and maintainer of Spark Job Server

Page 3: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

About Tuplejump

Tuplejump is a big data technology leader providing solutions for rapid insights from data.

- Calliope - the first Spark-Cassandra integration
- Stargate - an open source Lucene indexer for Cassandra
- SnackFS - open source HDFS for Cassandra

Page 4: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Didn't I attend the same talk last year?

Similar title, but mostly new material. Will reveal new open source projects! :)

Page 5: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Problem Space

Need analytical database / queries on structured big data:

- Something SQL-like, very flexible and fast
- Pre-aggregation too limiting
- Fast data / constant updates
- Ideally, want my queries to run over fresh data too

Page 6: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Example: Video analytics

- Typical collection and analysis of consumer events
- 3 billion new events every day
- Video publishers want updated stats, the sooner the better
- Pre-aggregation only enables simple dashboard UIs
- What if one wants to offer more advanced analysis, or a generic data query API?
  - E.g., top countries filtered by device type, OS, browser

Page 7: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Requirements

- Scalable - rules out PostgreSQL, etc.
- Easy to update and ingest new data
  - Not traditional OLAP cubes - that's not what I'm talking about
- Very fast for analytical queries - OLAP, not OLTP
- Extremely flexible queries
- Preferably open source

Page 8: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Parquet

- Widely used, lots of support (Spark, Impala, etc.)
- Problem: Parquet is read-optimized, not easy to use for writes
  - Cannot support idempotent writes
  - Optimized for writing very large chunks, not small updates
  - Not suitable for time series, IoT, etc.
  - Often needs multiple passes of jobs for compaction of small files, deduplication, etc.

People really want a database-like abstraction, not a file format!

Page 9: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Cassandra

- Horizontally scalable
- Very flexible data modelling (lists, sets, custom data types)
- Easy to operate
- Perfect for ingestion of real-time / machine data
- Best-of-breed storage technology, huge community
- BUT: simple queries only
- OLTP-oriented

Page 10: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Apache Spark

- Horizontally scalable, in-memory queries
- Functional Scala transforms - map, filter, groupBy, sort, etc.
- SQL, machine learning, streaming, graph, R, and many more plugins, all on ONE platform - feed your SQL results to a logistic regression, easy! (See the sketch below.)
- Huge number of connectors with every single storage technology
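A minimal sketch of that SQL-to-MLlib hand-off, using the Spark 1.x MLlib API (the events table and its label/f1/f2 columns are hypothetical):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Each Row from the SQL result becomes a LabeledPoint for training
val training = sqlContext.sql("SELECT label, f1, f2 FROM events").map { row =>
  LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2)))
}
val model = new LogisticRegressionWithLBFGS().run(training.cache())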

Page 11: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Spark provides the missing fast, deep analytics piece of Cassandra!

...tying together fast event ingestion and rich, deep analytics!

Page 12: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

How to make Spark and CassandraGo Fast

Page 13: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Spark on Cassandra: No Caching

Page 14: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Not Very Fast, but Real-Time Updates

Spark does no caching by default - you will always be reading from C*!

Pros:

- No need to fit all data in memory
- Always get the latest data

Cons:

- Pretty slow for ad hoc analytical queries using regular CQL tables
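To make "always reading from C*" concrete, here is a minimal sketch using the Spark-Cassandra Connector's RDD API (keyspace and table names assumed from the examples later in this talk):

import com.datastax.spark.connector._

val rdd = sc.cassandraTable("test", "gdelt")
rdd.count() // full scan against Cassandra
rdd.count() // scans Cassandra again - nothing was cached in Spark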

Page 15: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

How to go Faster?

- Read less data
- Do less I/O
- Make your computations faster

Page 16: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Spark as Cassandra's Cache

Page 17: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Caching a SQL Table from Cassandra

DataFrames support in Cassandra Connector 1.4.0 (and 1.3.0):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .option("table", "gdelt")
  .option("keyspace", "test")
  .load()
df.registerTempTable("gdelt")
sqlContext.cacheTable("gdelt")
sqlContext.sql("SELECT count(monthyear) FROM gdelt").show()

Page 18: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

How Spark SQL's Table Caching Works

Page 19: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Spark Cached Tables can be Really Fast

GDELT dataset, 4 million rows, 60 columns, localhost:

Method   | secs
Uncached | 317
Cached   | 0.38

Almost a 1000x speedup!

On an 8-node EC2 c3.XL cluster with 117 million rows, common queries run in 1-2 seconds against the cached dataset.
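Numbers like these can be reproduced by wrapping the query in a simple timer; a minimal sketch (the time helper here is ours, not a Spark API):

def time[A](body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"Elapsed: ${(System.nanoTime() - start) / 1e9} s")
  result
}

time { sqlContext.sql("SELECT count(monthyear) FROM gdelt").show() } // uncached scan of C*
sqlContext.cacheTable("gdelt")
sqlContext.sql("SELECT count(monthyear) FROM gdelt").show()          // materializes the cache
time { sqlContext.sql("SELECT count(monthyear) FROM gdelt").show() } // now served from cache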

Page 20: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Problems with Cached Tables

- Still have to read the data from Cassandra first, which is slow
- Amount of RAM: your entire data + extra for conversion to the cached table
- Cached tables only live in Spark executors - by default:
  - tied to a single context - not HA
  - once any executor dies, must re-read data from C*
- Caching takes time: converting from RDD[Row] to a compressed columnar format
- Cannot easily combine new RDD[Row] with cached tables (and keep speed)

Page 21: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Problems with Cached Tables

If you don't have enough RAM, Spark can cache your tables partly to disk. This is still way, way faster than scanning an entire C* table. However, cached tables are still tied to a single Spark context/application.

Also: rdd.cache() is NOT the same as SQLContext's cacheTable! (See the sketch below.)
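A minimal sketch of the difference, reusing the df and sqlContext from the caching example:

// rdd.cache() stores deserialized Row objects - row-oriented and bulky
df.rdd.cache()

// cacheTable() builds Spark SQL's compressed, columnar in-memory format,
// which is what enables the fast scans shown earlier
sqlContext.cacheTable("gdelt")
sqlContext.uncacheTable("gdelt") // release the columnar cache when done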

Page 22: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Faster Queries Through Columnar Storage

Wait, I thought Cassandra was columnar?

Page 23: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

How Cassandra stores your CQL Tables

Suppose you had this CQL table:

CREATE TABLE employees (
  department text,
  empId text,
  first text,
  last text,
  age int,
  PRIMARY KEY (department, empId)
);

Page 24: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

How Cassandra stores your CQL Tables

PartitionKey | 01:first | 01:last | 01:age | 02:first | 02:last  | 02:age
Sales        | Bob      | Jones   | 34     | Susan    | O'Connor | 40
Engineering  | Dilbert  | P       | ?      | Dogbert  | Dog      | 1

Each row is stored contiguously. All columns in row 2 come after row 1.

To analyze only age, C* still has to read every field.

Page 25: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Cassandra is really a row-based, OLTP-oriented datastore.

Unless you know how to use it otherwise :)

Page 26: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

"The traditional row-based data storage approach is dead."
- Michael Stonebraker

Page 27: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Columnar Storage

Name column
Row #: 0  1
Value: 0  1

Dictionary: {0: "Barak", 1: "Hillary"}

Age column
Row #: 0   1
Value: 46  66

Each column's values are stored together, and strings are dictionary-encoded as small integers.
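A minimal Scala sketch of the dictionary-encoding idea above (values chosen purely for illustration):

// Store each distinct string once; each row keeps only a small integer code
val names = Seq("Barak", "Hillary")
val dictionary = names.distinct.zipWithIndex.toMap // Map("Barak" -> 0, "Hillary" -> 1)
val encoded = names.map(dictionary)                // Seq(0, 1)

// Decoding is just a reverse lookup
val reverse = dictionary.map(_.swap)
val decoded = encoded.map(reverse)                 // Seq("Barak", "Hillary")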

Page 28: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Columnar Format solves I/O

How much data can I query interactively? More than you think!

Page 29: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

 

Columnar Storage Performance Study

http://github.com/velvia/cassandra-gdelt

Scenario     | Ingest   | Read all columns | Read one column
Narrow table | 1927 sec | 505 sec          | 504 sec
Wide table   | 3897 sec | 365 sec          | 351 sec
Columnar     | 93 sec   | 8.6 sec          | 0.23 sec

On reads, using a columnar format is up to 2190x faster, while ingestion is 20-40x faster.

Page 30: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Columnar Format solves Caching

- Use the same format on disk, in cache, and in memory scans
- Caching works a lot better when the cached object is the same!
- No data format dissonance means bringing in new bits of data and combining them with existing cached data is seamless

Page 31: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

So, why isn't everybody doing this?

- No columnar storage format designed to work with NoSQL stores
- Efficient conversion to/from columnar format is a hard problem
- Most infrastructure is still row-oriented:
  - Spark SQL/DataFrames are based on RDD[Row]
  - Spark Catalyst is a row-oriented query planner

Page 32: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

"All hard work leads to profit, but mere talk leads to poverty."
- Proverbs 14:23

Page 34: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Introducing FiloDBDistributed. Versioned. Columnar. Built for Streaming.

 

github.com/tuplejump/FiloDB

Page 35: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

FiloDB - What?

Page 36: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Distributed

Apache Cassandra. Scale out with no SPOF. Cross-datacenter replication. Proven storage and database technology.

Page 37: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Versioned

Incrementally add a column or a few rows as a new version. Easily control which versions to query. Roll back changes inexpensively.

Stream out new versions as continuous queries :)

Page 38: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Columnar

- Parquet-style storage layout
- Retrieve select columns and minimize I/O for analytical queries
- Add a new column without having to copy the whole table
- Vectorization and lazy/zero serialization for extreme efficiency

Page 39: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

What's in the name?

Rich sweet layers of distributed, versioned database goodness

Page 40: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

100% Reactive

Built completely on the Typesafe Platform:

- Scala 2.10 and SBT
- Spark (including a custom data source)
- Akka Actors for rational scale-out concurrency
- Futures for I/O
- Phantom Cassandra client for reactive, type-safe C* I/O
- Typesafe Config

Page 41: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Spark SQL Queries!

CREATE TEMPORARY TABLE gdelt
USING filodb.spark
OPTIONS (dataset "gdelt");

SELECT Actor1Name, Actor2Name, AvgTone FROM gdelt
ORDER BY AvgTone DESC LIMIT 15

- Read from and write to Spark DataFrames (see the sketch below)
- Append/merge to a FiloDB table from Spark Streaming
- Use Tableau or any other JDBC tool
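A minimal sketch of the DataFrame path, reusing the filodb.spark source and dataset option from the SQL above (newRows is a hypothetical DataFrame; any additional required options, such as key configuration, are omitted):

import org.apache.spark.sql.SaveMode

// Read a FiloDB dataset as a DataFrame
val gdelt = sqlContext.read.format("filodb.spark")
  .option("dataset", "gdelt")
  .load()

// Append new rows from another DataFrame (writes are idempotent)
newRows.write.format("filodb.spark")
  .option("dataset", "gdelt")
  .mode(SaveMode.Append)
  .save()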

Page 42: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

FiloDB - Why?

Fast Streaming Data + Big Data, All in One!

Page 43: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Analytical Query PerformanceUp to 200x Faster Queries for Spark on Cassandra 2.x

Parquet Performance with Cassandra Flexibility(Stick around for the demo)

Page 44: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Fast Event/Time-Series Ad-Hoc Analytics

- New rows appended via Kafka
- Writes are idempotent - no need to dedup!
- Converted to columnar chunks on ingest and stored in C*
- Only the necessary columnar chunks are read into Spark for minimal I/O

Page 45: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Fast Event/Time-Series Ad-Hoc Analytics

Entity  | Time1 | Time2
US-0123 | d1    | d2
NZ-9495 | d1    | d2

Model your time series with FiloDB similarly to Cassandra (see the sketch below):

- Sort key: timestamp, similar to a clustering key
- Partition key: event/machine entity

FiloDB keeps data sorted while stored in efficient columnar storage.
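A hedged sketch of ingesting such a table through the Spark data source; the sort_column and partition_column option names are illustrative assumptions, not confirmed API:

import org.apache.spark.sql.SaveMode

// eventsDF has entity, timestamp, and value columns (hypothetical)
eventsDF.write.format("filodb.spark")
  .option("dataset", "machine_metrics")
  .option("sort_column", "timestamp")   // assumed option: the sort key
  .option("partition_column", "entity") // assumed option: the partition key
  .mode(SaveMode.Append)
  .save()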

Page 46: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

FiloDB = Streaming + Columnar

Page 47: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Extend your Cassandra (2.x) Investment

Make it work for batch and ad-hoc analytics!

Page 48: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Simplify your Lambda Architecture...

(https://www.mapr.com/developercentral/lambda-architecture)

Page 49: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

With Spark, Cassandra, and FiloDB

Ma, where did all the components go?

- You mean I don't have to deal with Hadoop?
- Use Cassandra as a front end to store IoT data first

Page 50: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

FiloDB vs Parquet

- Comparable read performance - with lots of space to improve
  - Assuming co-located Spark and Cassandra
  - On localhost, both are subsecond for simple queries (GDELT 1979-1984)
  - FiloDB has more room to grow, due to hot column caching and much less deserialization overhead
- Lower memory requirement due to much smaller block sizes
- Much better fit for IoT / machine / time-series applications
  - Idempotent writes by primary key with no deduplication
- Limited support for types
  - array / set / map support is not there yet, but will be added

Page 51: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

FiloDB - How?

Page 52: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

FiloDB Architecture

Page 53: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Ingestion and Storage

Current version:

- Each dataset is stored using 2 regular Cassandra tables
- Ingestion using Spark (DataFrames or SQL)

Future version?

- Automatic ingestion of your existing C* data using a custom secondary index

Page 54: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Towards Extreme Query Performance

Page 55: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

The filo project is a binary data vector library designed for extreme read performance with minimal deserialization costs.

http://github.com/velvia/filo

- Designed for NoSQL, not a file format
- Random or linear access
- On or off heap
- Missing value support
- Scala only, but cross-platform support possible

Page 56: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

What is the ceiling?

This Scala loop can read integers from a binary Filo blob at a rate of 2 billion integers per second, single threaded:

// `sc` is the Filo integer vector wrapping the binary blob and
// `numValues` its length (both defined elsewhere); `optimized` is the
// scalaxy-loops macro that compiles the range loop to a plain while loop.
def sumAllInts(): Int = {
  var total = 0
  for { i <- 0 until numValues optimized } {
    total += sc(i)
  }
  total
}

Page 57: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Vectorization of Spark Queries

The Tungsten project: process many elements from the same column at once, keep data in L1/L2 cache.

Coming in Spark 1.4 through 1.6.

Page 58: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Hot Column Caching in Tachyon

- Has a "table" feature, originally designed for Shark
- Keep hot columnar chunks in shared off-heap memory for fast access

Page 59: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

FiloDB - Roadmap

- Support for many more data types and sort and partition keys - please give us your input!
- Non-Spark ingestion API - your input is again needed
- In-memory caching for significant query speedup
- Projections - often-repeated queries can be sped up significantly with projections
- Use of GPU and SIMD instructions to speed up queries

Page 60: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

You can help!

- Send me your use cases for fast big data analysis on Spark and Cassandra
  - Especially IoT, event, and time-series
  - What is your data model?
- Email me if you want to contribute

Page 61: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Thanks...

...to the entire OSS community, but in particular:

- Lee Mighdoll, Nest/Google
- Rohit Rai and Satya B., Tuplejump
- My colleagues at Socrata

"If you want to go fast, go alone. If you want to go far, go together."
-- African proverb

Page 62: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

DEMO TIME

GDELT: Regular C* Tables vs FiloDB

Page 63: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Extra Slides

Page 64: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

The scenarios

Dataset: GDELT - Global Database of Events, Language, and Tone

- 1979 to now; 60 columns, 250 million+ rows, 250 GB+
- Benchmarked on the first 4 million rows: localhost, SSD, C* 2.0.9, LZ4 compression. Compaction performed before the read benchmarks.

Let's compare Cassandra I/O only, no caching or Spark:

- Narrow table - CQL table with one row per partition key
- Wide table - wide rows with 10,000 logical rows per partition key
- Columnar layout - 1000 rows per columnar chunk, wide rows, with dictionary compression

Page 65: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Disk space usage

Scenario     | Disk used
Narrow table | 2.7 GB
Wide table   | 1.6 GB
Columnar     | 0.34 GB

The disk space usage helps explain some of the numbers.

Page 66: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Connecting Spark to Cassandra

- Datastax's Spark Cassandra Connector
- Tuplejump's Calliope

Get started in one line with spark-shell!

bin/spark-shell \
  --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3 \
  --conf spark.cassandra.connection.host=127.0.0.1

Page 67: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

What about C* Secondary Indexing?

The Spark-Cassandra Connector and Calliope can both reduce I/O by using Cassandra secondary indices. Does this work with caching?

No, not really, because only the filtered rows would be cached. Subsequent queries against this limited cached table would not give you expected results.

Page 68: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Turns out this has been solved before!

Even Facebook uses Vertica.

Page 69: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

MPP Databases

- Easy writes plus fast queries, with constant transfers
- Automatic query optimization by storing intermediate query projections
- C-Store: Stonebraker et al. paper (Brown University)

Page 70: TupleJump: Breakthrough OLAP performance on Cassandra and Spark

"When in doubt, use brute force."
- Ken Thompson