Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark

  • date post

    05-Dec-2014
  • Category

    Technology

description

Presenter: Evan Chan, Principal Software Engineer at Socrata Inc. How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.

Transcript of Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark

  • 1. OLAP WITH SPARK AND CASSANDRA #CassandraSummit EVAN CHAN SEPT 2014
  • 2. WHO AM I? Principal Engineer, Socrata, Inc. @evanfchan http://github.com/velvia Creator of Spark Job Server
  • 3. WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE. data.edmonton.ca finances.worldbank.org data.cityofchicago.org data.seattle.gov data.oregon.gov data.wa.gov www.metrochicagodata.org data.cityofboston.gov info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov data.nola.gov data.illinois.gov data.colorado.gov data.austintexas.gov data.undp.org www.opendatanyc.com data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it data.montgomerycountymd.gov data.cityofnewyork.us data.acgov.org data.baltimorecity.gov data.energystar.gov data.somervillema.gov data.maryland.gov data.taxpayer.net bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
  • 4. WE ARE SWIMMING IN DATA!
  • 5. BIG DATA AT SOCRATA Tens of thousands of datasets, each one up to 30 million rows Customer demand for billion row datasets Want to analyze across datasets
  • 6. BIG DATA AT OOYALA 2.5 billion analytics pings a day = almost a trillion events a year. Roll up tables - 30 million rows per day
  • 7. HOW CAN WE ALLOW CUSTOMERS TO QUERY A YEAR'S WORTH OF DATA? Flexible: complex queries included (sometimes you can't denormalize your data enough). Fast: interactive speeds. Near real time: can't make customers wait hours before querying new data.
  • 8. RDBMS? POSTGRES? Start hitting latency limits at ~10 million rows. No robust and inexpensive solution for querying across shards. No robust way to scale horizontally. Postgres runs a query on a single thread unless you partition (painful!). Complex and expensive to improve performance (e.g. rollup tables, huge expensive servers).
  • 9. OLAP CUBES? Materialize summary for every possible combination Too complicated and brittle Takes forever to compute - not for real time Explodes storage and memory
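The storage blow-up on the slide above follows from simple counting: a full cube needs one summary per subset of the dimensions, so n dimensions mean 2^n group-by combinations. A minimal sketch of that arithmetic, where the dimension names are hypothetical and not taken from the talk:

```scala
object CubeSize {
  // Every subset of the dimensions is one group-by combination that a full
  // cube would have to materialize, so n dimensions give 2^n summaries.
  def groupings(dims: Seq[String]): Seq[Set[String]] =
    dims.toSet.subsets().toSeq

  def main(args: Array[String]): Unit = {
    val dims = Seq("country", "device", "day") // hypothetical dimensions
    println(groupings(dims).size) // 8
    println(BigInt(2).pow(20))    // number of summaries for 20 dimensions
  }
}
```

Each of those summaries also grows with the cardinality of the dimensions it groups by, which is why the count alone understates the cost.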
  • 10. When in doubt, use brute force - Ken Thompson
  • 11. CASSANDRA Horizontally scalable Very flexible data modelling (lists, sets, custom data types) Easy to operate No fear of number of rows or documents Best of breed storage technology, huge community BUT: Simple queries only
  • 12. APACHE SPARK Horizontally scalable, in-memory queries Functional Scala transforms - map, filter, groupBy, sort etc. SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy! THE Hottest big data platform, huge community, leaving Hadoop in the dust Developers love it
  • 13. SPARK PROVIDES THE MISSING FAST, DEEP ANALYTICS PIECE OF CASSANDRA!
  • 14. INTEGRATING SPARK AND CASSANDRA Scala solutions: Datastax integration: https://github.com/datastax/spark-cassandra-connector (CQL-based) Calliope
  • 15. A bit more work: Use a traditional Cassandra client with RDDs. Use an existing InputFormat, like CqlPagedInputFormat. Probably the only reason to go this route is that you are not on a CQL version of Cassandra, or you're using Shark/Hive.
  • 16. A SPARK AND CASSANDRA OLAP ARCHITECTURE
  • 17. SEPARATE STORAGE AND QUERY LAYERS Combine best of breed storage and query platforms Take full advantage of evolution of each Storage handles replication for availability Query can replicate data for scaling read concurrency - independent!
  • 18. SCALE NODES, NOT DEVELOPER TIME!!
  • 19. KEEPING IT SIMPLE Maximize row scan speed Columnar representation for efficiency Compressed bitmap indexes for fast algebra Functional transforms for easy memoization, testing, concurrency, composition
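The "bitmap indexes for fast algebra" idea can be sketched as follows. This is an illustration only, not the engine described in the talk: it uses the uncompressed java.util.BitSet as a stand-in for a compressed bitmap, and the column values are made up.

```scala
import java.util.BitSet

object BitmapIndex {
  // One bitmap per distinct value: bit i is set when row i holds that value.
  def build(column: Seq[String]): Map[String, BitSet] = {
    val index = scala.collection.mutable.Map.empty[String, BitSet]
    column.zipWithIndex.foreach { case (value, row) =>
      index.getOrElseUpdate(value, new BitSet(column.length)).set(row)
    }
    index.toMap
  }

  // AND two bitmaps without mutating either input.
  def and(a: BitSet, b: BitSet): BitSet = {
    val out = a.clone().asInstanceOf[BitSet]
    out.and(b)
    out
  }

  def main(args: Array[String]): Unit = {
    val crime = build(Seq("Burglary", "Theft", "Theft", "Burglary"))
    val ward  = build(Seq("A", "A", "B", "B")) // hypothetical second column
    // Rows where crime = "Theft" AND ward = "A"
    println(and(crime("Theft"), ward("A"))) // {1}
  }
}
```

A filter with several predicates becomes a few bitwise operations over whole columns, and only the surviving row positions need to be scanned afterwards.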
  • 20. SPARK AS CASSANDRA'S CACHE
  • 21. EVEN BETTER: TACHYON OFF-HEAP CACHING
  • 22. INITIAL ATTEMPTS val rows = Seq( Seq("Burglary", "19xx Hurston", 10), Seq("Theft", "55xx Floatilla Ave", 5) ) sc.parallelize(rows) .map { values => (values(0), values) } .groupByKey .mapValues(_.map(_(2).asInstanceOf[Int]).sum)
  • 23. No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own For every row, need to extract out needed columns Ability to select arbitrary columns means using Seq[Any], no type safety Boxing makes integer aggregation very expensive and memory inefficient
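The boxing and type-safety cost described above can be seen without Spark. The sketch below is an illustration under assumptions, not the original engine: it sums the same counts once from Seq[Any] rows and once from a primitive Array[Int] column.

```scala
object RowVsColumn {
  // Row-oriented: each row is a Seq[Any], so the count is stored as a boxed
  // Integer and every access needs a runtime cast.
  val rows: Seq[Seq[Any]] = Seq(
    Seq("Burglary", "19xx Hurston", 10),
    Seq("Theft", "55xx Floatilla Ave", 5)
  )

  def sumRows(rows: Seq[Seq[Any]]): Int =
    rows.map(r => r(2).asInstanceOf[Int]).sum

  // Column-oriented: the same values held as a primitive Array[Int].
  val counts: Array[Int] = Array(10, 5)

  def sumColumn(col: Array[Int]): Int = {
    var total = 0
    var i = 0
    while (i < col.length) { total += col(i); i += 1 }
    total
  }

  def main(args: Array[String]): Unit = {
    println(sumRows(rows))     // 15
    println(sumColumn(counts)) // 15
  }
}
```

Both give the same answer, but the row version fails only at runtime if the cast is wrong, while the column version is checked by the compiler and reads unboxed ints from contiguous memory.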
  • 24. COLUMNAR STORAGE AND QUERYING
  • 25. The traditional row-based data storage approach is dead - Michael Stonebraker
  • 26. TRADITIONAL ROW-BASED STORAGE Same layout in memory and on disk: (Name, Age) = (Barak, 46), (Hillary, 66). Each row is stored contiguously. All columns in row 2 come after row 1.
  • 27. COLUMNAR STORAGE (MEMORY) Name column: row 0 = 0, row 1 = 1. Dictionary: {0: "Barak", 1: "Hillary"}. Age column: row 0 = 46, row 1 = 66.
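A minimal sketch of the dictionary-encoded Name column from the slide above. The Encoded class and the encode helper are hypothetical names introduced for illustration, not part of any library or of the engine in the talk.

```scala
object DictColumn {
  // Encoded column: small integer codes plus a code -> string dictionary.
  final case class Encoded(codes: Array[Int], dictionary: Map[Int, String]) {
    def decode(row: Int): String = dictionary(codes(row))
  }

  def encode(values: Seq[String]): Encoded = {
    // Assign codes in order of first appearance.
    val codeOf: Map[String, Int] = values.distinct.zipWithIndex.toMap
    Encoded(values.map(codeOf).toArray, codeOf.map(_.swap))
  }

  def main(args: Array[String]): Unit = {
    val name = encode(Seq("Barak", "Hillary"))
    println(name.codes.mkString(", ")) // 0, 1
    println(name.decode(1))            // Hillary
  }
}
```

Each distinct string is stored once, and the column itself becomes an Array[Int] that can be compared and grouped without touching the strings.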
  • 28. COLUMNAR STORAGE (CASSANDRA) Review: each physical row in Cassandra (identified by its partition key) stores its columns together on disk. Schema CF (Rowkey: Type): Name: StringDict; Age: Int. Data CF (Rowkey: columns 0, 1): Name: 0, 1; Age: 46, 66.
  • 29. ADVANTAGES OF COLUMNAR STORAGE Compression: dictionary compression (HUGE savings for low-cardinality string columns), RLE. Reduce I/O: only columns needed for query are loaded from disk. Can keep strong types in memory, avoid boxing. Batch multiple rows in one cell for efficiency.
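Run-length encoding (RLE) works well on a column because equal values tend to sit next to each other, especially after dictionary encoding. A minimal sketch, assuming a column of integer codes; the Rle object is a hypothetical helper, not an existing API.

```scala
object Rle {
  // Collapse consecutive repeats into (value, runLength) pairs.
  def encode[A](values: Seq[A]): List[(A, Int)] =
    values.foldLeft(List.empty[(A, Int)]) {
      case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
      case (acc, x)                      => (x, 1) :: acc
    }.reverse

  // Expand the runs back into the original sequence.
  def decode[A](runs: Seq[(A, Int)]): List[A] =
    runs.toList.flatMap { case (v, n) => List.fill(n)(v) }

  def main(args: Array[String]): Unit = {
    val codes = Seq(0, 0, 0, 1, 1, 0)
    val runs = encode(codes)
    println(runs)                         // List((0,3), (1,2), (0,1))
    println(decode(runs) == codes.toList) // true
  }
}
```

The saving depends on run length, so sorted or low-cardinality columns benefit most.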
  • 30. ADVANTAGES OF COLUMNAR QUERYING Cache locality for aggregating a column of data. Take advantage of CPU/GPU vector instructions for ints / doubles. Avoid row-ifying until the last possible moment. Easy to derive computed columns. Use vector data / linear math libraries.
  • 31. COLUMNAR QUERY ENGINE VS ROW-BASED IN SCALA Custom RDD of column-oriented blocks of data Uses ~10x less heap 10-100x faster for group by's on a single node Scan speed in excess of 150M rows/sec/core for integer aggregations
  • 32. SO, GREAT, OLAP WITH CASSANDRA AND SPARK. NOW WHAT?
  • 33. DATASTAX: CASSANDRA SPARK INTEGRATION Datastax Enterprise now comes with HA Spark (HA master, that is). spark-cassandra-connector
  • 34. SPARK SQL Appeared with Spark 1.0 In-memory columnar store Can read from Parquet and JSON now; direct Cassandra integration coming Querying is not column-based (yet) No indexes Write custom functions in Scala .... take that Hive UDFs!! Integrates well with MLBase, Scala/Java/Python
  • 35. CACHING A SQL TABLE FROM CASSANDRA val sqlContext = new org.apache.spark.sql.SQLContext(sc) sc.cassandraTable[GDeltRow]("gdelt", "1979to2009") .registerAsTable("gdelt") sqlContext.cacheTable("gdelt") sqlContext.sql("SELECT Actor2Code, Actor2Name, Actor2CountryCode, AvgTone from gdelt ORDER Remember: Spark is lazy, so nothing is executed until the collect(). In Spark 1.1+, use registerTempTable in place of registerAsTable.
  • 36. SOME PERFORMANCE NUMBERS GDELT dataset, 117 million rows, 57 columns, ~50GB. Spark 1.0.2, AWS 8 x c3.xlarge, cached in memory. Query: avg time (sec). SELECT count(*) FROM gdelt WHERE Actor2CountryCode = 'CHN': 0.49. SELECT 4 columns Top K: 1.51. SELECT Top countries by Avg Tone (Group By): 2.69.
  • 37. IMPORTANT - CACHING By default, queries will read data from source - Cassandra - every time Spark RDD Caching - much faster, but big waste of memory (row oriented) Spark SQL table caching - fastest, memory efficient
  • 38. WORK STILL NEEDED Indexes Columnar querying for fast aggregation Tachyon support for Cassandra/CQL Efficient reading from columnar storage formats
  • 39. LESSONS Extremely fast distributed querying for these use cases: data doesn't change much (and only bulk changes); analytical queries for a subset of columns; focused on numerical aggregations; small numbers of group bys. For fast query performance, cache your data using Spark SQL. Concurrent queries are a frontier with Spark; use additional Spark contexts.
  • 40. THANK YOU!
  • 41. EXTRA SLIDES
  • 42. EXAMPLE CUSTOM INTEGRATION USING ASTYANAX val cassRDD = sc.parallelize(rowkeys). flatMap { rowkey => columnFamily.get(rowkey).execute().asScala }
  • 43. SOME COLUMNAR ALTERNATIVES MonetDB and Infobright - true columnar stores (storage + querying) Vertica and C-Store Google BigQuery - columnar cloud database, Dremel based Amazon Redshift