Using Spark over Cassandra

REAL TIME ANALYTICS WITH SPARK OVER CASSANDRA Meetup - Jan 27, 2014

description

Overview of using Apache Spark over Cassandra to serve fast, online queries over big data

Transcript of Using Spark over Cassandra

Page 1: Using Spark over Cassandra

REAL TIME ANALYTICS WITH SPARK OVER CASSANDRA

Meetup - Jan 27, 2014

Page 2: Using Spark over Cassandra

Spark is a distributed open-source framework for real-time data processing over Hadoop. Influenced by Google’s Dremel project, it was developed at UC Berkeley and is now part of the Apache Incubator. Spark introduces a functional MapReduce model for iterative computations over distributed data.

Intro - What’s Spark?
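To make the functional model concrete, here is a minimal sketch (not from the original deck) of Spark’s Scala API: an RDD is built from a hypothetical "clicks.csv" file, cached in memory, and then reused for two separate passes - the kind of repeated iteration over distributed data described above.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicits that add reduceByKey etc. to pair RDDs

object SparkIntroSketch {
  def main(args: Array[String]) {
    // "local[2]" runs Spark in-process with two threads; a cluster would use spark://host:7077
    val sc = new SparkContext("local[2]", "intro-sketch")

    // Hypothetical input: one click record per line, formatted as "keyword,cost"
    val clicks = sc.textFile("clicks.csv")
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))
      .cache()  // keep the parsed RDD in memory for repeated passes

    // Pass 1: total cost per keyword
    val costPerKeyword = clicks.reduceByKey(_ + _)

    // Pass 2 over the same cached data: click count per keyword
    val clicksPerKeyword = clicks.mapValues(_ => 1L).reduceByKey(_ + _)

    costPerKeyword.take(5).foreach(println)
    clicksPerKeyword.take(5).foreach(println)

    sc.stop()
  }
}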

Page 3: Using Spark over Cassandra

1. Incoming event digestion (filtering, categorizing, storing). We currently use RabbitMQ and Storm, but Spark could be used here.

2. Batch processing. In our case: attribution per conversion, aggregation per keyword/ad and optimization per campaign. We currently have a proprietary infrastructure that doesn’t scale very well. Spark would shine here.

3. Grids/widgets for online use - slice ’n dice. We currently have aggregation tables in MySQL. A short demo of what we did here with Spark follows...

Basic problems

Page 4: Using Spark over Cassandra

We manage billions of keywords. We handle hundreds of millions of clicks and conversions per day. Our clients query the data in many different ways:

● different aggregations, time-periods, filters, sorts

● drill-down to specific items of concern

Problem #3: Grids over billions of cells

Page 5: Using Spark over Cassandra

Architecture for demo

Diagram: the grid in the web/app server sends queries to the Spark master; two worker nodes each run a Spark worker co-located with a Cassandra node.
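As a rough illustration of how the app server behind the grid could attach to this cluster (a sketch using the Spark 0.8-era constructor; host names, ports and paths are assumptions, not from the deck):

import org.apache.spark.SparkContext

// Connect the app server to the Spark master from the diagram.
// Master URL, SPARK_HOME and the application jar path are hypothetical.
val sc = new SparkContext(
  "spark://spark-master:7077",            // Spark master node
  "grid-demo",                            // application name shown in the Spark UI
  "/opt/spark",                           // Spark installation on the workers
  Seq("/opt/app/grid-demo-assembly.jar")  // jar with the job code, shipped to the workers
)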

Page 6: Using Spark over Cassandra

Demo...

Page 7: Using Spark over Cassandra

Code Snippet - setting up an RDD

import java.nio.ByteBuffer
import java.util

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.cassandra.db.IColumn
import org.apache.cassandra.hadoop.{ColumnFamilyInputFormat, ConfigHelper}

// Build a Hadoop job configuration telling ColumnFamilyInputFormat how to reach Cassandra
val job = new Job()
job.setInputFormatClass(classOf[ColumnFamilyInputFormat])
val configuration: Configuration = job.getConfiguration

// Where Cassandra lives (Thrift host/port)
ConfigHelper.setInputInitialAddress(configuration, cassandraHost)
ConfigHelper.setInputRpcPort(configuration, cassandraPort)
ConfigHelper.setOutputInitialAddress(configuration, cassandraHost)
ConfigHelper.setOutputRpcPort(configuration, cassandraPort)

// What to read: keyspace, column family and which columns (slice predicate)
ConfigHelper.setInputColumnFamily(configuration, keyspaceName, columnFamily)
ConfigHelper.setThriftFramedTransportSizeInMb(configuration, 2047)
ConfigHelper.setThriftMaxMessageLengthInMb(configuration, 2048)
ConfigHelper.setInputSlicePredicate(configuration, predicate)
ConfigHelper.setInputPartitioner(configuration, "Murmur3Partitioner")
ConfigHelper.setOutputPartitioner(configuration, "Murmur3Partitioner")

// Wrap the input format as a Spark RDD of (row key, sorted map of columns)
val casRdd = sc.newAPIHadoopRDD(configuration, classOf[ColumnFamilyInputFormat],
  classOf[ByteBuffer], classOf[util.SortedMap[ByteBuffer, IColumn]])
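The raw RDD above yields Thrift-level (row key, column map) pairs. A possible next step - a sketch, not taken verbatim from the deck - is to decode them into plain Scala values and cache the result, which is presumably how the cachedRDD used on the next slide comes about (UTF-8 row keys and column names are an assumption here):

import scala.collection.JavaConversions._       // view the Java SortedMap as a Scala collection
import org.apache.cassandra.utils.ByteBufferUtil

val cachedRDD = casRdd.map { case (key, columns) =>
  val rowKey = ByteBufferUtil.string(key)       // assumes UTF-8 row keys
  val cells = columns.map { case (name, column) =>
    ByteBufferUtil.string(name) -> column.value()
  }.toMap
  (rowKey, cells)
}.cache()                                       // keep decoded rows in memory for repeated queries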

Page 8: Using Spark over Cassandra

Mapp’in & reduc’in with Spark

// Flatten the cached Cassandra rows to (entity, scores) for the requested date range
val flatRdd = createFlatRDD(cachedRDD, startDate, endDate, profileId, statusInTarget)

// Group each entity's scores by score name (i.e. by grid column)
val withGroupByScores = flatRdd.map {
  case (entity, performance) =>
    val scores = performance.groupBy(score => score.name)
    (entity, scores)
}

// Reduce each group to a single aggregated Score per column
val withAggrScores = withGroupByScores.map {
  case (entity, scores) =>
    val aggrScores = scores.map {
      case (column, columnScores) =>
        val aggregation = columnScores.reduce[Score](
          (left, right) => Score(left.name, left.value + right.value))
        (column, aggregation)
    }
    (entity, aggrScores)
}
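One way the grid might consume this (a sketch under assumptions: Score.value is a Double, "clicks" is a hypothetical column name, and the pair-RDD implicits from org.apache.spark.SparkContext._ are in scope) is to sort by a chosen column and pull the first page of rows back to the app server:

// Rank entities by one aggregated score and fetch the first grid page
val sortColumn = "clicks"                      // hypothetical score/column name
val topRows = withAggrScores
  .map { case (entity, aggrScores) =>
    (aggrScores.get(sortColumn).map(_.value).getOrElse(0.0), entity)
  }
  .sortByKey(ascending = false)                // highest score first
  .take(50)                                    // only the rows the grid displays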

Page 9: Using Spark over Cassandra

Reading RAM is suddenly a hot-spot...

def createByteArray(date: String, column: Column, value: ByteBuffer): Array[Byte] = {
  val daysFromEpoch = calcDaysFromEpoch(date)
  val columnOrdinal = column.id
  // Pack (day, column ordinal, raw value) into a single primitive byte array:
  // no per-cell objects, so scanning millions of cells stays GC-friendly
  val buffer = ByteBuffer.allocate(4 + 4 + value.remaining())
  buffer.putInt(daysFromEpoch)
  buffer.putInt(columnOrdinal)
  buffer.put(value)
  buffer.array()
}
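For completeness, a sketch of the matching read path (an assumption, not shown in the deck): wrap the packed array and pull the fields back out in the order createByteArray wrote them, again without allocating per-cell objects.

def readPackedCell(bytes: Array[Byte]): (Int, Int, ByteBuffer) = {
  val buffer = ByteBuffer.wrap(bytes)
  val daysFromEpoch = buffer.getInt()  // written first by createByteArray
  val columnOrdinal = buffer.getInt()  // written second
  val value = buffer.slice()           // remaining bytes: the raw column value
  (daysFromEpoch, columnOrdinal, value)
}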

Page 10: Using Spark over Cassandra

● For this demo: EC2 cluster of a master and 2 slave nodes.
● Each slave with: 240 GB memory, 32 cores, SSD drives, 10 Gb network
● Data size: 100 GB
● Cassandra 2.1
● Spark 0.8.0
● Rule of thumb for cost estimation: ~$25K / TB of data.
You’ll probably need x2 memory, as RDDs are immutable.

White Hat* - facts

* Colored hats metaphor taken from de Bono’s “Six Thinking Hats”

Page 11: Using Spark over Cassandra

Yellow Hat - optimism

● Full slice ‘n dice over all data with acceptable latency for online (< 5 seconds)

● Additional aggregations at no extra performance cost
● Ease of setup (but as always, be prepared for some tinkering)
● Single source of truth
● Horizontal scale
● MapReduce capabilities for machine learning algorithms
● Enables merging recent data with old data (what Nathan Marz coined the “lambda architecture”)

Page 12: Using Spark over Cassandra

Black Hat - concerns

● System stability
● Changing API
● Limited ecosystem
● Scala-based code - learning curve
● Maintainability: optimal speed means a low level of abstraction
● Data duplication, especially in transit
● Master node is a single point of failure
● Scheduling

Page 13: Using Spark over Cassandra

Green Hat - alternatives

● Alternatives to Spark:
○ Cloudera’s Impala (commercial product)
○ Presto (recently open-sourced out of Facebook)
○ Trident/Storm (for stream processing)

Page 14: Using Spark over Cassandra

Red Hat - emotions, intuitions

Spark’s technology is nothing short of astonishing: yesterday’s “impossible!” is today’s “merely difficult...”

Page 15: Using Spark over Cassandra


THANK YOU (AND YES, WE ARE HIRING!)

[email protected]