Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Cassandra

FEELIN' THE FLOW: GETTING YOUR DATA MOVING WITH SPARK AND CASSANDRA. Presented by Rich Beaudoin (@RichGBeaudoin), October 14th, 2014

Description

Speaker: Rich Beaudoin, Senior Software Engineer at Pearson eCollege. In the world of Big Data, it's crucial that your data is accessible. Cassandra provides us with a means to reliably store our data, but how can we keep it flowing? That's where Spark steps up to provide a powerful one-two punch with Cassandra to get your data flowing in all the right directions.

Transcript of Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Cassandra

Page 1: Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Cassandra

FEELIN' THE FLOW

GETTING YOUR DATA MOVING WITH SPARK AND CASSANDRA

Presented by Rich Beaudoin (@RichGBeaudoin), October 14th, 2014

Page 2:

ABOUT ME...
Sr. Software Engineer at Pearson
Organizer of Distributed Computing Denver
Lover of Music
All around solid dude

Page 3:

OVERVIEW
What is Spark
  The problem it solves
  The core concepts
Spark integration with Cassandra
  Tables as RDDs
  Writing RDDs to Cassandra
Questions and Summary

Page 4:

WHAT IS SPARK?
Apache Spark™ is a fast and general engine for large-scale data processing.

Created by AMPLab at UC Berkeley
Became an Apache Top-Level Project in 2014
Supports Scala, Java, and Python APIs

Page 5:

THE PROBLEM, PART ONE...

Approaches like MapReduce read from, and store to, HDFS...

...so each cycle of processing incurs latency from HDFS reads

Page 6:

THE PROBLEM, PART TWO...

Any robust, distributed data processing framework needs fault tolerance

But existing solutions allow for "fine-grained" (cell-level) updates, which can complicate the handling of faults where data needs to be rebuilt/recalculated

Page 7:

SPARK ATTEMPTS TO ADDRESS THESE TWO PROBLEMS

Solution 1: store intermediate results in memory

Solution 2: introduce a new, expressive data abstraction

Page 8:

RDD
A Resilient Distributed Dataset (RDD) is an immutable, partitioned collection of records that supports basic operations (e.g. map, filter, join). It maintains a graph of transformations in order to enable recovery of a lost partition.

*See the RDD white paper for more details

Page 9:

TRANSFORMATIONS AND ACTIONS

A "transformation" creates another RDD and is evaluated lazily

An "action" returns a value and is evaluated immediately
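The lazy/eager split can be sketched with plain Scala collections, no cluster required. This is a loose analogy, not the Spark API: a lazy view plays the role of a transformation, and forcing it plays the role of an action.

```scala
// Loose plain-Scala analogy (not the Spark API): a lazy view mimics a
// transformation; forcing it mimics an action.
var evaluations = 0
val data = (1 to 5).view               // a description of a dataset
val doubled = data.map { x =>          // "transformation": composed lazily
  evaluations += 1
  x * 2
}
assert(evaluations == 0)               // nothing has been computed yet
val total = doubled.sum                // "action": forces evaluation
assert(evaluations == 5)
assert(total == 30)
```

In Spark the same split is what lets the scheduler see the whole transformation graph before any work runs.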

Page 10:

RDDS ARE EXPRESSIVE
It turns out that coarse-grained operations cover many existing parallel computing cases

Consequently, the RDD abstraction can implement existing systems like MapReduce, Pregel, Dryad, etc.

Page 11:

SPARK CLUSTER OVERVIEW

Spark can be run with Apache Mesos, Hadoop YARN, or its own standalone cluster manager

Page 12:

JOB SCHEDULING AND STAGES

Page 13:

SPARK AND CASSANDRA
If we can turn Cassandra data into RDDs, and RDDs into Cassandra data, then the data can start flowing between the two systems and give us some insight into our data.

The Spark Cassandra Connector allows us to perform the transformation from Cassandra table to RDD and then back again!

Page 14:

THE SETUP

Page 15:

FROM CASSANDRA TABLE TO RDD

import org.apache.spark._
import com.datastax.spark.connector._

val rdd = sc.cassandraTable("music", "albums_by_artist")

Run these commands in spark-shell, which requires specifying the spark-cassandra-connector jar on the command line
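One way that launch can look (a sketch only: the jar path, version, and Cassandra host below are placeholders, and the exact flags depend on your Spark version):

```shell
# Start spark-shell with the connector jar on the classpath.
# The jar path and host here are illustrative, not prescriptive.
spark-shell \
  --jars /path/to/spark-cassandra-connector-assembly.jar \
  --conf spark.cassandra.connection.host=127.0.0.1
```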

Page 16:

SIMPLE MAPREDUCE FOR RDD COLUMN COUNT

val count = rdd.map(x => (x.get[String]("label"), 1)).reduceByKey(_ + _)
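The shape of that map/reduceByKey pair can be sketched with plain Scala collections (the label values are hypothetical; on a real cluster, reduceByKey also combines per partition before shuffling):

```scala
// Plain-Scala sketch of the same label count: pair each label with 1,
// then sum the 1s per key -- the shape of map(...).reduceByKey(_ + _).
val labels = Seq("Columbia", "Atlantic", "Columbia")
val counts = labels
  .map(l => (l, 1))                                       // like the map step
  .groupBy(_._1)                                          // gather by key
  .map { case (label, ones) => label -> ones.map(_._2).sum } // like reduceByKey
assert(counts == Map("Columbia" -> 2, "Atlantic" -> 1))
```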

Page 17:

SAVE THE RDD TO CASSANDRA

count.saveToCassandra("music", "label_count", SomeColumns("label", "count"))

Page 18:

CASSANDRA WITH SPARK SQL

import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
val rdd = cc.sql("SELECT * from music.label_count")

Page 19:

JOINS!!!

import sqlContext.createSchemaRDD
import org.apache.spark.sql._

case class LabelCount(label: String, count: Int)
case class AlbumArtist(artist: String, album: String, label: String, year: Int)
case class AlbumArtistCount(artist: String, album: String, label: String, year: Int, count: Int)

val albumArtists = sc.cassandraTable[AlbumArtist]("music", "albums_by_artists").cache
val labelCounts = sc.cassandraTable[LabelCount]("music", "label_count").cache

val albumsByLabelId = albumArtists.keyBy(x => x.label)
val countsByLabelId = labelCounts.keyBy(x => x.label)

val joinedAlbums = albumsByLabelId.join(countsByLabelId).cache
val albumArtistCountObjects = joinedAlbums.map(x =>
  new AlbumArtistCount(x._2._1.artist, x._2._1.album, x._2._1.label, x._2._1.year, x._2._2.count))
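What keyBy and join are doing can be sketched with plain Scala collections (hypothetical rows, no cluster; RDD.join additionally shuffles matching keys to the same partition):

```scala
// Plain-Scala sketch of keyBy + join: keyBy tags each element with a
// key; join pairs up the values that share a key.
case class AlbumArtist(artist: String, album: String, label: String)
case class LabelCount(label: String, count: Int)

val albums = Seq(
  AlbumArtist("Miles Davis", "Kind of Blue", "Columbia"),
  AlbumArtist("Bob Dylan", "Highway 61 Revisited", "Columbia"))
val labelCounts = Seq(LabelCount("Columbia", 2))

val albumsByLabel = albums.map(a => a.label -> a)        // like rdd.keyBy(_.label)
val countsByLabel = labelCounts.map(c => c.label -> c).toMap

// Inner join: keep only albums whose label also appears in the counts.
val joined = albumsByLabel.flatMap { case (label, a) =>
  countsByLabel.get(label).map(c => label -> (a, c))
}
assert(joined.size == 2)
assert(joined.forall { case (_, (_, c)) => c.count == 2 })
```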

Page 20:

OTHER THINGS TO CHECK OUT

Spark Streaming
Spark SQL

Page 21:

QUESTIONS?

Page 22:

THE END

References
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Spark Programming Guide
Apache Spark Website
Datastax Spark Cassandra Connector Documentation