Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Cassandra

FEELIN' THE FLOW: GETTING YOUR DATA MOVING WITH SPARK AND CASSANDRA. Presented by Rich Beaudoin (@RichGBeaudoin), October 14th, 2014

Description

Speaker: Rich Beaudoin, Senior Software Engineer at Pearson eCollege. In the world of Big Data, it's crucial that your data is accessible. Cassandra provides us with a means to reliably store our data, but how can we keep it flowing? That's where Spark steps up to provide a powerful one-two punch with Cassandra to get your data flowing in all the right directions.

Transcript of Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Cassandra

Page 1: Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Cassandra

FEELIN' THE FLOW

GETTING YOUR DATA MOVING WITH SPARK AND CASSANDRA

Presented by Rich Beaudoin (@RichGBeaudoin), October 14th, 2014

Page 2:

ABOUT ME...
Sr. Software Engineer at Pearson
Organizer of Distributed Computing Denver
Lover of Music
All around solid dude

Page 3:

OVERVIEW
What is Spark
  The problem it solves
  The core concepts
Spark integration with Cassandra
  Tables as RDDs
  Writing RDDs to Cassandra
Questions and Summary

Page 4:

WHAT IS SPARK?
Apache Spark™ is a fast and general engine for large-scale data processing.

Created by AMPLab at UC Berkeley
Became an Apache Top-Level Project in 2014
Supports Scala, Java, and Python APIs

Page 5:

THE PROBLEM, PART ONE...

Approaches like MapReduce read from, and store to, HDFS...

...so each cycle of processing incurs latency from HDFS reads

Page 6:

THE PROBLEM, PART TWO...

Any robust, distributed data processing framework needs fault tolerance

But existing solutions allow for "fine-grained" (cell-level) updates, which can complicate the handling of faults where data needs to be rebuilt/recalculated

Page 7:

SPARK ATTEMPTS TO ADDRESS THESE TWO PROBLEMS

Solution 1: store intermediate results in memory

Solution 2: introduce a new, expressive data abstraction

Page 8:

RDD
A Resilient Distributed Dataset (RDD) is an immutable, partitioned collection of records that supports basic operations (e.g. map, filter, join). It maintains a graph of transformations in order to enable recovery of a lost partition.

*See the RDD white paper for more details

Page 9:

TRANSFORMATIONS AND ACTIONS

A "transformation" creates another RDD and is evaluated lazily

An "action" returns a value and is evaluated immediately
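The lazy/eager split can be sketched with plain Scala collections, no cluster required. This is a loose analogy, not the Spark API: a lazy view plays the role of a transformation, and forcing it plays the role of an action.

```scala
// Loose plain-Scala analogy (not the Spark API): a lazy view mimics a
// transformation; forcing it mimics an action.
var evaluations = 0
val data = (1 to 5).view               // a description of a dataset
val doubled = data.map { x =>          // "transformation": composed lazily
  evaluations += 1
  x * 2
}
assert(evaluations == 0)               // nothing has been computed yet
val total = doubled.sum                // "action": forces evaluation
assert(evaluations == 5)
assert(total == 30)
```

In Spark the same split is what lets the scheduler see the whole transformation graph before any work runs.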

Page 10:

RDDS ARE EXPRESSIVE
It turns out that coarse-grained operations cover many existing parallel computing cases

Consequently, the RDD abstraction can implement existing systems like MapReduce, Pregel, Dryad, etc.

Page 11:

SPARK CLUSTER OVERVIEW

Spark can be run with Apache Mesos, Hadoop YARN, or its own standalone cluster manager

Page 12:

JOB SCHEDULING AND STAGES

Page 13:

SPARK AND CASSANDRA
If we can turn Cassandra data into RDDs, and RDDs into Cassandra data, then the data can start flowing between the two systems and give us some insight into our data.

The Spark Cassandra Connector allows us to perform the transformation from Cassandra table to RDD and then back again!

Page 14:

THE SETUP

Page 15:

FROM CASSANDRA TABLE TO RDD

import org.apache.spark._
import com.datastax.spark.connector._

val rdd = sc.cassandraTable("music", "albums_by_artist")

Run these commands in spark-shell, which requires specifying the spark-cassandra-connector jar on the command line
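One way that launch can look (a sketch only: the jar path, version, and Cassandra host below are placeholders, and the exact flags depend on your Spark version):

```shell
# Start spark-shell with the connector jar on the classpath.
# The jar path and host here are illustrative, not prescriptive.
spark-shell \
  --jars /path/to/spark-cassandra-connector-assembly.jar \
  --conf spark.cassandra.connection.host=127.0.0.1
```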

Page 16:

SIMPLE MAPREDUCE FOR RDD COLUMN COUNT

val count = rdd.map(x => (x.get[String]("label"), 1)).reduceByKey(_ + _)
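The shape of that map/reduceByKey pair can be sketched with plain Scala collections (the label values are hypothetical; on a real cluster, reduceByKey also combines per partition before shuffling):

```scala
// Plain-Scala sketch of the same label count: pair each label with 1,
// then sum the 1s per key -- the shape of map(...).reduceByKey(_ + _).
val labels = Seq("Columbia", "Atlantic", "Columbia")
val counts = labels
  .map(l => (l, 1))                                       // like the map step
  .groupBy(_._1)                                          // gather by key
  .map { case (label, ones) => label -> ones.map(_._2).sum } // like reduceByKey
assert(counts == Map("Columbia" -> 2, "Atlantic" -> 1))
```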

Page 17:

SAVE THE RDD TO CASSANDRA

count.saveToCassandra("music", "label_count", SomeColumns("label", "count"))

Page 18:

CASSANDRA WITH SPARK SQL

import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
val rdd = cc.sql("SELECT * from music.label_count")

Page 19:

JOINS!!!

import sqlContext.createSchemaRDD
import org.apache.spark.sql._

case class LabelCount(label: String, count: Int)
case class AlbumArtist(artist: String, album: String, label: String, year: Int)
case class AlbumArtistCount(artist: String, album: String, label: String, year: Int, count: Int)

val albumArtists = sc.cassandraTable[AlbumArtist]("music", "albums_by_artists").cache
val labelCounts = sc.cassandraTable[LabelCount]("music", "label_count").cache

val albumsByLabelId = albumArtists.keyBy(x => x.label)
val countsByLabelId = labelCounts.keyBy(x => x.label)

val joinedAlbums = albumsByLabelId.join(countsByLabelId).cache
val albumArtistCountObjects = joinedAlbums.map(x =>
  new AlbumArtistCount(x._2._1.artist, x._2._1.album, x._2._1.label, x._2._1.year, x._2._2.count))
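What keyBy and join are doing can be sketched with plain Scala collections (hypothetical rows, no cluster; RDD.join additionally shuffles matching keys to the same partition):

```scala
// Plain-Scala sketch of keyBy + join: keyBy tags each element with a
// key; join pairs up the values that share a key.
case class AlbumArtist(artist: String, album: String, label: String)
case class LabelCount(label: String, count: Int)

val albums = Seq(
  AlbumArtist("Miles Davis", "Kind of Blue", "Columbia"),
  AlbumArtist("Bob Dylan", "Highway 61 Revisited", "Columbia"))
val labelCounts = Seq(LabelCount("Columbia", 2))

val albumsByLabel = albums.map(a => a.label -> a)        // like rdd.keyBy(_.label)
val countsByLabel = labelCounts.map(c => c.label -> c).toMap

// Inner join: keep only albums whose label also appears in the counts.
val joined = albumsByLabel.flatMap { case (label, a) =>
  countsByLabel.get(label).map(c => label -> (a, c))
}
assert(joined.size == 2)
assert(joined.forall { case (_, (_, c)) => c.count == 2 })
```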

Page 20:

OTHER THINGS TO CHECK OUT

Spark Streaming
Spark SQL

Page 21:

QUESTIONS?

Page 22:

THE END

References
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Spark Programming Guide
Apache Spark Website
Datastax Spark Cassandra Connector Documentation