Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

The New Analytics Toolbox: Going beyond Hadoop

description

The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. But there are serious advantages to many of the new tools, and this presentation will give an analysis of the current state, including pros and cons as well as what’s needed to bootstrap and operate the various options.

About Robbie Strickland, Software Development Manager at The Weather Channel

Robbie works for The Weather Channel’s digital division as part of the team that builds backend services for weather.com and the TWC mobile apps. He has been involved in the Cassandra project since 2010 and has contributed in a variety of ways over the years; this includes work on drivers for Scala and C#, the Hadoop integration, heading up the Atlanta Cassandra Users Group, and answering lots of Stack Overflow questions.

Transcript of Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

Page 1: Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

The New Analytics Toolbox

Going beyond Hadoop

Page 2:

whoami

Robbie Strickland
Software Dev Manager
linkedin.com/in/[email protected]

Page 10:

Hadoop works fine, right?

● It’s 11 years old (!!)
● It’s relatively slow and inefficient
● … because everything gets written to disk
● Writing MapReduce code sucks
● Lots of code for even simple tasks
● Boilerplate
● Configuration isn’t for the faint of heart

Page 15:

Landscape has changed ...

Lots of new tools available:
● Cloudera Impala: MPP queries for Hadoop data
● Apache Drill: MPP queries for Hadoop data
● Proprietary (Splunk, keen.io, etc.): varies by vendor
● Spark / Spark SQL: generic in-memory analysis
● Shark: Hive queries on Spark, replaced by Spark SQL

Page 22:

Spark wins?

● Can be a Hadoop replacement
● Works with any existing InputFormat
● Doesn’t require Hadoop
● Supports batch & streaming analysis
● Functional programming model
● Direct Cassandra integration

Page 23:

Spark vs. Hadoop

Page 29:

MapReduce:

Page 30:

Spark (angels sing!):

Page 37:

What is Spark?

● In-memory cluster computing
● 10-100x faster than MapReduce
● Collection API over large datasets
● Scala, Python, Java
● Stream processing
● Supports any existing Hadoop input / output format

Page 42:

What is Spark?

● Native graph processing via GraphX
● Native machine learning via MLlib
● SQL queries via SparkSQL
● Works out of the box on EMR
● Easily join datasets from disparate sources

Page 48:

Spark on Cassandra

● Direct integration via DataStax driver (cassandra-driver-spark on GitHub)
● No job config cruft
● Supports server-side filters (where clauses)
● Data locality aware
● Uses HDFS, CassandraFS, or other distributed FS for checkpointing

Page 49:

Cassandra with Hadoop

Worker nodes (x3): Cassandra, DataNode, TaskTracker
Master node: NameNode, Secondary NameNode, JobTracker

Page 50:

Cassandra with Spark (using HDFS)

Worker nodes (x3): Cassandra, Spark Worker, DataNode
Master node: Spark Master, NameNode, Secondary NameNode

Page 55:

A Typical Spark Application

● SparkContext + SparkConf
● Data Source to RDD[T]
● Transformations/Actions
● Saving/Displaying
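
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the talk: it assumes Spark's Scala API (spark-core) is on the classpath and uses a local master, and the `Reading` type, the `citiesAbove` helper, and the sample data are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TypicalApp {
  // Hypothetical record type, used only for this illustration.
  case class Reading(city: String, tempF: Double)

  // Steps 2-4: data source -> RDD[T] -> transformations -> action.
  def citiesAbove(sc: SparkContext, readings: Seq[Reading], thresholdF: Double): Seq[String] = {
    // Data source to RDD[Reading]; a real job would read from Cassandra, HDFS, etc.
    val rdd = sc.parallelize(readings)
    // Transformations are lazy: nothing runs until the action below.
    val cities = rdd.filter(_.tempF > thresholdF).map(_.city)
    // Action: bring the results back to the driver.
    cities.collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    // Step 1: SparkConf + SparkContext. "local[*]" runs Spark inside this JVM.
    val conf = new SparkConf().setAppName("typical-app").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sample = Seq(Reading("Atlanta", 91.0), Reading("Boston", 68.5), Reading("Miami", 95.2))
      // Displaying: print the result on the driver.
      citiesAbove(sc, sample, 90.0).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

On a real cluster the master URL would normally come from the launcher rather than being hard-coded, and the final step would more likely save to a table or file than print.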

Page 59:

Resilient Distributed Dataset (RDD)

● A distributed collection of items
● Transformations
  ○ Similar to those found in Scala collections
  ○ Lazily processed
● Can recalculate from any point of failure

Page 60:

RDD Transformations vs Actions

Transformations: Produce new RDDs

Actions: Require the materialization of the records to produce a value

Page 61:

RDD Transformations/Actions

Transformations: filter, map, flatMap, collect(λ):RDD[T], distinct, groupBy, subtract, union, zip, reduceByKey ...

Actions: collect:Array[T], count, fold, reduce ...

Page 62:

Resilient Distributed Dataset (RDD)

val numOfAdults = persons.filter(_.age > 17).count()

Transformation: filter(_.age > 17)
Action: count()
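
To run the one-liner above, `persons` has to be an RDD of something with an `age` field. The sketch below fills that in with a hypothetical `Person` case class and made-up sample data; it assumes spark-core on the classpath and a local master, and splits the expression so the transformation and the action are visible separately.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AdultCount {
  // Hypothetical element type; the slide does not show how `persons` is defined.
  case class Person(name: String, age: Int)

  def numOfAdults(persons: RDD[Person]): Long = {
    // Transformation: returns a new RDD and runs nothing yet.
    val adults: RDD[Person] = persons.filter(_.age > 17)
    // Action: triggers the job and returns a Long to the driver.
    adults.count()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("adult-count").setMaster("local[*]"))
    try {
      val persons = sc.parallelize(Seq(Person("Ann", 34), Person("Bo", 12), Person("Cy", 18)))
      println(numOfAdults(persons))
    } finally {
      sc.stop()
    }
  }
}
```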

Page 63:

Example

Page 64:

Demo

Page 65:

Spark SQL Example

Page 69:

Spark Streaming

● Creates RDDs from stream source on a defined interval
● Same ops as “normal” RDDs
● Supports a variety of sources
  ○ Queues
  ○ Twitter
  ○ Flume
  ○ etc.
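
As a rough sketch of the points above, the following uses a queue of RDDs as the stream source, one of the source types listed. It is an illustration under assumptions rather than the demo from the talk: it needs spark-streaming on the classpath (Spark 1.3 or later for `awaitTerminationOrTimeout`), runs on a local master, and the `countWords` helper and its inputs are hypothetical. Results depend on the batches finishing within the wait time.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object QueueStreamSketch {
  // Feeds each inner Seq in as one batch and returns word totals seen within waitMs.
  def countWords(batches: Seq[Seq[String]], waitMs: Long): Map[String, Int] = {
    val conf = new SparkConf().setAppName("queue-stream-sketch").setMaster("local[2]")
    // Defined interval: one RDD is created per 1-second batch.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Stream source: a queue of RDDs, consumed one per interval by default.
    val queue = mutable.Queue[RDD[String]]()
    batches.foreach(b => queue += ssc.sparkContext.parallelize(b))

    val totals = mutable.Map.empty[String, Int].withDefaultValue(0)
    ssc.queueStream(queue)
      .map(word => (word, 1))   // same operations as on a "normal" RDD
      .reduceByKey(_ + _)
      .foreachRDD { rdd =>
        // Runs on the driver once per batch.
        rdd.collect().foreach { case (w, n) => totals.synchronized { totals(w) += n } }
      }

    ssc.start()
    ssc.awaitTerminationOrTimeout(waitMs)
    ssc.stop() // also stops the underlying SparkContext
    totals.synchronized(totals.toMap)
  }

  def main(args: Array[String]): Unit =
    println(countWords(Seq(Seq("rain", "sun", "rain"), Seq("sun")), 5000))
}
```

A production job would read from a real receiver (a message queue, Flume, and so on) and would usually write each batch out rather than accumulate it on the driver.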

Page 70:

Spark Streaming

Page 71:

The Output

Page 73:

Real World Task Distribution

Transformations and Actions:

Similar to Scala but …

Your choice of transformations needs to be made with task distribution and memory in mind

Page 74:

Real World Task Distribution

How do we do that?

● Partition the data appropriately for number of cores (at least 2 x number of cores)
● Filter early and often (Queries, Filters, Distinct...)
● Use pipelines
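
One way to read these three tips together is sketched below. It assumes spark-core on the classpath; the `stationsAbove` helper, the `(station, tempF)` pairs, and the `cores` parameter are hypothetical, and the 2 x cores figure is simply the rule of thumb from the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EarlyFilter {
  // readings are hypothetical (station, tempF) pairs.
  def stationsAbove(sc: SparkContext, readings: Seq[(String, Double)],
                    cores: Int, thresholdF: Double): Seq[String] = {
    // Partitioning: ask for at least 2 x cores partitions so every core stays busy.
    val rdd = sc.parallelize(readings, numSlices = 2 * cores)
    rdd
      .filter { case (_, tempF) => tempF > thresholdF } // filter early, before any shuffle
      .map { case (station, _) => station }             // keep only the field still needed
      // filter and map are narrow transformations, so Spark pipelines them in one stage.
      .distinct()                                       // shuffle now runs on the reduced data
      .collect()
      .toSeq
      .sorted
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("early-filter").setMaster("local[*]"))
    try {
      val sample = Seq(("ATL", 91.0), ("BOS", 68.5), ("ATL", 95.2), ("MIA", 93.0))
      stationsAbove(sc, sample, cores = 4, thresholdF = 90.0).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```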

Page 75:

Real World Task Distribution

How do we do that?

● Partial Aggregation
● Create an algorithm to be as simple/efficient as it can be to appropriately answer the question

Page 77:

Real World Task Distribution

Some Common Costly Transformations:

● sorting
● groupByKey
● reduceByKey
● ...
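
All of the transformations listed are costly because they shuffle data between nodes. The sketch below compares two ways of summing values per key, as a hedged illustration of why partial aggregation helps. It assumes spark-core on the classpath; the object and method names and the sample pairs are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ShuffleCost {
  // groupByKey ships every (key, value) pair across the network, then sums.
  def sumWithGroupByKey(pairs: RDD[(String, Int)]): Map[String, Int] =
    pairs.groupByKey().mapValues(_.sum).collect().toMap

  // reduceByKey still shuffles, but first combines values inside each partition
  // (partial aggregation), so less data crosses the network.
  def sumWithReduceByKey(pairs: RDD[(String, Int)]): Map[String, Int] =
    pairs.reduceByKey(_ + _).collect().toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-cost").setMaster("local[*]"))
    try {
      val pairs = sc.parallelize(Seq(("rain", 1), ("sun", 1), ("rain", 1), ("rain", 1)))
      // Both produce the same totals; they differ in how much data is shuffled.
      println(sumWithGroupByKey(pairs))
      println(sumWithReduceByKey(pairs))
    } finally {
      sc.stop()
    }
  }
}
```

On Spark versions before 1.3, pair-RDD operations such as `reduceByKey` also need `import org.apache.spark.SparkContext._` in scope.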

Page 78:

Thank you!

Robbie Strickland
[email protected]
github.com/rstrickland/sparkdemo