Big Data Analytics with Spark


Sorry for the Delay

• There were some technical difficulties, so we are giving folks a few more minutes to join

• Again, sorry for the delay

© 2014 DataStax, All Rights Reserved.

Webinar Housekeeping

• All attendees placed on mute

• Input questions at any time using the online interface

Big Data Analytics with Cassandra and Spark

Brian Hess

Sr. Product Manager for Analytics

DataStax


Willie Sutton

• Bank Robber in the 1930s-1950s
• FBI Most Wanted List 1950
• Captured in 1952


Willie Sutton

When asked “Why do you rob banks?”

“Because that’s where the money is.”


Motivating Use Case

Internet of Things


[Diagram: "Your System", shown over three panels; in the last panel it is marked FAULT]


Cassandra

Spark

Spark + Cassandra


Apache Cassandra

• Distributed NoSQL database
  – BigTable meets Dynamo

• All nodes are equal
  – Always on
  – Linear scale out - a lot
    • More data
    • More transactions

• Multi-Datacenter
  – Geographic or Workload

• Cassandra Query Language
  – SQL-like


[Diagram: cluster throughput labelled 100,000 txns/sec, 200,000 txns/sec, and 400,000 txns/sec]


How Cassandra Works – Writes


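[Diagram: the message "It’s 72°" appears twice as the write moves through the cluster, followed by the acknowledgement "Done"]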


Tunable Consistency

• Relax the Consistency in ACID
  – Isn’t always needed, and isn’t guaranteed anyway (in distributed DBs)
  – Reads may not get the most up-to-date data, but almost always will

• All data is replicated
  – Set in the schema
  – Distributed to nodes by Token Range

• Options:
  – QUORUM, ONE, ALL

• Can ensure reads get most up-to-date value
  – E.g., read/write at QUORUM
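The "read/write at QUORUM" point can be checked with a little replica arithmetic. The sketch below is plain Scala, not a Cassandra API: `ConsistencySketch`, `required`, and `readsSeeLatestWrite` are hypothetical names used only for illustration. It assumes the usual rule that a read is certain to reach a replica holding the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor.

```scala
// Sketch of the replica arithmetic behind tunable consistency.
// All names here are illustrative; none of this is part of a Cassandra driver.
object ConsistencySketch {
  sealed trait Level
  case object One extends Level
  case object Quorum extends Level
  case object All extends Level

  // Number of replicas that must acknowledge an operation at a given level.
  def required(level: Level, replicationFactor: Int): Int = level match {
    case One    => 1
    case Quorum => replicationFactor / 2 + 1 // a strict majority
    case All    => replicationFactor
  }

  // True when the replicas read and the replicas written must overlap,
  // i.e. reads + writes > replication factor.
  def readsSeeLatestWrite(write: Level, read: Level, replicationFactor: Int): Boolean =
    required(write, replicationFactor) + required(read, replicationFactor) > replicationFactor
}
```

With a replication factor of 3, QUORUM means 2 replicas, so QUORUM writes plus QUORUM reads always overlap, while ONE plus ONE does not.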



How Cassandra Works – Tunable Consistency


You got it. I’ll make sure everyone gets it.

You got it. A majority got it. The rest will.

You got it. One guy got it. The rest will.

You got it. Everyone has it.


How Cassandra Works – Query


SELECT user_id
FROM users
WHERE name = 'PBCupFan';




Sure Thing, Let me get that for you.





What do you guys have for PBCup?





Here’s what I have:

Here’s what I have:





Let me resolve any conflicts





Here ya go!

 user_id
---------
    1234

(1 rows)


Cassandra for Internet of Things

It’s all about scaling



Cassandra

• Always On
  – No down time

• Linear Scalability
  – For writes or reads
  – For data size

• Terrific choice for Internet of Things, Web, Mobile, etc.
  – British Gas, Nike, etc.
  – Thermostats, Manufacturing, Oil/Gas, etc.

It’s where the data is!


Cassandra Limitations

• No aggregations
  – Optimized for lookups & writes
  – No GROUP BYs
  – No Windowed Aggregates

• No Joins
  – Model the data to avoid them

• Must select by partition key
  – There are secondary indexes
    • But they are an antipattern

• Not optimized for full-table scans


It actually can’t do everything


Apache Spark

• Distributed computing framework
• Generalized DAG execution
• Easy Abstraction for Datasets
• Integrated SQL Queries
• Streaming
• Machine Learning Library


All In One Package!


Spark Components


[Diagram: Spark SQL, Spark Streaming, MLlib, GraphX, and Spark R on top of the Spark Core Engine]


Spark Components


[Diagram: TRANSFORMS, SQL, Streaming, Machine Learning, Graph Analytics, R]

Spark Provides a Simple and Efficient Framework for Distributed Computations

In Memory Caching: Yes!

Generic DAG Execution: Yes!

Great Abstraction for Datasets? DataFrame! (previously Resilient Distributed Dataset (RDD))

[Diagram: a Spark Master and three Spark Workers; a Spark Executor; Spark Partitions of a DataFrame (or RDD)]

Spark Provides a Simple and Efficient framework for Distributed Computations

Spark Master: Assigns cluster resources to applications

Spark Worker: Manages executors running on a machine

Spark Executor: Started by Worker - Workhorse of the Spark application


RDDs Can be Generated from a Variety of Sources

Text files

Parallelized Collections


Spark on Cassandra


[Diagram: Spark SQL, Spark Streaming, MLlib, GraphX, and Spark R on top of the Spark Core Engine, with the DataStax Spark-Cassandra Connector between the engine and Cassandra]

Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to Cassandra

Each Executor Maintains a connection to the C* Cluster

RDDs are read into different splits based on sets of tokens

[Diagram: a Spark Executor uses the DataStax Java Driver to read the full C* token range as splits: Tokens 1-1000, Tokens 1001-2000, Tokens …]
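The idea of cutting a full token range into splits can be sketched in plain Scala using the slide's toy numbers. This is an illustration only, not how the connector actually sizes its splits; `TokenSplitSketch`, `TokenRange`, and `split` are hypothetical names.

```scala
// Illustrative only: divides an inclusive token range into contiguous splits.
// The real connector's split logic is not shown in the slides and may differ.
object TokenSplitSketch {
  // Inclusive range of tokens.
  final case class TokenRange(start: Long, end: Long)

  // Cut [start, end] into contiguous ranges holding at most `size` tokens each.
  def split(full: TokenRange, size: Long): Seq[TokenRange] = {
    require(size > 0, "size must be positive")
    (full.start to full.end by size).map { s =>
      TokenRange(s, math.min(s + size - 1, full.end))
    }
  }
}
```

For example, `TokenSplitSketch.split(TokenSplitSketch.TokenRange(1, 3000), 1000)` yields the ranges 1-1000, 1001-2000, and 2001-3000, each of which could back one partition of the RDD.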


Co-locate Spark and C* for Best Performance

• Run Cassandra and Spark on same nodes

• Local reads/writes

• Increased performance



Things you can’t do in Cassandra – Using SparkSQL

• JOINs

// csc is assumed to be a CassandraSQLContext, as in the later non-partition-key example
csc.sql("""SELECT t.sensor_id, t.temp, m.location
           FROM ks.temperatures t JOIN ks.metadata m
           ON t.sensor_id = m.sensor_id
           WHERE t.sensor_id = 12345""")

• Aggregates

csc.sql("""SELECT sensor_id, year, month, MAX(temp) mtemp
           FROM ks.temperatures
           GROUP BY sensor_id, year, month""")



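Things you can’t do in Cassandra – External Data

• JOIN with HDFS data

import com.datastax.spark.connector._  // adds cassandraTable to the SparkContext

// (sensor_id, year, month) -> temperature, read from a CSV file on HDFS
val temp2014 = sc.textFile("webhdfs://myhadoop/data/temp2014.csv")
  .map(x => x.split(","))
  .map(x => ((x(0).toInt, x(1).toInt, x(2).toInt), x(3).toDouble))

// (sensor_id, year, month) -> average temperature, read from Cassandra
val temp2015 = sc.cassandraTable("ks", "temperatures")
  .map(x => ((x.getInt("sensor_id"), x.getInt("year"), x.getInt("month")), x.getDouble("avgTemp")))

// join yields (key, (temp2015, temp2014)); keep the keys where the Cassandra value is higher
// note: the key includes year, so rows pair up only where the CSV's year matches the table's
val hotter = temp2015.join(temp2014).filter(x => x._2._1 > x._2._2)

• Non-Partition Key Predicates

csc.sql("SELECT * FROM ks.temperatures WHERE temp > 100")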
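The shape of a keyed join is easy to get wrong, so here is a small stand-in using ordinary Scala collections instead of RDDs. It is a sketch under the assumption that both sides are keyed by (sensor_id, year, month) with a Double value; `JoinSketch`, `join`, and `hotter` are hypothetical names. The joined element is (key, (left, right)), so the two temperatures are `x._2._1` and `x._2._2`.

```scala
// Plain-collections stand-in for a keyed join; no Spark or Cassandra required.
object JoinSketch {
  type Key = (Int, Int, Int) // (sensor_id, year, month)

  // Inner join that mirrors the (key, (left, right)) result shape of a pair-RDD join.
  def join(left: Seq[(Key, Double)], right: Seq[(Key, Double)]): Seq[(Key, (Double, Double))] = {
    val rightByKey = right.groupBy(_._1)
    for {
      (k, l) <- left
      (_, r) <- rightByKey.getOrElse(k, Seq.empty)
    } yield (k, (l, r))
  }

  // Keep only the keys where the left-hand value is strictly higher than the right-hand one.
  def hotter(left: Seq[(Key, Double)], right: Seq[(Key, Double)]): Seq[(Key, (Double, Double))] =
    join(left, right).filter(x => x._2._1 > x._2._2)
}
```

Keys that appear on only one side drop out, as they would in an inner join on the cluster.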


Tools

• ODBC and JDBC tools via SparkSQL
  – Tableau, Pentaho, R, etc.

• Apache Zeppelin (incubating)
  – A web-based notebook that enables interactive data analytics.



Quick word on Spark Streaming and Cassandra

• Very good combination
  – Simple, powerful, useful, scalable, etc.

[Diagram: Receiver]


Quick word on Spark Streaming and Cassandra


import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.streaming._

// Spark connection options
val conf = new SparkConf(true)...

// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))

// stream input
val lines = ssc.socketTextStream(serverIP, serverPort)

// count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// stream output
wordCounts.saveToCassandra("test", "words")

// start processing
ssc.start()
ssc.awaitTermination()
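To see what one batch of the word count computes, the same flatMap / map / reduce-by-key steps can be run on an ordinary collection. This is a sketch with hypothetical names (`WordCountSketch`, `countWords`) and needs no Spark; the empty-string filter is an addition that the streaming snippet does not have.

```scala
// Plain-collections version of the per-batch word count.
object WordCountSketch {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))        // split each line into words
      .filter(_.nonEmpty)           // added here: drop empty tokens from blank lines
      .map(word => (word, 1))       // pair each word with a count of 1
      .groupBy(_._1)                // group the pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum counts per word
}
```

In the streaming version the counts are computed per batch window rather than accumulated across batches, and because Cassandra writes are upserts, a later batch would overwrite the stored count for a word if the word is the table's primary key (the schema of test.words is not shown in the slides).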


DataStax Enterprise


Combines Cassandra, Spark, and Solr (and more!)

- Fault Tolerance
- Management
- Visual Monitoring
- Security
- ETC!


Cassandra + Spark

• Unleash the power of analytics
• On your operational data
  – IoT, Web, Mobile, etc.


“Because that’s where the Data is.”


Contacts and Links

• Links
  – Cassandra Summit: http://cassandrasummit-datastax.com/
  – DataStax Academy: https://academy.datastax.com/

• Contacts
  – Kevin Pardue, Regional Channel Manager: [email protected]
  – Brian Hess, Sr Product Manager for Analytics: [email protected]
  – Devin Saxon, Marketing Specialist: [email protected]

