Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

25
A Tale of Two Technologies A Story of the IoT “revolution”

description

0.) With each device comes an implicit contract with the end user: you give us the data, we give you the results. Now. Not tomorrow. Not even fifteen minutes from now. 1.) The flip side of getting data in real time is that users expect results in real time. 2.) In return for being wired into the internet 24x7 customers demand a similar level of responsiveness and even better availability. __ Spark Streaming and Cassandra form the ideal combination of high velocity CEP and analytics with a high velocity and always on database. Today’s solutions don’t scale to the Internet of tomorrow. The always-on nature of the emerging Internet of Things space means you need to process information at previously unseen scale and, more difficult, make sense out of that data. Cassandra is the leader in large scale, high velocity, time series data workloads. While the Hadoop world has been stuck with legacy “batch analytics” technology, Cassandra users have been increasingly focused on the “now”. Fast answers to easy questions about your data, at any velocity, and any scale. But Cassandra has always been weak on the “complex questions” problem. DataStax integrated with Hadoop to overcome this limitation, but it was always an awkward fit. Slow batch analytics on top of fast moving data really doesn’t do you much good. But Spark, and in this case, Spark Streaming, make high velocity streaming analytics at scale easier than ever, similar to how Cassandra pioneered high-velocity data management at scale. Hadoop is the right choice for batch analytics. Until recently, nobody really knew what the right solution is for real-time processing. We believe that Spark and Cassandra are the clear answer.

Transcript of Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Page 1: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

A Tale of Two Technologies

A Story of the IoT “revolution”

Page 2: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

It was the age of connectedness

Page 3: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

It was the age of disconnectedness

Failure tolerance is not optional

Page 4: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

It was the age of wisdom

Page 5: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

It was the age of foolishness

Page 6: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

It was the epoch of complexity

Did you just tell me to go #@$#@$ myself?

Page 7: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

It was the epoch of simplicity

We brought as much simplicity as we could to the menagerie known as Hadoop, but it’s still slow.

Page 8: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

What is the Internet of Things?

“The Internet of Things (IoT) refers to uniquely identifiable objects and their

virtual representations in an Internet-like structure.”-Wikipedia

Page 9: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

No really what is IoT?

● It’s literally the act of connecting “things” to the Internet

● It predates the World Wide Web● It shouldn’t be surprising to anybody

Page 10: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

So IoT is old news?

Most definitely

Page 11: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

So IoT is just hype?

Home security and automation

Fitness trackers

Your (driverless?) car

Medical devicesSensors

Industrial equipment

NO!

Page 12: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

What is Cassandra?● A massively scalable distributed database● Chooses availability over strong consistency

(yes, that really is a fundamental tradeoff)● With its wide partitions it is able to take

advantage of data locality even at “web scale”

Chocolate!

Page 13: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

What is Spark?● DAG is a logical superset of M/R● Adopts much of the Hadoop ecosystem,

without being bound by it● Intelligent use of caching (RDDs) for

massive performance gains● Incorporate Streaming to make ingestion-

time processing a first class citizen

Peanut butter!

Page 14: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

What does IoT need from big data?

● Log time-series events -- at scale● Gather meaning from that data -- at scale● Report on that data -- at scale● Take action on that data -- at scale

Page 15: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Logging events at scale

Page 16: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Gathering meaning at scale

Page 17: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Reporting at scale● Canned reports● Ad-hoc querying and reporting● Drill down / exploratory ● Alerting● Aggregation● Clustering (K-means, et al)● Generalized machine learning

Page 18: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Take action at scale● Stateless application servers● Horizontally scaled and co-located with

Cassandra and Spark in each DC● Any platform with a CQL driver

Page 19: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

The architecture...

Page 20: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Spark Streaming to Cassandra

Page 21: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Cassandra to Spark Streaming

Page 22: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Multi-DC

DC1 DC2

Write anywhere.

Page 23: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Things that go together

Page 24: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Things that go together

Page 25: Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

A Tale of Two Summits