Real-Time Analytics with Apache Cassandra and Apache Spark,
-
Upload
swiss-data-forum-swiss-data-forum -
Category
Data & Analytics
-
view
186 -
download
3
Embed Size (px)
Transcript of Real-Time Analytics with Apache Cassandra and Apache Spark,

BÂLE BERNE BRUGG DUSSELDORF FRANCFORT S.M. FRIBOURG E.BR. GENÈVE HAMBOURG COPENHAGUE LAUSANNE MUNICH STUTTGART VIENNE ZURICH
Real-Time Analytics with Apache Cassandra and Apache SparkGuido Schmutz

Guido Schmutz
• Working for Trivadis for more than 18 years• Oracle ACE Director for Fusion Middleware and SOA• Author of different books• Consultant, Trainer Software Architect for Java, Oracle, SOA and
Big Data / Fast Data• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: [email protected]• Blog: http://guidoschmutz.wordpress.com• Twitter: gschmutz

Agenda
1. Introduction2. Apache Spark3. Apache Cassandra4. Combining Spark & Cassandra5. Summary

Big Data Definition (4 Vs)
+Timetoaction?– BigData+Real-Time=StreamProcessing
CharacteristicsofBigData:ItsVolume,VelocityandVarietyincombination

What is Real-Time Analytics?
What is it? Why do we need it?
How does it work?• Collect real-time data• Process data as it flows in• Data in Motion over Data at
Rest• Reports and Dashboard
access processed data TimeEvents RespondAnalyze
Shorttimetoanalyze&respond
§ Required - fornewbusinessmodels
§ Desired - forcompetitiveadvantage

Real Time Analytics Use Cases
• Algorithmic Trading
• Online Fraud Detection
• Geo Fencing
• Proximity/Location Tracking
• Intrusion detection systems
• Traffic Management
• Recommendations
• Churn detection
• Internet of Things (IoT) / Intelligence
Sensors
• Social Media/Data Analytics
• Gaming Data Feed
• …

Apache Spark

Motivation – Why Apache Spark?
Hadoop MapReduce: Data Sharing on Disk
Spark: Speed up processing by using Memory instead of Disks
map reduce . . .Input
HDFSread
HDFSwrite
HDFSread
HDFSwrite
op1 op2 . . .Input
Output
Output

Apache Spark
Apache Spark is a fast and general engine for large-scale data processing• The hot trend in Big Data!• Originally developed 2009 in UC Berkley’s AMPLab• Based on 2007 Microsoft Dryad paper• Written in Scala, supports Java, Python, SQL and R• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x
faster on disk• One of the largest OSS communities in big data with over 200 contributors in 50+
organizations• Open Sourced in 2010 – since 2014 part of Apache Software foundation

Apache Spark
SparkSQL(BatchProcessing)
BlinkDB(ApproximateQuerying)
SparkStreaming(Real-Time)
MLlib,SparkR(MachineLearning)
GraphX(GraphProcessing)
SparkCoreAPIandExecutionModel
SparkStandalone MESOS YARN HDFS Elastic
Search NoSQL S3
Libraries
CoreRuntime
ClusterResourceManagers DataStores

Resilient Distributed Dataset (RDD)
Are• Immutable• Re-computable• Fault tolerant• Reusable
Have Transformations• Produce new RDD• Rich set of transformation available
• filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...
Have Actions• Start cluster computing operations• Rich set of action available
• collect(), count(), fold(), reduce(), count(), …

RDD RDD
Input Source
• File• Database• Stream• Collection
.count() ->100
Data

Partitions RDD
Data
Partition0
Partition1
Partition2
Partition3
Partition4
Partition5
Partition6
Partition7
Partition8
Partition9
Server1
Server2
Server3
Server4
Server5

Partitions RDD
Data
Partition0
Partition1
Partition2
Partition3
Partition4
Partition5
Partition6
Partition7
Partition8
Partition9
Server1
Server2
Server3
Server4
Server5

Partitions RDD
Data
Partition0
Partition1
Partition2
Partition3
Partition4
Partition5
Partition6
Partition7
Partition8
Partition9
Server2
Server3
Server4
Server5

Stage 1 – reduceByKey()
Stage 1 – flatMap() + map()
Spark Workflow InputHDFSFile
HadoopRDD
MappedRDD
ShuffledRDD
TextFileOutput
sc.hapoopFile()
map()
reduceByKey()
sc.saveAsTextFile()
Transformations(Lazy)
Action(Execute
Transformations)
Master
MappedRDD
P0
P1
P3
ShuffledRDD
P0
MappedRDD
flatMap()
DAGScheduler

Spark Workflow HDFSFileInput1
HadoopRDD
FilteredRDD
MappedRDD
ShuffledRDD
HDFSFileOutput
HadoopRDD
MappedRDD
HDFSFileInput2SparkContext.hadoopFile()
SparkContext.hadoopFile()filter()
map() map()
join()
SparkContext.saveAsHadoopFile()
Transformations(Lazy)
Action(ExecuteTransformations)

Spark Execution Model
DataStorage
Worker
Master
Executer
Executer
Server
Executer

Stage 1 – flatMap() + map()
Spark Execution Model
DataStorage
Worker
Master
Executer
DataStorage
Worker
Executer
DataStorage
Worker
Executer
RDD
P0
P1
P3
NarrowTransformationMaster
filter()map()sample()flatMap()
DataStorage
Worker
Executer

Stage 2 – reduceByKey()
Spark Execution Model
DataStorage
Worker
Executer
DataStorage
Worker
Executer
RDD
P0
WideTransformation
Master
join()reduceByKey()union()groupByKey()
Shuffle!
DataStorage
Worker
Executer
DataStorage
Worker
Executer

Batch vs. Real-Time Processing
PetabytesofData
Gigaby
tes
PerS
econ
d

Various Input Sources

Apache Kafka
distributed publish-subscribe messaging system
Designed for processing of real time activity stream data (logs, metrics collections,
social media streams, …)
Initially developed at LinkedIn, now part of Apache
Does not use JMS API and standards
Kafka maintains feeds of messages in topics Kafka Cluster
Consumer Consumer Consumer
Producer Producer Producer

Apache Kafka
Kafka Broker
Temperature Processor
TemperatureTopic
RainfallTopic
1 2 3 4 5 6
RainfallProcessor1 2 3 4 5 6
WeatherStation

Apache Kafka
Kafka Broker
Temperature Processor
TemperatureTopic
RainfallTopic
1 2 3 4 5 6
RainfallProcessor
Partition0
1 2 3 4 5 6Partition0
1 2 3 4 5 6Partition1 Temperature
ProcessorWeatherStation

ApacheKafka
Kafka Broker
Temperature Processor
WeatherStation
TemperatureTopic
RainfallTopic
RainfallProcessor
P0
Temperature Processor
1 2 3 4 5
P1 1 2 3 4 5
Kafka BrokerTemperatureTopic
RainfallTopic
P0 1 2 3 4 5
P1 1 2 3 4 5
P0 1 2 3 4 5
P0 1 2 3 4 5

Discretized Stream (DStream)
Kafka
WeatherStation
WeatherStation
WeatherStation

Discretized Stream (DStream)
Kafka
WeatherStation
WeatherStation
WeatherStation

Discretized Stream (DStream)
Kafka
WeatherStation
WeatherStation
WeatherStation

Discretized Stream (DStream)
Kafka
WeatherStation
WeatherStation
WeatherStation Discretebytime
IndividualEvent
DStream =RDD

Discretized Stream (DStream)
DStream DStream
XSeconds
Transform
.countByValue()
.reduceByKey()
.join
.map

Discretized Stream (DStream)time1 time2 time3
message
timen….
f(message 1)RDD@time1
f(message 2)
f(message n)
….
message 1RDD@time1
message 2
message n
….
result 1
result 2
result n
….
message message message
f(message 1)RDD@time2
f(message 2)
f(message n)
….
message 1RDD@time2
message 2
message n
….
result 1
result 2
result n
….
f(message 1)RDD@time3
f(message 2)
f(message n)
….
message 1RDD@time3
message 2
message n
….
result 1
result 2
result n
….
f(message 1)RDD@timen
f(message 2)
f(message n)
….
message 1RDD@timen
message 2
message n
….
result 1
result 2
result n
….
InputStream
EventDStream
MappedDStreammap()
saveAsHadoopFiles()
Time Increasing
DStream
Transformation
Lineage
Actio
nsTrig
ger
SparkJobs
Adapted fromChrisFregly: http://slidesha.re/11PP7FV

Apache Spark Streaming – Core concepts
Discretized Stream (DStream)• Core Spark Streaming abstraction
• micro batches of RDD’s
• Operations similar to RDD
Input DStreams• Represents the stream of raw data received
from streaming sources
• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP Socket, Akka actors, etc.
• Custom Sources can be easily written for custom data sources
Operations• Same as Spark Core + Additional Stateful
transformations (window, reduceByWindow)

Apache Cassandra

Apache Cassandra
Apache Cassandra™ is a free
• Distributed…
• High performance…
• Extremely scalable…
• Fault tolerant (i.e. no single point of failure)…
post-relational database solution
Optimized for high write throughput

Apache Cassandra - HistoryBigtable Dynamo

Motivation - Why NoSQL Databases?
aaa • Dynamo Paper (2007)
• How to build a data store that is
• Reliable
• Performant
• “Always On”
• Nothing new and shiny• 24 other papers cited
• Evolutionary

Motivation - Why NoSQL Databases?
• Google Big Table (2006)
• Richer data model
• 1 key and lot’s of values
• Fast sequential access
• 38 other papers cited

Motivation - Why NoSQL Databases?
• Cassandra Paper (2008)
• Distributed features of Dynamo
• Data Model and storage from BigTable
• February 2010 graduated to a top-level Apache
Project

Apache Cassandra – More than one server
All nodes participate in a clusterShared nothingAdd or remove as neededMore capacity? Add more servers
Node is a basic unit inside a cluster
Each node owns a range of partitionsConsistent Hashing
Node1
Node2
Node3
Node4 [26-50]
[0-25]
[51-75]
[76-100] [0-25]
[0-25][26-50]
[26-50][51-75]
[51-75][76-100]
[76-100]

Apache Cassandra – Fully Replicated
Client writes localData syncs across WANReplication per Data Center
Node1
Node2
Node3
Node4
Node1
Node2
Node3
Node4
WestEastClient

Apache Cassandra
What is Cassandra NOT?
• A Data Ocean• A Data Lake• A Data Pond
• An In-Memory Database
• A Key-Value Store
• Not for Data Warehousing
What are good use cases?
• Product Catalog / Playlists
• Personalization (Ads, Recommendations)
• Fraud Detection
• Time Series (Finance, Smart Meter)
• IoT / Sensor Data
• Graph / Network data

How Cassandra stores data
• Model brought from Google Bigtable• Row Key and a lot of columns• Column names sorted (UTF8, Int, Timestamp, etc.)
ColumnName … Column Name
ColumnValue ColumnValue
Timestamp Timestamp
TTL TTL
RowKey
1 2Billion
Billion
ofR
ows

Combining Spark & Cassandra

Spark and Cassandra Architecture – Great Combo
Goodatanalyzingahugeamountofdata
Goodatstoringahugeamountofdata

Spark and Cassandra Architecture
SparkStreaming(NearReal-Time)
SparkSQL(StructuredData)
MLlib(MachineLearning)
GraphX(GraphAnalysis)

Spark and Cassandra Architecture
SparkConnector
WeatherStation
SparkStreaming(NearReal-Time)
SparkSQL(StructuredData)
MLlib(MachineLearning)
GraphX(GraphAnalysis)
WeatherStation
WeatherStation
WeatherStation
WeatherStation

Spark and Cassandra Architecture
• Single Node running Cassandra
• Spark Worker is really small
• Spark Master lives outside a node
• Spark Worker starts Spark Executer in separate JVM
• Node local
Worker
Master
Executer
Executer
Server
Executer

Spark and Cassandra Architecture
Worker
Worker
Worker
Master
Worker
• Each node runs Spark and Cassandra
• Spark Master can make decisions based on Token Ranges
• Spark likes to work on small partitions of data across a large cluster
• Cassandra likes to spread out data in a large cluster
0-25
26-50
51-75
76-100
Willonly havetoanalyze25%
ofdata!

Spark and Cassandra Architecture
Master0-25
26-50
51-75
76-100
Worker
Worker
WorkerWorker
0-25
26-50
51-75
76-100
Transactional Analytics

Cassandra and Spark
Cassandra Cassandra&Spark
JoinsandUnions No Yes
Transformations Limited Yes
OutsideDataIntegration No Yes
Aggregations Limited Yes

Summary

Summary
Kafka• Topics store information broken into
partitions• Brokers store partitions• Partitions are replicated for data
resilience
Cassandra• Goals of Apache Cassandra are all
about staying online and performant• Best for applications close to your users• Partitions are similar data grouped by a
partition key
Spark• Replacement for Hadoop Map Reduce• In memory• More operations than just Map and Reduce• Makes data analysis easier• Spark Streaming can take a variety of sources
Spark + Cassandra• Cassandra acts as the storage layer for Spark• Deploy in a mixed cluster configuration• Spark executors access Cassandra using the
DataStax connector

Lambda Architecture with Spark/Cassandra
DataCollection
(Analytical)BatchDataProcessing
Batchcompute
ResultStoreDataSources
Channel
DataAccess
Reports
Service
AnalyticTools
AlertingTools
Social
(Analytical)Real-TimeDataProcessing
Stream/EventProcessing
Batchcompute
Messaging
ResultStore
QueryEngine
ResultStore
ComputedInformation
RawData(Reservoir)

Lambda Architecture with Spark/Cassandra
DataCollection
(Analytical)BatchDataProcessing
Batchcompute
ResultStoreDataSources
Channel
DataAccess
Reports
Service
AnalyticTools
AlertingTools
Social
(Analytical)Real-TimeDataProcessing
Stream/EventProcessing
Batchcompute
Messaging
ResultStore
QueryEngine
ResultStore
ComputedInformation
RawData(Reservoir)
