Real-Time Analytics with Apache Cassandra and Apache Spark
Transcript of Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz
Guido Schmutz
• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Author of different books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: [email protected]
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz
Agenda
1. Introduction
2. Apache Spark
3. Apache Cassandra
4. Combining Spark & Cassandra
5. Summary
Big Data Definition (4 Vs)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Time to action? Big Data + Real-Time = Stream Processing
What is Real-Time Analytics?
What is it? Why do we need it? How does it work?
• Collect real-time data
• Process data as it flows in
• Data in Motion over Data at Rest
• Reports and dashboards access the processed data
[Diagram: Events → Analyze → Respond, over time – a short time to analyze & respond]
• Required for new business models
• Desired for competitive advantage
Real Time Analytics Use Cases
• Algorithmic Trading
• Online Fraud Detection
• Geo Fencing
• Proximity/Location Tracking
• Intrusion detection systems
• Traffic Management
• Recommendations
• Churn detection
• Internet of Things (IoT) / Intelligent Sensors
• Social Media/Data Analytics
• Gaming Data Feed
• …
Apache Spark
Motivation – Why Apache Spark?
Hadoop MapReduce shares data on disk:
Input → map → HDFS write → HDFS read → reduce → HDFS write → … → Output
Spark speeds up processing by using memory instead of disks:
Input → op1 → op2 → … → Output
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed in 2009 in UC Berkeley's AMPLab
• Based on the 2007 Microsoft Dryad paper
• Written in Scala; supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
• Open sourced in 2010; part of the Apache Software Foundation since 2014
Apache Spark
Libraries:
• Spark SQL (batch processing)
• BlinkDB (approximate querying)
• Spark Streaming (real-time)
• MLlib, SparkR (machine learning)
• GraphX (graph processing)
Core runtime: Spark Core API and execution model
Cluster resource managers: Spark Standalone, Mesos, YARN
Data stores: HDFS, Elasticsearch, NoSQL, S3
Resilient Distributed Dataset (RDD)
RDDs are:
• Immutable
• Re-computable
• Fault tolerant
• Reusable
RDDs have transformations, which produce a new RDD; a rich set is available:
• filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...
RDDs have actions, which start cluster computing operations; a rich set is available:
• collect(), count(), fold(), reduce(), …
An RDD is created from an input source:
• File
• Database
• Stream
• Collection
Example: rdd.count() → 100
Partitions
[Diagram: an RDD's data is split into partitions (Partition 0–9) spread across servers (Server 1–5); when a server is lost, its partitions are recomputed on the remaining servers.]
Spark Workflow
Input HDFS file → sc.hadoopFile() → HadoopRDD → flatMap() + map() → MappedRDD → reduceByKey() → ShuffledRDD → saveAsTextFile() → text file output
• Transformations are lazy; the action executes the transformations
• The DAGScheduler on the master splits the job into stages: Stage 1 (flatMap() + map(), over partitions P0, P1, P3) and Stage 2 (reduceByKey())
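To make the lazy-evaluation point concrete, here is a minimal sketch in plain Python (not Spark itself) of a word-count pipeline in the style of the workflow above: the `LazyRDD` class and its method names are invented for illustration; transformations only record a plan, and the action replays it.

```python
# Sketch: lazy transformations + eager action, mimicking the Spark model.
from collections import defaultdict
from functools import reduce

class LazyRDD:
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []          # recorded transformations (lazy)

    def flat_map(self, f):
        return LazyRDD(self._data, self._plan + [("flat_map", f)])

    def map(self, f):
        return LazyRDD(self._data, self._plan + [("map", f)])

    def reduce_by_key(self, f):
        return LazyRDD(self._data, self._plan + [("reduce_by_key", f)])

    def collect(self):                   # action: executes the recorded plan
        records = list(self._data)
        for kind, f in self._plan:
            if kind == "flat_map":
                records = [y for x in records for y in f(x)]
            elif kind == "map":
                records = [f(x) for x in records]
            elif kind == "reduce_by_key":
                groups = defaultdict(list)
                for k, v in records:
                    groups[k].append(v)
                records = [(k, reduce(f, vs)) for k, vs in groups.items()]
        return records

lines = LazyRDD(["to be", "or not to be"])
counts = (lines.flat_map(str.split)          # Stage 1: narrow
               .map(lambda w: (w, 1))
               .reduce_by_key(lambda a, b: a + b))   # Stage 2: wide
print(sorted(counts.collect()))              # nothing ran until collect()
```

Nothing is computed when `counts` is defined; only `collect()` walks the recorded plan, which is why Spark can build a DAG of stages before executing anything.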
Spark Workflow
Two inputs joined:
• HDFS file input 1 → SparkContext.hadoopFile() → HadoopRDD → filter() → FilteredRDD → map() → MappedRDD
• HDFS file input 2 → SparkContext.hadoopFile() → HadoopRDD → map() → MappedRDD
• join() → ShuffledRDD → SparkContext.saveAsHadoopFile() → HDFS file output
Again, the transformations are lazy; the action executes them.
Spark Execution Model
• The master distributes work to the workers; each worker starts executors in separate processes on its server, close to the data storage
• Stage 1 – narrow transformations (filter(), map(), sample(), flatMap()): each executor processes its own partitions of the RDD (P0, P1, P3, …); no data moves between nodes
• Stage 2 – wide transformations (join(), reduceByKey(), union(), groupByKey()): partitions have to be recombined across nodes – a shuffle!
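A tiny sketch in plain Python (not Spark) of the narrow/wide distinction described above; the partitioned data and hash-partitioning scheme are illustrative:

```python
# Two partitions of (key, value) records, as an executor would hold them.
partitions = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]

# Narrow transformation: map runs inside each partition independently,
# so no data moves between partitions.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide transformation: reduceByKey must first shuffle so that all records
# with the same key land in the same partition (hash-partition by key),
# then reduce locally within each partition.
num_parts = 2
shuffled = [[] for _ in range(num_parts)]
for part in mapped:
    for k, v in part:
        shuffled[hash(k) % num_parts].append((k, v))

reduced = []
for part in shuffled:
    acc = {}
    for k, v in part:
        acc[k] = acc.get(k, 0) + v
    reduced.append(sorted(acc.items()))

print(reduced)   # per-partition totals: a=20, b=10, c=10
```

The shuffle step is exactly what makes wide transformations expensive: it is the only place where records cross partition (and hence node) boundaries.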
Batch vs. Real-Time Processing
• Batch: petabytes of data
• Real-time: gigabytes per second
Various Input Sources
Apache Kafka
A distributed publish-subscribe messaging system
• Designed for processing real-time activity stream data (logs, metrics collection, social media streams, …)
• Initially developed at LinkedIn, now part of Apache
• Does not use the JMS API and standards
• Kafka maintains feeds of messages in topics
[Diagram: Producers publish into a Kafka cluster; Consumers subscribe to it.]
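The essence of this model can be sketched in a few lines of plain Python (this is not the Kafka API; the `Topic` class and its methods are invented for illustration): a topic is a set of append-only partition logs, producers append, and consumers read from an offset without removing anything.

```python
# Conceptual sketch of a Kafka topic: append-only partition logs.
class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Messages with the same key land in the same partition,
        # which preserves per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p

    def consume(self, partition, offset):
        # Reading does not delete: other consumers (or the same one,
        # re-reading) see the same messages from their own offsets.
        return self.partitions[partition][offset:]

temperature = Topic(num_partitions=2)
for i in range(4):
    temperature.produce("station-1", f"23.{i} C")

p = hash("station-1") % 2
print(temperature.consume(p, offset=0))   # all four readings, in order
print(temperature.consume(p, offset=2))   # a consumer further along the log
```

Because consumption is just "read from an offset", many independent processors (like the temperature and rainfall processors below) can consume the same topic at their own pace.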
Apache Kafka
[Diagram: a weather station produces into a Kafka broker holding a Temperature topic and a Rainfall topic; messages 1–6 are appended to each topic and read in order by a Temperature processor and a Rainfall processor. A topic can be split into partitions (Partition 0, Partition 1) so that several Temperature processor instances consume in parallel, and partitions can be spread and replicated across brokers.]
Discretized Stream (DStream)
[Diagram: individual events from the weather stations flow through Kafka into Spark Streaming, which groups them into batches, discrete by time; each time slice of the DStream is an RDD.]
Discretized Stream (DStream)
Every X seconds a new RDD is produced; transformations turn one DStream into another:
• countByValue()
• reduceByKey()
• join()
• map()
Discretized Stream (DStream)
[Diagram: at each batch interval (time1, time2, time3, …, timen) the messages received from the input stream form an RDD (RDD @ time1, message 1 … message n); applying map(f) yields a MappedDStream whose RDD at each interval holds f(message 1) … f(message n), and the saveAsHadoopFiles() action triggers the Spark jobs. Over increasing time, the DStream transformations build up a lineage.]
Adapted from Chris Fregly: http://slidesha.re/11PP7FV
Apache Spark Streaming – Core concepts
Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDDs
Input DStreams
• Represent the stream of raw data received from streaming sources
• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP sockets, Akka actors, etc.
• Custom sources can easily be written for custom data sources
Operations
• Same as Spark Core, plus additional stateful transformations (window, reduceByWindow)
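The micro-batching idea above can be sketched in plain Python (not the Spark Streaming API; the `discretize` helper is invented for illustration): timestamped events are cut into fixed intervals, and the same function is applied to each interval's batch.

```python
# Sketch: discretize a stream of (timestamp, value) events into
# micro-batches, then transform each batch -- the essence of a DStream.
def discretize(events, batch_seconds):
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // batch_seconds), []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.5, "rain"), (1.2, "rain"), (1.9, "sun"), (3.1, "sun")]
dstream = discretize(events, batch_seconds=2)   # each batch plays the RDD role

# The same transformation runs on every micro-batch (like countByValue()).
counts_per_batch = [{v: batch.count(v) for v in set(batch)} for batch in dstream]
print(dstream)            # [['rain', 'rain', 'sun'], ['sun']]
print(counts_per_batch)   # [{'rain': 2, 'sun': 1}, {'sun': 1}]
```

This is why DStream operations look "similar to RDDs": each operation is simply applied to the RDD of every time slice.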
Apache Cassandra
Apache Cassandra
Apache Cassandra™ is a free
• Distributed…
• High-performance…
• Extremely scalable…
• Fault-tolerant (i.e. no single point of failure)…
post-relational database solution, optimized for high write throughput
Apache Cassandra – History: Bigtable and Dynamo
Motivation - Why NoSQL Databases?
• Dynamo paper (2007)
• How to build a data store that is:
  • Reliable
  • Performant
  • "Always on"
• Nothing new and shiny: 24 other papers cited
• Evolutionary
Motivation - Why NoSQL Databases?
• Google Bigtable paper (2006)
• Richer data model: one key and lots of values
• Fast sequential access
• 38 other papers cited
Motivation - Why NoSQL Databases?
• Cassandra paper (2008)
• Distributed features of Dynamo
• Data model and storage from Bigtable
• Graduated to a top-level Apache project in February 2010
Apache Cassandra – More than one server
• All nodes participate in a cluster
• Shared nothing
• Add or remove nodes as needed – more capacity? Add more servers
• A node is the basic unit inside a cluster
• Each node owns a range of partitions, assigned by consistent hashing
[Diagram: four nodes on a ring – Node 1 owns [0-25], Node 2 [26-50], Node 3 [51-75], Node 4 [76-100] – and each node additionally holds replicas of its neighbors' ranges.]
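A minimal sketch of the consistent-hashing idea (a deliberate simplification of what Cassandra actually does; the token space, node names, and hash choice here are all invented for illustration):

```python
# Sketch: nodes own token ranges on a ring; a partition key hashes to a
# token, and the row lives on the node owning that token's range.
import bisect
import hashlib

ring = [(25, "Node1"), (50, "Node2"), (75, "Node3"), (100, "Node4")]
tokens = [t for t, _ in ring]

def token(partition_key, space=100):
    # md5 used only to get a stable, well-spread hash for the sketch.
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % space

def owner(partition_key):
    # First node whose range upper bound is >= the key's token
    # (wrapping around to the first node on the ring).
    i = bisect.bisect_left(tokens, token(partition_key))
    return ring[i % len(ring)][1]

print(token("station-42"), owner("station-42"))
```

The payoff of this scheme: adding or removing a node only remaps the keys in the affected range, instead of rehashing everything.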
Apache Cassandra – Fully Replicated
• Clients write locally
• Data syncs across the WAN
• Replication is configured per data center
[Diagram: a client writes to the four-node West data center; the data replicates to the four-node East data center.]
Apache Cassandra
What is Cassandra NOT?
• A Data Ocean
• A Data Lake
• A Data Pond
• An in-memory database
• A key-value store
• Not for data warehousing
What are good use cases?
• Product Catalog / Playlists
• Personalization (Ads, Recommendations)
• Fraud Detection
• Time Series (Finance, Smart Meter)
• IoT / Sensor Data
• Graph / Network data
How Cassandra stores data
• Model brought from Google Bigtable
• A row key and a lot of columns
• Column names are sorted (UTF8, int, timestamp, etc.)
Each row key maps to up to 2 billion columns, each column holding a name, a value, a timestamp and an optional TTL; a table can hold billions of rows.
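A sketch of this storage model in plain Python (an illustration, not Cassandra's actual on-disk format; the data and helper are invented): one row key, many columns sorted by name, each cell carrying a write timestamp.

```python
# Sketch: Bigtable-style row layout -- row key -> sorted columns,
# each cell = (value, write timestamp).
import time

table = {}   # row key -> {column name: (value, timestamp)}

def insert(row_key, column, value, ts=None):
    table.setdefault(row_key, {})[column] = (value, ts or time.time())

# Classic time-series pattern: sensor id as row key, reading time as the
# column name, so one row is a naturally ordered slice of the series.
insert("station-1", "2015-06-01T10:00", 21.5)
insert("station-1", "2015-06-01T12:00", 24.0)
insert("station-1", "2015-06-01T11:00", 22.8)

# Columns come back sorted by name -> chronological for this model.
row = sorted(table["station-1"].items())
print([(col, val) for col, (val, _) in row])
```

The sorted-column property is what makes Cassandra a good fit for the time-series and sensor-data use cases listed above: a range of a row's columns is a contiguous, ordered read.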
Combining Spark & Cassandra
Spark and Cassandra Architecture – Great Combo
Spark: good at analyzing a huge amount of data
Cassandra: good at storing a huge amount of data
Spark and Cassandra Architecture
• Spark Streaming (near real-time)
• Spark SQL (structured data)
• MLlib (machine learning)
• GraphX (graph analysis)
Spark and Cassandra Architecture
[Diagram: weather stations feed events into Spark (Spark Streaming, Spark SQL, MLlib, GraphX), which reads from and writes to Cassandra through the Spark connector.]
Spark and Cassandra Architecture
• A single node runs Cassandra
• The Spark worker is really small
• The Spark master lives outside the node
• The Spark worker starts Spark executors in separate JVMs
• Node-local
[Diagram: on one server, the master talks to a worker that manages several executors.]
Spark and Cassandra Architecture
• Each node runs Spark and Cassandra
• The Spark master can make decisions based on token ranges
• Spark likes to work on small partitions of data across a large cluster
• Cassandra likes to spread out data in a large cluster
[Diagram: four workers, each co-located with the Cassandra node owning one token range (0-25, 26-50, 51-75, 76-100); a job over one range will only have to analyze 25% of the data!]
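The locality idea in the mixed-cluster setup can be sketched in plain Python (this is not the DataStax connector; the node names, ranges, and `assign` helper are invented for illustration): the master prefers to schedule each token range on the worker co-located with the Cassandra node that owns it.

```python
# Sketch: locality-aware assignment of token ranges to Spark workers.
token_ranges = {"0-25": "node-a", "26-50": "node-b",
                "51-75": "node-c", "76-100": "node-d"}  # Cassandra ownership
workers = {"node-a", "node-b", "node-c", "node-d"}       # Spark workers

def assign(token_ranges, workers):
    plan = {}
    for rng, owner in token_ranges.items():
        # Prefer the co-located worker; fall back to any worker if the
        # owning node has no Spark worker.
        plan[rng] = owner if owner in workers else sorted(workers)[0]
    return plan

plan = assign(token_ranges, workers)
print(plan)
# Every range is processed where its data lives, so tasks read locally;
# a job over a single range touches only 25% of the cluster's data.
```

This is the reason the slides recommend deploying Spark and Cassandra in a mixed cluster: co-location turns most reads into local reads.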
Cassandra and Spark: Transactional vs. Analytics

                           Cassandra   Cassandra & Spark
  Joins and Unions         No          Yes
  Transformations          Limited     Yes
  Outside Data Integration No          Yes
  Aggregations             Limited     Yes
Summary
Kafka
• Topics store information broken into partitions
• Brokers store partitions
• Partitions are replicated for data resilience
Cassandra
• The goals of Apache Cassandra are all about staying online and performant
• Best for applications close to your users
• Partitions are similar data grouped by a partition key
Spark
• Replacement for Hadoop MapReduce
• In memory
• More operations than just map and reduce
• Makes data analysis easier
• Spark Streaming can take a variety of sources
Spark + Cassandra
• Cassandra acts as the storage layer for Spark
• Deploy in a mixed cluster configuration
• Spark executors access Cassandra using the DataStax connector
Lambda Architecture with Spark/Cassandra
[Diagram: data sources (channels, social media, …) feed data collection/messaging; an (analytical) batch data processing layer runs batch compute over the raw data reservoir, and an (analytical) real-time data processing layer runs stream/event processing; both write computed information to result stores, and a query engine serves the data access layer – reports, services, analytic tools and alerting tools.]
Guido Schmutz
Technology Manager