Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
Transcript of Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin
©2013 DataStax Confidential. Do not distribute without consent.
@PatrickMcFadin
Patrick McFadin, Chief Evangelist, DataStax
Spark and Cassandra: An amazing Apache love story
• 10T of high-frequency event data daily
• Constantly increasing volume
“The web server that powers the interface can query both datacenters, depending on which the user is closest to,”
“A small set of signals tend to double every eight months. So we needed a model that can scale linearly.”
- Arun Jayandra, Microsoft
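To put the doubling rate from the quote above into numbers, here is a small arithmetic sketch. Only the eight-month doubling period comes from the quote; the function name `growth_factor` and the sample horizons are illustrative assumptions, and steady exponential growth is assumed.

```python
def growth_factor(months, doubling_period_months=8):
    """Multiplier on volume after `months`, assuming steady exponential doubling."""
    return 2 ** (months / doubling_period_months)


if __name__ == "__main__":
    # Sample horizons chosen for illustration only.
    for months in (8, 16, 24):
        print(f"after {months} months: {growth_factor(months):.0f}x")
```

Under that assumption, volume grows eightfold in two years, which is why capacity that scales by adding nodes matters here.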
REST API
O365
Event Hub
Ingestion Worker (Azure worker role using DataStax C# driver)
C* Analytics
REST API
O365
Kafka
C*/Spark
Streaming Analytics
G4 – Local SSD
Kafka: G4 – Data Disk
ZooKeeper: A7 – Data Disk
PaaS Small
G4 – Local SSD
Cluster 1:
Cluster 2:
20k – 50k events/sec
200k+ events/sec
Data Protection
• Maximilian Schrems v Data Protection Commissioner
• No longer OK to ship EU data to US under “Safe Harbour”
Product_Catalog RF=3
Product_Catalog RF=3
Customer_Data RF=3
Customer_Data RF=0
Product_Catalog RF=3
Customer_Data RF=3
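The RF labels above suggest per-datacenter replication, where a keyspace such as Customer_Data keeps replicas in one datacenter and none in another. Below is a minimal sketch of how such settings could be written as CQL, assuming Cassandra's `NetworkTopologyStrategy`. The datacenter names `eu_dc` and `us_dc` and the helper `create_keyspace_cql` are hypothetical; the slide does not name the datacenters or say which RF belongs to which.

```python
def create_keyspace_cql(keyspace, rf_by_dc):
    """Build a CREATE KEYSPACE statement with per-datacenter replication.

    rf_by_dc maps a datacenter name to its replication factor.
    Datacenter names used below are hypothetical placeholders.
    """
    # A datacenter with RF 0 holds no replicas, so it is left out entirely.
    replicated = {dc: rf for dc, rf in rf_by_dc.items() if rf > 0}
    if not replicated:
        raise ValueError("at least one datacenter needs a replication factor > 0")
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in sorted(replicated.items()))
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


if __name__ == "__main__":
    # The product catalog is replicated everywhere; customer data stays in one DC.
    print(create_keyspace_cql("product_catalog", {"eu_dc": 3, "us_dc": 3}))
    print(create_keyspace_cql("customer_data", {"eu_dc": 3, "us_dc": 0}))
```

With settings like these, the customer keyspace never places replicas in the second datacenter, while the catalog remains readable locally in both.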
• 300k customers
• Report on energy usage
• Predict boiler failure
“We’re dealing largely with time series data, and Spark is 10 to 100 times quicker as it is operating on data in-memory… Cassandra delivers what we need today and if you look at the Internet of Things space; that is what is really useful right now.”
- Jim Anning, British Gas
Hive Active Heating™
Cassandra Only DC
Cassandra + Spark DC
Spark Jobs
Spark Streaming
Home Data Center