Modeling the Smart and Connected City of the Future with Kafka and Spark
Date posted: 15-Apr-2017
Category: Data & Analytics
Eric Frenkiel, CEO & Co-Founder, MemSQL (@ericfrenkiel)
MAKE DATA WORK
December 1-3, 2015, Singapore
MemSQL at a Glance
Enterprise Focused
Real-time database for transactions and analytics
Founded in 2011, based in San Francisco
Founders are former Facebook, SQL Server database engineers
$50 million in funding to date
Our Mission: Make every company a real-time enterprise.
What does a Smart City Look Like?
Our Conception
Our Reality
3.9 billion people live in cities today
An additional 2.5 billion people are projected to join the world's urban population by 2050, with nearly 90 per cent of the increase concentrated in Asia and Africa.
By 2050, we'll add another 2.5 billion people
We need to create sustainable cities
We need to use technology to help us
We don't live in Tomorrowland
We live here
The good news: the Technology of Today can build smart cities.
A Smart City Should Have
City-wide WiFi
City app to report issues
Open-data initiatives to share data with the public
Most importantly, an adaptive IT department
Let's learn how.
A Model Application: MemCity
Capturing data from 1.4 million households
Total AWS hardware costs at $2.35 per hour
MemCity Reach
1.4 million households (approximately the size of Chicago)
Capturing data from 8 devices in each home, every minute
#MemCity
186,667 transactions per second from Kafka > Spark > MemSQL
#MemCity
1.4 Million Households
8 Devices per Household
186K Events per Second
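The event rate follows from the figures on the slides: 1.4 million households, 8 devices each, one reading per device per minute. A minimal sketch of the arithmetic (the function and constant names are illustrative, not from the deck):

```python
# Figures taken from the MemCity slides.
HOUSEHOLDS = 1_400_000
DEVICES_PER_HOUSEHOLD = 8
REPORT_INTERVAL_SECONDS = 60  # each device reports once per minute


def events_per_second(households: int, devices: int, interval_seconds: int) -> float:
    """Average number of device readings arriving per second."""
    return households * devices / interval_seconds


rate = events_per_second(HOUSEHOLDS, DEVICES_PER_HOUSEHOLD, REPORT_INTERVAL_SECONDS)
print(f"{rate:,.0f} events per second")
```

This rounds to the 186,667 transactions per second quoted on the slide, assuming readings are spread evenly across each minute.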
The Real-Time Trinity
Designing the Ideal Real-Time Pipeline
Message Queue
Transformation
Speed/Serving Layer
End-to-End Data Pipeline Under One Second
Kafka
A high-throughput distributed messaging system
Publish and subscribe to Kafka topics
Centralized data transport for the organization
Spark
In-memory execution engine
High-level operators for procedural and programmatic analytics
Faster than MapReduce
MemSQL
In-memory, distributed database
Full transactions and complete durability
Enable real-time, performant applications
Subscribing to Kafka
Event: (2015-07-06T16:43:40.33Z, 329280, 23, 60)
Serialized: 0111001010101111101111100000001010111100001110101100000010010010111
Publish to Kafka Topic
Event added to message queue
Enrich and Transform the Data
Spark polling Kafka for new messages
Deserialization: 0111001010101111101111100000001010111100001110101100000010010010111 to (2015-07-06T16:43:40.33Z, 329280, 23, 60)
Enrichment: (2015-07-06T16:43:40.33Z, 329280, 23, 60) to (2015-07-06T16:43:40.33Z, 329280, 94110, 23, kitchen_appliance, 60)
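The deserialize-then-enrich step can be sketched in plain Python. This is an illustration only: the slides do not specify the wire format, so a comma-separated UTF-8 message is assumed, and the lookup tables `HOUSE_ZIP` and `DEVICE_TYPE` are hypothetical stand-ins for whatever reference data the real pipeline joins against.

```python
from typing import NamedTuple


class RawEvent(NamedTuple):
    time: str
    house_id: int
    device_id: int
    watts: int


class EnrichedEvent(NamedTuple):
    time: str
    house_id: int
    zip_code: str
    device_id: int
    device_type: str
    watts: int


# Hypothetical reference data, seeded with the values shown on the slide.
HOUSE_ZIP = {329280: "94110"}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize(message: bytes) -> RawEvent:
    """Decode an assumed CSV-encoded message into a typed event."""
    time, house_id, device_id, watts = message.decode("utf-8").split(",")
    return RawEvent(time.strip(), int(house_id), int(device_id), int(watts))


def enrich(event: RawEvent) -> EnrichedEvent:
    """Add the household zip code and the device type to a raw event."""
    return EnrichedEvent(
        event.time,
        event.house_id,
        HOUSE_ZIP[event.house_id],
        event.device_id,
        DEVICE_TYPE[event.device_id],
        event.watts,
    )


raw = deserialize(b"2015-07-06T16:43:40.33Z, 329280, 23, 60")
print(enrich(raw))
```

In the pipeline described on the slides this logic would run inside Spark over each batch of Kafka messages; the functions here only show the per-event transformation.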
Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...
time | house_id | zip | device_id | device_type | watts
2015-07-06T16:43:40.33Z | 329280 | 94110 | 23 | kitchen_appliance | 60
Go to Production
Compress development timelines
SELECT ... FROM memcity_table ...
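The persist and query steps can be tried locally with a stand-in database. The sketch below uses Python's built-in `sqlite3` in place of MemSQL, so it illustrates only the table shape and the SQL round trip, not MemSQL's distributed behavior or the `saveToMemSQL()` connector call. The aggregate query is a hypothetical example; the slides elide the actual statements.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for MemSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE memcity_table (
        time TEXT,
        house_id INTEGER,
        zip TEXT,
        device_id INTEGER,
        device_type TEXT,
        watts INTEGER
    )
    """
)

# The enriched event shown on the slide.
rows = [("2015-07-06T16:43:40.33Z", 329280, "94110", 23, "kitchen_appliance", 60)]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()


def watts_by_device_type(connection: sqlite3.Connection) -> list:
    """Hypothetical analytic query: total watts per device type."""
    cursor = connection.execute(
        "SELECT device_type, SUM(watts) FROM memcity_table "
        "GROUP BY device_type ORDER BY device_type"
    )
    return cursor.fetchall()


print(watts_by_device_type(conn))
```

Because the data lands in a relational table, the application can issue ordinary SQL against it as soon as rows arrive.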
We can use In-Memory technology to build interactive applications for Cities.
So We Can Optimize
Urban planning
Efficient power consumption
Efficient transportation
Sustainable energy practices
Creating real-time pipelines should be push-button easy.
MemSQL Streamliner for IoT Applications
One-click deployment of integrated Apache Spark
Put Spark in the Fast Lane
GUI pipeline setup
Multiple data pipelines
Real-time transformation
Eliminates batch ETL
Open source on GitHub
Simple Deployment Process
[Diagram: Application, Cluster]
1. Deploy MemSQL
In-Memory | Distributed | Relational
2. Deploy Spark
Kafka Connects to Each Node
Streamliner Architecture
First of many integrated Apache Spark solutions
[Diagram labels: Other Real-Time Data Sources; Application; Apache Spark; STREAMLINER; Future Solution; Future Machine Learning Solution]
Streamliner ETL Detail
[Diagram labels: Other Real-Time Data Sources; Application; Apache Spark; STREAMLINER; Future Solution; Future Machine Learning Solution]
[STREAMLINER detail labels: Custom; Future Extractor; JSON; Custom; Future Transformer]
Extract, Transform, Load
Streamliner
Extract
Transform
Load
Extending Analytics with Lambda Architecture
Real-Time Analytics
Streaming
Analytic Applications, Not Excel Reports
Financial Services, Adtech, eCommerce, IoT, Consumer Internet, Energy, Federal
Lambda Architecture
New Real-Time Processing
Existing Batch Processing
Msg Queue
In-Memory Databases Rise Up
Multi-TB on commodity hardware
Store the state of the model
Easily build applications
Avoid direct disk at all cost
Market Guide for In-Memory DBMS, 2015 (October, Edjlali, Feinberg, Jain)
Comprehensive Architecture
Real Time: Speed/Streaming Layer, Fast Updates (Rowstore)
Historical: Batch Layer, Fast Appends (Columnstore)
Analytics, Transactions
Execution engine that spans the data spectrum
Simplified Lambda Architectures with MemSQL
Layer | Traditional Lambda | MemSQL Lambda
Batch | Hadoop | MemSQL Column Store
Speed | Storm, Spark | Kafka > Spark > MemSQL
Serving | Cassandra, HBase | MemSQL
Lambda Applies to Real-Time Data Pipelines
[Diagram labels: Message Queue; Batch; Inputs; Database; Transformation; Application]
Kafka, Spark, and MemSQL Make it Simple
[Diagram labels: Batch; Inputs; Application]
Massive Ingest and Concurrent Analytics
Instant accuracy to the latest repin
Build real-time analytic applications
1 GB/sec totaling 72 TB/day
Real-time analytics
Using Real-Time for Personalization
[Diagram labels: Ad Servers (EC2); Real-time analytics; PostgreSQL; Legacy reports; Monitoring; S3 (replay); HDFS; Data Science; Vertica; Star Schema; MicroStrategy]
Reach overlap and ad optimization
Over 60,000 queries per second
Millisecond response times
300k events/sec
Reduced Latency from 30 minutes to Sub-Second
Real-time Analytics
Sample Pipeline: Analyzing Twitter Data in Real Time
[Diagram labels: Public API; Garden Hose; Python; Apache Spark; SPARK STREAMLINER (Extract, Transform, Load); Application]
Install MemSQL and Apache Spark in < 1 min with MemSQL Ops and Streamliner
Run Kafka in Docker Container and Create a New Topic: TWITTER
Fill Out Extract, Transform and Load Details to Set Up Pipeline
Use Python Script to Load Tweets into Kafka Topic and Get Data Flowing
Connect to MemSQL Database and Run SQL Queries Instantly
Run Online Alter Table to Optimize Query Performance
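The tweet-loading step in the walkthrough above might look roughly like the following. The deck does not show its script, so this is a sketch under assumptions: tweets are plain dictionaries, each is serialized as JSON, and the `send` callable is injected so the function can be exercised without a running broker. `publish_tweets` and the sample tweet are hypothetical; only the topic name TWITTER comes from the slides.

```python
import json
from typing import Callable, Iterable


def publish_tweets(
    tweets: Iterable[dict],
    send: Callable[[str, bytes], object],
    topic: str = "TWITTER",
) -> int:
    """Serialize each tweet as JSON and hand it to `send`; return the count sent."""
    count = 0
    for tweet in tweets:
        send(topic, json.dumps(tweet).encode("utf-8"))
        count += 1
    return count


# Local dry run: collect messages in a list instead of sending them to Kafka.
outbox = []
sent = publish_tweets(
    [{"id": 1, "text": "hello"}],
    lambda topic, value: outbox.append((topic, value)),
)
print(sent, outbox)

# With the kafka-python package and a broker on localhost:9092 (both assumed):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   publish_tweets(tweets, producer.send)
#   producer.flush()
```

Injecting `send` keeps the serialization logic separate from the Kafka client, so either piece can be swapped without touching the other.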
Streamliner: Dynamic Resource Management
Without Streamliner
Pipeline 1: Driver (P1 only); Spark Worker with Executors (P1 only)
Pipeline 2: Driver (P2 only); Spark Worker with Executors (P2 only)
With Streamliner
All Pipelines: Streamliner Driver; Spark Workers with Executors (P1 or P2)
Building Real-Time Data Pipelines and Predictive Applications
Adding Real-Time Scoring to Predictive Applications
[Diagram labels: Streamliner; Input; User Jar; SAS Generated PMML]
Industrial Equipment Sensor Data
[Diagram labels: S1, S2, S3; P1, P2, P3]
Scoring Real-Time Data with Predictive Models
[Diagram labels: Sensor 1; Predictive Model 1]
GET YOUR FREE COPY: memsql.com/oreilly