Modeling the Smart and Connected City of the Future with Kafka and Spark


Eric Frenkiel, CEO & Co-Founder, MemSQL

@ericfrenkiel

MAKE DATA WORK

DECEMBER 1-3, 2015, SINGAPORE

MemSQL at a Glance

Enterprise Focused

Real-time database for transactions and analytics

Founded in 2011, based in San Francisco

Founders are former Facebook, SQL Server database engineers

$50 million in funding to date

Our Mission:

Make every company a real-time enterprise.

What does a Smart City Look Like?

Our Conception

Our Reality

3.9b people live in cities today

Another 2.5 billion people will be added to the world's urban population by 2050, with nearly 90 per cent of the increase concentrated in Asia and Africa.

By 2050, we'll add another 2.5b people

We need to create sustainable cities

We need to use technology to help us

We don't live in Tomorrowland

We live here

The good news: the Technology of Today can build smart cities.


A Smart City Should Have

City-wide WiFi

City App to report issues

Open-Data Initiatives to share data with the public

Most importantly, an adaptive IT department

Let's learn how.

A Model Application: MemCity

Capturing data from 1.4 million households

Total AWS hardware costs at $2.35 per hour

MemCity Reach

1.4 million households (approximately the size of Chicago)

Capturing data from 8 devices in each home, every minute

#MemCity

186,667 transactions per second

from Kafka > Spark > MemSQL

1.4 Million Households

8 Devices per Household

186K Events per Second
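The throughput figure follows from the household and device counts, assuming each device reports exactly once per minute. The short check below reproduces it and also derives a cost per million events from the quoted $2.35 per hour. The variable names are illustrative and do not come from the talk.

```python
# Inputs quoted on the MemCity slides.
HOUSEHOLDS = 1_400_000
DEVICES_PER_HOUSEHOLD = 8
REPORT_INTERVAL_SECONDS = 60      # "every minute"
AWS_COST_PER_HOUR_USD = 2.35

# Every device sends one event per reporting interval.
events_per_minute = HOUSEHOLDS * DEVICES_PER_HOUSEHOLD
events_per_second = events_per_minute / REPORT_INTERVAL_SECONDS
events_per_hour = events_per_second * 3600

# Hardware cost spread over the events ingested in one hour.
cost_per_million_events = AWS_COST_PER_HOUR_USD / (events_per_hour / 1_000_000)

print(f"{events_per_minute:,} events per minute")
print(f"{round(events_per_second):,} events per second")
print(f"${cost_per_million_events:.4f} per million events")
```

Rounding 186,666.67 up gives the 186,667 transactions per second shown on the slide.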

The Real-Time Trinity

Designing the Ideal Real-Time Pipeline

Message Queue

Transformation

Speed/Serving Layer

End-to-End Data Pipeline Under One Second

Kafka

A high-throughput distributed messaging system

Publish and subscribe to Kafka topics

Centralized data transport for the organization

Spark

In-memory execution engine

High level operators for procedural and programmatic analytics

Faster than MapReduce

MemSQL

In-memory, distributed database

Full transactions and complete durability

Enable real-time, performant applications

Subscribing to Kafka

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

Publish to Kafka Topic

0111001010101111101111100000001010111100001110101100000010010010111

Event added to message queue
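As a rough illustration of this step, the sketch below turns the sample event tuple into bytes and appends it to a queue. The binary layout (`EVENT_FORMAT`) is an assumption made for the example, not the encoding used in MemCity, and the `deque` is only a stand-in for a Kafka topic so the snippet runs without a broker.

```python
import struct
from collections import deque
from datetime import datetime, timedelta, timezone

# Assumed wire layout (not from the talk): big-endian int64 epoch milliseconds,
# uint32 house_id, uint16 device_id, uint16 watts.
EVENT_FORMAT = "!qIHH"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def serialize_event(timestamp: str, house_id: int, device_id: int, watts: int) -> bytes:
    """Pack one (time, house_id, device_id, watts) reading into bytes."""
    moment = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    moment = moment.replace(tzinfo=timezone.utc)
    epoch_ms = (moment - EPOCH) // timedelta(milliseconds=1)
    return struct.pack(EVENT_FORMAT, epoch_ms, house_id, device_id, watts)


# In-memory stand-in for a Kafka topic; a real producer would send the bytes
# to a broker instead.
memcity_topic = deque()

payload = serialize_event("2015-07-06T16:43:40.33Z", 329280, 23, 60)
memcity_topic.append(payload)
print(len(payload), "bytes queued")
```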

Enrich and Transform the Data

Spark polling Kafka for new messages

0111001010101111101111100000001010111100001110101100000010010010111

Deserialization

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

Enrichment

(2015-07-06T16:43:40.33Z, 329280, 94110, 23, kitchen_appliance, 60)
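The two transformation steps can be sketched in plain Python, outside Spark, as follows. The byte layout and the two lookup dictionaries (`HOUSE_ZIP`, `DEVICE_TYPE`) are hypothetical; in a real pipeline the reference data would come from wherever the household and device metadata live.

```python
import struct
from datetime import datetime, timedelta, timezone

# Assumed wire layout for the example: int64 epoch ms, uint32 house_id,
# uint16 device_id, uint16 watts.
EVENT_FORMAT = "!qIHH"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical reference data used for enrichment.
HOUSE_ZIP = {329280: "94110"}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize(payload: bytes):
    """Bytes -> (time, house_id, device_id, watts)."""
    epoch_ms, house_id, device_id, watts = struct.unpack(EVENT_FORMAT, payload)
    return EPOCH + timedelta(milliseconds=epoch_ms), house_id, device_id, watts


def enrich(event):
    """Add zip and device_type, giving the six-field record from the slide."""
    time, house_id, device_id, watts = event
    return (time, house_id, HOUSE_ZIP[house_id],
            device_id, DEVICE_TYPE[device_id], watts)


# Build a sample payload so the example is self-contained.
sample_time = datetime(2015, 7, 6, 16, 43, 40, 330000, tzinfo=timezone.utc)
payload = struct.pack(EVENT_FORMAT,
                      (sample_time - EPOCH) // timedelta(milliseconds=1),
                      329280, 23, 60)

enriched = enrich(deserialize(payload))
print(enriched)
```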

Persist and Prepare for Production

RDD.saveToMemSQL()

INSERT INTO memcity_table ...

time                     house_id  zip    device_id  device_type        watts
2015-07-06T16:43:40.33Z  329280    94110  23         kitchen_appliance  60
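To make the persisted row concrete, here is a minimal sketch of the insert. It uses Python's built-in `sqlite3` as a stand-in for MemSQL so it runs anywhere. The column types are assumptions, and the MemSQL Spark connector call on the slide is not reproduced here.

```python
import sqlite3

# SQLite in-memory database standing in for a MemSQL cluster (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE memcity_table (
        time        TEXT,
        house_id    INTEGER,
        zip         TEXT,
        device_id   INTEGER,
        device_type TEXT,
        watts       INTEGER
    )
    """
)

# The enriched sample record from the slide.
rows = [("2015-07-06T16:43:40.33Z", 329280, "94110", 23, "kitchen_appliance", 60)]

conn.executemany(
    "INSERT INTO memcity_table "
    "(time, house_id, zip, device_id, device_type, watts) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)
conn.commit()

stored = conn.execute("SELECT COUNT(*) FROM memcity_table").fetchone()[0]
print(stored, "row(s) stored")
```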

Go to Production

Compress development timelines

SELECT ... FROM memcity_table ...
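One query an application might run against this table is total power draw per zip code. The sketch below is an illustration only: it uses `sqlite3` in place of MemSQL, and every row except the first is hypothetical sample data invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MemSQL (assumption)
conn.execute(
    "CREATE TABLE memcity_table (time TEXT, house_id INTEGER, zip TEXT, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)
conn.executemany(
    "INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        # First row is the sample from the slides; the rest are hypothetical.
        ("2015-07-06T16:43:40.33Z", 329280, "94110", 23, "kitchen_appliance", 60),
        ("2015-07-06T16:43:41.00Z", 329281, "94110", 7, "hvac", 1200),
        ("2015-07-06T16:43:42.00Z", 329282, "94107", 23, "kitchen_appliance", 90),
    ],
)

# Total watts per zip code, highest consumption first.
usage_by_zip = conn.execute(
    "SELECT zip, SUM(watts) AS total_watts "
    "FROM memcity_table GROUP BY zip ORDER BY total_watts DESC"
).fetchall()
print(usage_by_zip)
```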

We can use In-Memory technology to build interactive applications for Cities.

So We Can Optimize

Urban planning

Efficient power consumption

Efficient transportation

Sustainable energy practices

Creating Real-Time Pipelines should be push-button easy.

MemSQL Streamliner for IoT Applications

One click deployment of integrated Apache Spark

Put Spark in the Fast Lane

GUI pipeline setup

Multiple data pipelines

Real-time transformation

Eliminates batch ETL

Open source on GitHub

Simple Deployment Process

[Diagram: Application, Cluster]

1. Deploy MemSQL

In-Memory | Distributed | Relational

2. Deploy Spark

Kafka Connects to Each Node

Streamliner Architecture

First of many integrated Apache Spark solutions

[Diagram: Other Real-Time Data Sources; Application; Apache Spark hosting Streamliner, Future Solution, Future Machine Learning Solution]

Streamliner ETL Detail

[Diagram: Other Real-Time Data Sources; Application; Apache Spark hosting Streamliner, Future Solution, Future Machine Learning Solution]

[Diagram: Streamliner stages Extract, Transform, Load, with component labels Custom, Future Extractor, JSON, Custom, Future Transformer]
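The extract, transform, load split can be pictured as three pluggable functions. The sketch below is a generic illustration of that idea and is not the Streamliner API. `run_pipeline`, `json_transformer`, and the list-based source and sink are all hypothetical names chosen for the example.

```python
import json
from typing import Callable, Iterable, List

# Each stage is a plain callable so that custom implementations can be swapped in.
Extractor = Callable[[], Iterable[bytes]]
Transformer = Callable[[bytes], dict]
Loader = Callable[[dict], None]


def run_pipeline(extract: Extractor, transform: Transformer, load: Loader) -> int:
    """Pull raw records, transform each one, hand it to the loader."""
    processed = 0
    for raw in extract():
        load(transform(raw))
        processed += 1
    return processed


def json_transformer(raw: bytes) -> dict:
    """Decode a UTF-8 JSON message into a dictionary of columns."""
    return json.loads(raw.decode("utf-8"))


# Hypothetical source and sink: a list of messages and a list of loaded rows.
source = [b'{"house_id": 329280, "device_id": 23, "watts": 60}']
sink: List[dict] = []

processed = run_pipeline(lambda: source, json_transformer, sink.append)
print(processed, sink)
```

Keeping the stages independent is what would let a custom extractor or transformer replace the default one without touching the rest of the pipeline.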

Streamliner

Extract

Transform

Load

Extending Analytics with Lambda Architecture

Real-Time Analytics: Streaming, Analytic Applications, Not Excel Reports

Financial Services, Adtech, eCommerce, IoT, Consumer Internet, Energy, Federal

Lambda Architecture: New Real-Time Processing, Existing Batch Processing, Msg Queue

In-Memory Databases Rise Up

Multi-TB on commodity hardware

Store the state of the model

Easily build applications

Avoid direct disk at all cost

Market Guide for In-Memory DBMS, 2015 (October, Edjlali, Feinberg, Jain)

Comprehensive Architecture

Real Time: Speed/Streaming Layer, Fast Updates, Rowstore

Historical: Batch Layer, Fast Appends, Columnstore

Analytics and Transactions

Execution engine that spans the data spectrum

Simplified Lambda Architectures with MemSQL

Layer    Traditional Lambda  MemSQL Lambda
Batch    Hadoop              MemSQL Column Store
Speed    Storm, Spark        Kafka > Spark > MemSQL
Serving  Cassandra, HBase    MemSQL

Lambda Applies to Real-Time Data Pipelines

[Diagram: Message Queue, Batch, Inputs, Database, Transformation, Application]

Kafka, Spark, and MemSQL Make it Simple

[Diagram: Batch, Inputs, Application]

Massive Ingest and Concurrent Analytics

Instant accuracy to the latest repin

Build real-time analytic applications

1 GB/sec totaling 72 TB/day

Real-time analytics

Using Real-Time for Personalization

Ad Servers EC2

Real-time analytics

[Diagram: PostgreSQL, Legacy reports, Monitoring, S3 (replay), HDFS, Data Science, Vertica, Star Schema, MicroStrategy]

Reach overlap and ad optimization

Over 60,000 queries per second

Millisecond response times

300k events/sec

Reduced Latency from 30 minutes to Sub-Second

Real-time Analytics

Sample Pipeline: Analyzing Twitter Data in Real Time

[Diagram: Public API Garden Hose; Python; Apache Spark with Spark Streamliner (Extract, Transform, Load); Application]

Install MemSQL and Apache Spark in < 1 min with MemSQL Ops and Streamliner

Run Kafka in Docker Container and Create a New Topic: TWITTER

Fill Out Extract, Transform and Load Details to Set Up Pipeline

Use Python Script to Load Tweets into Kafka Topic and Get Data Flowing

Connect to MemSQL Database and Run SQL Queries Instantly

Run Online Alter Table to Optimize Query Performance
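The tweet-loading step could look roughly like the sketch below. The deck does not include the actual script, so this is an assumption-laden outline: `load_tweets` and `sample_tweets` are hypothetical, and `send` is a placeholder for whatever publish call the chosen Kafka client provides. Here it is bound to a list so the example runs without a broker.

```python
import json
from typing import Callable, Iterable


def load_tweets(tweets: Iterable[dict],
                send: Callable[[str, bytes], object],
                topic: str = "TWITTER") -> int:
    """Serialize each tweet as UTF-8 JSON and publish it to the topic."""
    sent = 0
    for tweet in tweets:
        send(topic, json.dumps(tweet).encode("utf-8"))
        sent += 1
    return sent


# Stand-in for a Kafka producer: collect (topic, value) pairs in a list.
outbox = []

# Hypothetical sample data; a real script would read from the Twitter API.
sample_tweets = [{"id": 1, "text": "hello #MemCity"}]

sent = load_tweets(sample_tweets, lambda topic, value: outbox.append((topic, value)))
print(sent, "tweet(s) published to", outbox[0][0])
```

Passing the publish function in as a parameter keeps the serialization logic testable without a running Kafka container.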

Streamliner: Dynamic Resource Management

Without Streamliner: Pipeline 1 and Pipeline 2; Driver (P1 only), Driver (P2 only); Spark Workers with Executor (P1 only) and Executor (P2 only)

With Streamliner: All Pipelines; Streamliner Driver; Spark Workers with Executor (P1 or P2)

Building Real-Time Data Pipelines and Predictive Applications

Adding Real-Time Scoring to Predictive Applications

[Diagram: Streamliner; Input; User Jar; SAS Generated PMML; Industrial Equipment Sensor Data; sensors S1, S2, S3; predictive models P1, P2, P3]

Scoring Real-Time Data with Predictive Models

Sensor 1, Predictive Model 1

GET YOUR FREE COPY: memsql.com/oreilly