Modeling the Smart and Connected City of the Future with Kafka and Spark

Eric Frenkiel, CEO & Co-Founder, MemSQL (@ericfrenkiel)
MAKE DATA WORK, DECEMBER 1-3, 2015, SINGAPORE


Page 2:

MemSQL at a Glance

Enterprise Focused

• Real-time database for transactions and analytics
• Founded in 2011, based in San Francisco
• Founders are former Facebook, SQL Server database engineers
• $50 million in funding to date

Our Mission: Make every company a real-time enterprise.

Page 3:

What does a Smart City Look Like?

Page 4:

Our Conception

Page 5:

Our Reality

Page 6:

3.9b people live in cities today

Page 7:

By 2050, we’ll add another 2.5b people

Page 8:

We need to create sustainable cities

Page 9:

We need to use technology to help us

Page 10:

We don’t live in Tomorrowland

Page 11:

We live here

Page 12:

The good news: the Technology of Today can build smart cities.

Page 13:

A Smart City Should Have…

• City-wide WiFi
• City App to report issues
• Open-Data Initiatives to share data with the public
• Most importantly, an adaptive IT department

Page 14:

Let’s learn how.

Page 15:

A Model Application: MemCity

• Capturing data from 1.4 million households
• Total AWS hardware costs at $2.35 per hour

Page 16:

MemCity Reach

1.4 million households (approximately the size of Chicago)

Page 17:

Capturing data from 8 devices in each home, every minute

#MemCity

Page 18:

186,667 transactions per second from Kafka > Spark > MemSQL

#MemCity

Page 19:

1.4 Million Households
8 Devices per Household
186K Events per Second
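The headline rate follows directly from the MemCity figures. A quick check, assuming each of the 8 devices in every household reports exactly once per minute (the variable names are illustrative):

```python
# MemCity figures quoted on the slides
households = 1_400_000        # approximately the size of Chicago
devices_per_household = 8     # devices reporting in each home
report_interval_seconds = 60  # each device reports every minute

events_per_minute = households * devices_per_household
events_per_second = events_per_minute / report_interval_seconds

print(f"{events_per_minute:,} events per minute")
print(f"{events_per_second:,.0f} events per second")
```

That is 11.2 million events per minute, or about 186,667 per second, which the slide rounds to 186K.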

Page 20:

The “Real-Time Trinity”

Page 21:

Designing the Ideal Real-Time Pipeline

Message Queue > Transformation > Speed/Serving Layer

End-to-End Data Pipeline Under One Second

Page 22:

Kafka

• A high-throughput distributed messaging system
• Publish and subscribe to Kafka “topics”
• Centralized data transport for the organization

Page 23:

Spark

• In-memory execution engine
• High level operators for procedural and programmatic analytics
• Faster than MapReduce

Page 24:

MemSQL

• In-memory, distributed database
• Full transactions and complete durability
• Enable real-time, performant applications

Page 25:

Subscribing to Kafka

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

0111001010101111101111100000001010111100001110101100000010010010111…

Publish to Kafka Topic

0111001010101111101111100000001010111100001110101100000010010010111…

1110010101000101010001010100010111111010100011110101100011010101000…

0101111000011100101010111110001111011010111100000000101110101100000…

Event added to message queue
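A minimal sketch of this step: serialize one reading into bytes and hand it to a message queue. The binary layout, the topic name `memcity`, and the `publish` helper are assumptions for illustration; the slides do not specify the wire format. A plain list stands in for Kafka so the sketch runs without a broker.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Hypothetical wire format: epoch millis (int64), house_id (uint32),
# device_id (uint16), watts (uint16), big-endian, no padding.
EVENT_FORMAT = ">qIHH"

def serialize_event(time_str, house_id, device_id, watts):
    """Pack one reading into a fixed-width binary message."""
    dt = datetime.strptime(time_str, "%Y-%m-%dT%H:%M:%S.%fZ")
    dt = dt.replace(tzinfo=timezone.utc)
    millis = (dt - EPOCH) // timedelta(milliseconds=1)
    return struct.pack(EVENT_FORMAT, millis, house_id, device_id, watts)

def publish(send, topic, event):
    """Serialize an event and pass it to send(topic, payload)."""
    payload = serialize_event(*event)
    send(topic, payload)
    return payload

# Stand-in for a Kafka producer: collect (topic, payload) pairs in a list.
# With a client such as kafka-python, send could wrap producer.send(topic, value).
queue = []
payload = publish(lambda topic, value: queue.append((topic, value)),
                  "memcity", ("2015-07-06T16:43:40.33Z", 329280, 23, 60))
print(len(payload), "bytes queued on", queue[0][0])
```

Passing `send` in as a callable keeps the serialization logic independent of any particular Kafka client.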

Page 26:

Enrich and Transform the Data

Spark polling Kafka for new messages

0111001010101111101111100000001010111100001110101100000010010010111…

Deserialization

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

Enrichment

(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
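The two transformation steps can be sketched as plain functions: deserialization turns bytes back into a tuple, and enrichment adds the zip code and device type. The binary layout and both lookup tables are assumptions made for illustration; in the pipeline on the slide, logic like this would run inside Spark on each message pulled from Kafka.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Assumed layout: epoch millis, house_id, device_id, watts (big-endian).
EVENT_FORMAT = ">qIHH"

# Hypothetical reference data used for enrichment.
ZIP_BY_HOUSE = {329280: 94110}
TYPE_BY_DEVICE = {23: "kitchen_appliance"}

def deserialize(payload):
    """Unpack a binary message into (time, house_id, device_id, watts)."""
    millis, house_id, device_id, watts = struct.unpack(EVENT_FORMAT, payload)
    return EPOCH + timedelta(milliseconds=millis), house_id, device_id, watts

def enrich(event):
    """Add zip and device_type; unknown ids map to None."""
    time, house_id, device_id, watts = event
    return (time, house_id, ZIP_BY_HOUSE.get(house_id), device_id,
            TYPE_BY_DEVICE.get(device_id), watts)

# Build one sample message matching the reading shown on the slide.
sample_time = datetime(2015, 7, 6, 16, 43, 40, 330000, tzinfo=timezone.utc)
sample_millis = (sample_time - EPOCH) // timedelta(milliseconds=1)
raw = struct.pack(EVENT_FORMAT, sample_millis, 329280, 23, 60)

row = enrich(deserialize(raw))
print(row)
```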

Page 27:

Persist and Prepare for Production

RDD.saveToMemSQL()

INSERT INTO memcity_table ...

time                     house_id  zip    device_id  device_type          watts
2015-07-06T16:43:40.33Z  329280    94110  23         ‘kitchen_appliance’  60
…                        …         …      …          …                    …
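A minimal sketch of the load step, with the table shape taken from the slide. Python's built-in `sqlite3` is used purely as a local stand-in so the example runs anywhere; the slide itself writes to MemSQL through `RDD.saveToMemSQL()`. The column types are assumptions.

```python
import sqlite3

# sqlite3 is only a local stand-in for the MemSQL cluster.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memcity_table (
        time TEXT, house_id INTEGER, zip INTEGER,
        device_id INTEGER, device_type TEXT, watts INTEGER
    )
""")

# One enriched row, as shown on the slide.
rows = [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
]

# Parameterized batch insert: one statement, many rows.
conn.executemany(
    "INSERT INTO memcity_table "
    "(time, house_id, zip, device_id, device_type, watts) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)
conn.commit()

stored = conn.execute("SELECT * FROM memcity_table").fetchall()
print(stored)
```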

Page 28:

Go to Production

Compress development timelines

SELECT ... FROM memcity_table ...
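The slide elides the query itself. As one possible example of what an application might ask, the sketch below totals power draw per zip code. `sqlite3` again stands in for MemSQL, and every row except the first is made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in for the MemSQL cluster
conn.execute(
    "CREATE TABLE memcity_table (time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)
# First row is the slide's example; the others are hypothetical readings.
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
    ("2015-07-06T16:43:41.10Z", 329281, 94110, 7, "lighting", 40),
    ("2015-07-06T16:43:41.90Z", 410022, 94107, 23, "kitchen_appliance", 75),
])

# Total watts and number of readings per zip code, highest draw first.
usage_by_zip = conn.execute("""
    SELECT zip, SUM(watts) AS total_watts, COUNT(*) AS readings
    FROM memcity_table
    GROUP BY zip
    ORDER BY total_watts DESC
""").fetchall()
print(usage_by_zip)
```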

Page 29:
Page 30:

We can use In-Memory technology to build interactive applications for Cities.

Page 31:

So We Can Optimize…

• Urban planning
• Efficient power consumption
• Efficient transportation
• Sustainable energy practices

Page 32:

Creating Real-Time Pipelines should be push-button easy.

Page 33:

MemSQL Streamliner for IoT Applications

• One click deployment of integrated Apache Spark
• Put Spark in the Fast Lane
  • GUI pipeline setup
  • Multiple data pipelines
  • Real-time transformation
• Eliminates batch ETL
• Open source on GitHub

Page 34:

Simple Deployment Process

Application

Page 35:

Cluster

1. Deploy MemSQL

In-Memory | Distributed | Relational

Application

Page 36:

Cluster

2. Deploy Spark

Application

Page 37:

Cluster

Kafka Connects to Each Node

Application

Page 38:

Streamliner Architecture

First of many integrated Apache Spark solutions

Other Real-Time Data Sources

Application

Apache Spark

Future Solution

Future Machine Learning Solution

STREAMLINER

Page 39:

Streamliner ETL Detail

Other Real-Time Data Sources

Application

Apache Spark

Future Solution

Future Machine Learning Solution

STREAMLINER

Custom

Future Extractor

JSON

Custom

Future Transformer

STREAMLINER

Extract Transform Load

Page 40:

Streamliner

Page 41:

Extract

Page 42:

Transform

Page 43:

Load

Page 44:

Extending Analytics with Lambda Architecture

Real-Time Analytics Streaming
Analytic Applications
Not Excel Reports

Financial Services, Adtech, eCommerce, IoT, Consumer Internet, Energy, Federal

Lambda Architecture

New Real-Time Processing

Existing Batch Processing

Msg Queue

Page 45:

In-Memory Databases Rise Up

• Multi-TB on commodity hardware
• Store the “state of the model”
• Easily build applications
• Avoid direct disk at all cost

Page 46:

Comprehensive Architecture

Transactions

Page 47:

Comprehensive Architecture

Real Time
Speed/Streaming Layer
Fast Updates
Rowstore

Transactions

Page 48:

Comprehensive Architecture

Real Time
Speed/Streaming Layer
Fast Updates
Rowstore

Analytics

Transactions

Page 49:

Comprehensive Architecture

Real Time
Speed/Streaming Layer
Fast Updates
Rowstore

Historical
Batch Layer
Fast Appends
Columnstore

Analytics

Transactions

Page 50:

Comprehensive Architecture

Real Time
Speed/Streaming Layer
Fast Updates
Rowstore

Historical
Batch Layer
Fast Appends
Columnstore

Analytics

Transactions

Execution engine that spans the data spectrum

Page 51:

Comprehensive Architecture

Real Time
Speed/Streaming Layer
Fast Updates
Rowstore

Historical
Batch Layer
Fast Appends
Columnstore

Analytics

Transactions

Page 52:

Simplified Lambda Architectures with MemSQL

Layer    Traditional Lambda  MemSQL Lambda
Batch    Hadoop              MemSQL Column Store
Speed    Storm, Spark        Kafka > Spark > MemSQL
Serving  Cassandra, HBase    MemSQL
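In a Lambda architecture, the serving layer answers a query by combining the batch layer's historical view with the speed layer's view of recent events. A minimal sketch of that merge, using made-up per-zip totals and a hypothetical `serve` helper:

```python
from collections import Counter

def serve(batch_view, speed_view):
    """Merge historical totals with real-time totals, key by key."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds values for shared keys
    return dict(merged)

# Hypothetical totals per zip code.
batch_view = {94110: 1200, 94107: 900}  # historical, from the batch layer
speed_view = {94110: 60, 94103: 15}     # recent events, from the speed layer

combined = serve(batch_view, speed_view)
print(combined)
```

When one database holds both the historical and the real-time data, as in the MemSQL column of the table, this merge can be expressed as a single query instead of application code.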

Page 53:

Lambda Applies to Real-Time Data Pipelines

Message Queue

Batch

Inputs Database Transformation Application

Page 54:

Kafka, Spark, and MemSQL Make it Simple

Batch

Inputs Application

Page 55:

Massive Ingest and Concurrent Analytics

• Instant accuracy to the latest repin
• Build real-time analytic applications
• 1 GB/sec totaling 72 TB/day

Real-time analytics

Page 56:

Using Real-Time for Personalization

Ad Servers
EC2
Real-time analytics
PostgreSQL (Legacy reports)
Monitoring
S3 (replay)
HDFS (Data Science)
Vertica (Star Schema)
MicroStrategy

• Reach overlap and ad optimization
• Over 60,000 queries per second
• Millisecond response times

Page 57:

300k events/sec

Reduced Latency from 30 minutes to Sub-Second

Real-time Analytics

Page 58:

Sample Pipeline: Analyzing Twitter Data in Real Time

Public API “Garden Hose”
Python
Extract Transform Load
Apache Spark
STREAMLINER
Application

Page 59:

Install MemSQL and Apache Spark in < 1min With MemSQL Ops and Streamliner

Page 60:

Run Kafka in Docker Container and Create a New Topic: TWITTER

Page 61:

Fill Out Extract, Transform and Load Details to Set Up Pipeline

Page 62:

Use Python Script to Load Tweets into Kafka Topic and Get Data Flowing
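The script itself is not shown on the slide. A minimal sketch of what such a loader might look like follows: it reads tweets as JSON lines and publishes each one to the TWITTER topic. The `load_tweets` helper and the sample tweets are hypothetical, and a list stands in for the Kafka producer so the sketch runs without a broker.

```python
import json

def load_tweets(lines, send, topic="TWITTER"):
    """Publish each non-empty JSON line as one message; return the count sent."""
    sent = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)  # fails loudly on malformed input
        send(topic, json.dumps(tweet).encode("utf-8"))
        sent += 1
    return sent

# Stand-in for a Kafka producer: collect (topic, payload) pairs in a list.
# With a client such as kafka-python, send could wrap producer.send(topic, value).
messages = []
sample = [
    '{"id": 1, "text": "hello #MemCity"}',
    "",
    '{"id": 2, "text": "smart cities"}',
]
count = load_tweets(sample, lambda topic, value: messages.append((topic, value)))
print(count, "tweets sent to", messages[0][0])
```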

Page 63:

Connect to MemSQL Database and Run SQL Queries Instantly

Page 64:

Run Online Alter Table to Optimize Query Performance

Page 65:

Streamliner: Dynamic Resource Management

Without Streamliner
• Pipeline 1: Driver (P1 only), Spark Worker, Executor (P1 only), Executor (P1 only)
• Pipeline 2: Driver (P2 only), Spark Worker, Executor (P2 only), Executor (P2 only)

With Streamliner
• All Pipelines: Streamliner Driver…, Spark Worker, Spark Worker, Executor (P1 or P2), Executor (P1 or P2), Executor (P1 or P2), Executor (P1 or P2)

Page 66:

Building Real-Time Data Pipelines and Predictive Applications

Page 67:

Adding Real-Time Scoring to Predictive Applications

Streamliner

Input

User Jar

SAS Generated PMML

Industrial Equipment

Sensor Data

S1 S2 S3 P1 P2 P3

Scoring Real-Time Data with Predictive Models

Sensor 1 Predictive Model 1
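A minimal sketch of per-sensor scoring, pairing each sensor (S1 to S3) with its own model (P1 to P3) as in the diagram legend. The three models below are invented placeholders; on the slide the models come from SAS-generated PMML, which this sketch does not parse.

```python
# Hypothetical models: each maps a sensor reading to a risk score.
MODELS = {
    "S1": lambda reading: min(1.0, reading / 100.0),    # stands in for P1
    "S2": lambda reading: 1.0 if reading > 80 else 0.0, # stands in for P2
    "S3": lambda reading: round(0.01 * reading, 2),     # stands in for P3
}

def score(event):
    """Score one (sensor_id, reading) event with that sensor's model."""
    sensor_id, reading = event
    model = MODELS[sensor_id]
    return sensor_id, reading, model(reading)

# Made-up readings arriving from industrial equipment.
stream = [("S1", 42.0), ("S2", 91.5), ("S3", 10.0)]
scored = [score(event) for event in stream]
print(scored)
```

Scoring each event as it arrives, rather than in a later batch job, is what makes the predictions available in real time.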

Page 68:

Page 69:

GET YOUR FREE COPY: memsql.com/oreilly