Modeling the Smart and Connected City of the Future with Kafka and Spark

Posted on 15-Apr-2017

Eric Frenkiel, CEO & Co-Founder, MemSQL (@ericfrenkiel)

MAKE DATA WORK
December 1-3, 2015, Singapore

MemSQL at a Glance

Enterprise Focused

• Real-time database for transactions and analytics
• Founded in 2011, based in San Francisco
• Founders are former Facebook and SQL Server database engineers
• $50 million in funding to date

Our Mission: Make every company a real-time enterprise.

What does a Smart City Look Like?

Our Conception

Our Reality

3.9b people live in cities today

By 2050, we’ll add another 2.5b people

We need to create sustainable cities

We need to use technology to help us

We don’t live in Tomorrowland

We live here

The good news: the Technology of Today can build smart cities.

A Smart City Should Have…

• City-wide WiFi
• City app to report issues
• Open-data initiatives to share data with the public
• Most importantly, an adaptive IT department

Let’s learn how.

A Model Application: MemCity

• Capturing data from 1.4 million households
• Total AWS hardware costs at $2.35 per hour

MemCity Reach

1.4 million households (approximately the size of Chicago)

Capturing data from 8 devices in each home, every minute

#MemCity

186,667 transactions per second, from Kafka to Spark to MemSQL

1.4 Million Households
8 Devices per Household
186K Events per Second
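The 186K figure follows directly from the household and device counts on the slide. A quick check in Python, as a minimal sketch (the variable names are illustrative and not from the talk):

```python
# Back-of-the-envelope check of the MemCity event rate quoted on the slide.
HOUSEHOLDS = 1_400_000          # households modeled
DEVICES_PER_HOUSEHOLD = 8       # devices reporting in each home
REPORT_INTERVAL_SECONDS = 60    # each device reports once per minute

events_per_minute = HOUSEHOLDS * DEVICES_PER_HOUSEHOLD
events_per_second = events_per_minute / REPORT_INTERVAL_SECONDS

print(f"{events_per_minute:,} events per minute")      # 11,200,000 events per minute
print(f"{events_per_second:,.0f} events per second")   # 186,667 events per second
```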

The “Real-Time Trinity”

Designing the Ideal Real-Time Pipeline

Message Queue > Transformation > Speed/Serving Layer

End-to-End Data Pipeline Under One Second

Kafka

• A high-throughput distributed messaging system
• Publish and subscribe to Kafka “topics”
• Centralized data transport for the organization

Spark

• In-memory execution engine
• High-level operators for procedural and programmatic analytics
• Faster than MapReduce

MemSQL

• In-memory, distributed database
• Full transactions and complete durability
• Enable real-time, performant applications

Subscribing to Kafka

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

0111001010101111101111100000001010111100001110101100000010010010111…

Publish to Kafka Topic

0111001010101111101111100000001010111100001110101100000010010010111…

1110010101000101010001010100010111111010100011110101100011010101000…

0101111000011100101010111110001111011010111100000000101110101100000…

Event added to message queue

Enrich and Transform the Data

Spark polling Kafka for new messages

0111001010101111101111100000001010111100001110101100000010010010111…

Deserialization:

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

Enrichment:

(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
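The deserialization and enrichment steps on this slide can be sketched as two plain Python functions that a streaming job might apply to each message. The talk does not specify the wire format or the lookup data, so the byte layout (`EVENT_FORMAT`) and the `HOUSE_ZIP` and `DEVICE_TYPE` tables below are assumptions made only for illustration.

```python
import struct
from datetime import datetime, timedelta, timezone

# Hypothetical wire format (not from the talk): little-endian int64 epoch
# milliseconds, uint32 house_id, uint16 device_id, uint16 watts.
EVENT_FORMAT = "<qIHH"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical reference data used for enrichment.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}

def serialize(ts, house_id, device_id, watts):
    """Pack an event into bytes, as a device-side publisher might."""
    millis = (ts - EPOCH) // timedelta(milliseconds=1)
    return struct.pack(EVENT_FORMAT, millis, house_id, device_id, watts)

def deserialize(payload):
    """Unpack bytes into a (time, house_id, device_id, watts) tuple."""
    millis, house_id, device_id, watts = struct.unpack(EVENT_FORMAT, payload)
    return (EPOCH + timedelta(milliseconds=millis), house_id, device_id, watts)

def enrich(event):
    """Add zip code and device type looked up from the reference tables."""
    ts, house_id, device_id, watts = event
    return (ts, house_id, HOUSE_ZIP.get(house_id), device_id,
            DEVICE_TYPE.get(device_id, "unknown"), watts)

raw = serialize(datetime(2015, 7, 6, 16, 43, 40, 330000, tzinfo=timezone.utc),
                329280, 23, 60)
enriched = enrich(deserialize(raw))
print(enriched)
```

Keeping deserialization and enrichment as separate pure functions makes each step easy to test on its own before wiring it into a pipeline.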

Persist and Prepare for Production

RDD.saveToMemSQL()

INSERT INTO memcity_table ...

time                      house_id  zip    device_id  device_type          watts
2015-07-06T16:43:40.33Z   329280    94110  23         ‘kitchen_appliance’  60
…                         …         …      …          …                    …

Go to Production

Compress development timelines

SELECT ... FROM memcity_table ...
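The slides elide the actual INSERT and SELECT statements, so the following is only a sketch of the persist-then-query pattern. Python's built-in `sqlite3` module stands in for MemSQL so the example runs anywhere; the column names come from the table shown earlier, while the column types, the extra sample rows, and the aggregate query are assumptions. Against a real cluster, a MySQL-compatible client would take the place of `sqlite3`, since MemSQL speaks the MySQL wire protocol.

```python
import sqlite3

# sqlite3 is a stand-in for MemSQL here, used only to keep the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memcity_table (
           time TEXT, house_id INTEGER, zip INTEGER,
           device_id INTEGER, device_type TEXT, watts INTEGER)"""
)

rows = [
    # First row mirrors the enriched event from the slides.
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
    # The rows below are made-up sample data for illustration.
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 24, "lighting", 15),
    ("2015-07-06T16:43:41.10Z", 329281, 94110, 23, "kitchen_appliance", 75),
]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)

# Hypothetical analytical query: total power draw per device type in one zip code.
query = """SELECT device_type, SUM(watts) AS total_watts
           FROM memcity_table
           WHERE zip = ?
           GROUP BY device_type
           ORDER BY total_watts DESC"""
totals = conn.execute(query, (94110,)).fetchall()
print(totals)  # [('kitchen_appliance', 135), ('lighting', 15)]
```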

We can use in-memory technology to build interactive applications for cities.

So We Can Optimize…

• Urban planning
• Efficient power consumption
• Efficient transportation
• Sustainable energy practices

Creating real-time pipelines should be push-button easy.

MemSQL Streamliner for IoT Applications

• One-click deployment of integrated Apache Spark
• Put Spark in the Fast Lane
  • GUI pipeline setup
  • Multiple data pipelines
  • Real-time transformation
• Eliminates batch ETL
• Open source on GitHub

Simple Deployment Process

1. Deploy MemSQL: In-Memory | Distributed | Relational

2. Deploy Spark

Kafka Connects to Each Node

(Diagram labels on each step: Application, Cluster)

Streamliner Architecture

First of many integrated Apache Spark solutions

(Diagram labels: Other Real-Time Data Sources, Application, Apache Spark, STREAMLINER, Future Solution, Future Machine Learning Solution)

Streamliner ETL Detail

(Diagram labels: Other Real-Time Data Sources, Application, Apache Spark, STREAMLINER, Future Solution, Future Machine Learning Solution)

STREAMLINER stages: Extract, Transform, Load

(Component labels: Custom, Future Extractor, JSON, Custom, Future Transformer)

Streamliner

Extract

Transform

Load

Extending Analytics with Lambda Architecture

Real-Time Analytics
Streaming
Analytic Applications, Not Excel Reports

Financial Services | Adtech | eCommerce | IoT | Consumer Internet | Energy | Federal

Lambda Architecture

New Real-Time Processing

Existing Batch Processing

Msg Queue

In-Memory Databases Rise Up

• Multi-TB on commodity hardware
• Store the “state of the model”
• Easily build applications
• Avoid direct disk at all costs

Comprehensive Architecture

• Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
• Historical: Batch Layer, Fast Appends, Columnstore
• Transactions and Analytics
• Execution engine that spans the data spectrum

Simplified Lambda Architectures with MemSQL

Layer    Traditional Lambda  MemSQL Lambda
Batch    Hadoop              MemSQL Column Store
Speed    Storm, Spark        Kafka > Spark > MemSQL
Serving  Cassandra, HBase    MemSQL

Lambda Applies to Real-Time Data Pipelines

(Diagram labels: Inputs, Message Queue, Transformation, Database, Application, Batch)

Kafka, Spark, and MemSQL Make it Simple

(Diagram labels: Inputs, Batch, Application)

Massive Ingest and Concurrent Analytics

• Instant accuracy to the latest repin
• Build real-time analytic applications
• 1 GB/sec totaling 72 TB/day

Real-time analytics

Using Real-Time for Personalization

(Diagram labels: Ad Servers, EC2, Real-time analytics, PostgreSQL, Legacy reports, Monitoring, S3 (replay), HDFS, Data Science, Vertica, Star Schema, MicroStrategy)

• Reach overlap and ad optimization
• Over 60,000 queries per second
• Millisecond response times

300k events/sec

Reduced Latency from 30 minutes to Sub-Second

Real-time Analytics

Sample Pipeline: Analyzing Twitter Data in Real Time

(Diagram labels: Public API “Garden Hose”, Python, Apache Spark, SPARK STREAMLINER (Extract, Transform, Load), Application)

Install MemSQL and Apache Spark in < 1min With MemSQL Ops and Streamliner

Run Kafka in Docker Container and Create a New Topic: TWITTER

Fill Out Extract, Transform and Load Details to Set Up Pipeline

Use Python Script to Load Tweets into Kafka Topic and Get Data Flowing
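The script used in the talk is not shown, so here is a minimal sketch of what loading tweets into the TWITTER topic might look like. The producer is passed in as a plain `send(topic, value)` callable so the example runs without a broker; the sample tweets, function names, and broker address are all assumptions.

```python
import json

TOPIC = "TWITTER"  # topic name from the earlier slide

def encode_tweet(tweet):
    """Serialize a tweet dict to UTF-8 JSON bytes for use as a message value."""
    return json.dumps(tweet, separators=(",", ":")).encode("utf-8")

def publish_tweets(tweets, send):
    """Send each tweet to the topic via send(topic, value); return the count."""
    count = 0
    for tweet in tweets:
        send(TOPIC, encode_tweet(tweet))
        count += 1
    return count

# Stand-in for a real producer so the sketch runs without Kafka.
# With the kafka-python package, one could instead pass
#   lambda topic, value: producer.send(topic, value=value)
# where producer = KafkaProducer(bootstrap_servers="localhost:9092")
# (the broker address is an assumption).
sent = []
sample = [{"id": 1, "text": "hello #MemCity"}, {"id": 2, "text": "real-time!"}]
n = publish_tweets(sample, lambda topic, value: sent.append((topic, value)))
print(n, sent[0])
```

Injecting the `send` callable keeps the serialization logic testable on its own and leaves the choice of Kafka client library open.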

Connect to MemSQL Database and Run SQL Queries Instantly

Run Online Alter Table to Optimize Query Performance

Streamliner: Dynamic Resource Management

Without Streamliner:
• Pipeline 1: Driver (P1 only), two Executors (P1 only)
• Pipeline 2: Driver (P2 only), two Executors (P2 only)
• Two Spark Workers

With Streamliner:
• All Pipelines: Streamliner Driver…
• Two Spark Workers, four Executors (P1 or P2)

Building Real-Time Data Pipelines and Predictive Applications

Adding Real-Time Scoring to Predictive Applications

(Diagram labels: Streamliner, Input, User Jar, SAS Generated PMML, Industrial Equipment, Sensor Data)

Scoring Real-Time Data with Predictive Models

(Diagram labels: S1, S2, S3, P1, P2, P3; legend: Sensor 1, Predictive Model 1)

GET YOUR FREE COPY: memsql.com/oreilly