Modeling the Smart and Connected City of the Future with Kafka and Spark

Eric Frenkiel, CEO & Co-Founder, MemSQL (@ericfrenkiel)

MAKE DATA WORK | DECEMBER 1-3, 2015 | SINGAPORE

MemSQL at a Glance

Enterprise Focused

Real-time database for transactions and analytics

Founded in 2011, based in San Francisco

Founders are former Facebook, SQL Server database engineers

$50 million in funding to date

Our Mission: Make every company a real-time enterprise.

What does a Smart City Look Like?

Our Conception

Our Reality

3.9b people live in cities today

By 2050, we’ll add another 2.5b people

We need to create sustainable cities

We need to use technology to help us

We don’t live in Tomorrowland

We live here

The good news: the Technology of Today can build smart cities.

A Smart City Should Have…

City-wide WiFi

City App to report issues

Open-Data Initiatives to share data with the public

Most importantly, an adaptive IT department

Let’s learn how.

A Model Application: MemCity

Capturing data from 1.4 million households

Total AWS hardware costs at $2.35 per hour

MemCity Reach

1.4 million households (approximately the size of Chicago)

Capturing data from 8 devices in each home, every minute

#MemCity

186,667 transactions per second from Kafka > Spark > MemSQL

1.4 Million Households

8 Devices per Household

186K Events per Second
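
The headline rate follows from the three figures above. A quick check, assuming every device reports exactly once per minute and the load is spread evenly over that minute (the constant and function names are illustrative, not from the talk):

```python
# Back-of-the-envelope check of the MemCity event rate.
HOUSEHOLDS = 1_400_000          # households modeled in MemCity
DEVICES_PER_HOUSEHOLD = 8       # devices reporting in each home
REPORT_INTERVAL_SECONDS = 60    # each device reports once per minute


def events_per_second(households: int, devices: int, interval_seconds: int) -> float:
    """Average events per second if reports are spread evenly over the interval."""
    return households * devices / interval_seconds


rate = events_per_second(HOUSEHOLDS, DEVICES_PER_HOUSEHOLD, REPORT_INTERVAL_SECONDS)
print(f"{rate:,.0f} events per second")  # prints "186,667 events per second"
```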

The “Real-Time Trinity”

Designing the Ideal Real-Time Pipeline

Message Queue > Transformation > Speed/Serving Layer

End-to-End Data Pipeline Under One Second

Kafka

A high-throughput distributed messaging system

Publish and subscribe to Kafka “topics”

Centralized data transport for the organization

Spark

In-memory execution engine

High level operators for procedural and programmatic analytics

Faster than MapReduce

MemSQL

In-memory, distributed database

Full transactions and complete durability

Enable real-time, performant applications

Subscribing to Kafka

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

0111001010101111101111100000001010111100001110101100000010010010111…

Publish to Kafka Topic

0111001010101111101111100000001010111100001110101100000010010010111…

1110010101000101010001010100010111111010100011110101100011010101000…

0101111000011100101010111110001111011010111100000000101110101100000…

Event added to message queue
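
The step above (a device reading turned into bytes and published to a topic) might look roughly like the sketch below. The talk does not give the wire format, so the 16-byte layout, the topic name `memcity_events`, and the helper names are all assumptions. A plain list stands in for the broker so the sketch runs without Kafka installed.

```python
import struct
from datetime import datetime, timedelta, timezone

# Hypothetical wire format (not from the talk), big-endian, 16 bytes:
# epoch milliseconds (int64), house_id (uint32), device_id (uint16), watts (uint16)
EVENT_FORMAT = ">qIHH"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def serialize_event(timestamp: str, house_id: int, device_id: int, watts: int) -> bytes:
    """Pack one device reading into a fixed-width binary message."""
    moment = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    millis = (moment - EPOCH) // timedelta(milliseconds=1)  # exact integer arithmetic
    return struct.pack(EVENT_FORMAT, millis, house_id, device_id, watts)


def publish(send, topic: str, payload: bytes) -> None:
    """Hand the payload to any callable shaped like send(topic, payload).

    With the kafka-python client, the bound method KafkaProducer(...).send
    could be passed here (an assumption; the talk does not name a client).
    """
    send(topic, payload)


# Stand-in for a broker: messages are appended to a list.
queue = []
payload = serialize_event("2015-07-06T16:43:40.33Z", 329280, 23, 60)
publish(lambda topic, value: queue.append((topic, value)), "memcity_events", payload)
print(len(payload), queue[0][0])  # prints "16 memcity_events"
```

A fixed-width layout keeps each message small at 186K events per second; a self-describing format such as JSON would trade size for flexibility.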

Enrich and Transform the Data

Spark polling Kafka for new messages

0111001010101111101111100000001010111100001110101100000010010010111…

Deserialization

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

Enrichment

(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
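
A minimal sketch of the two transformation steps, deserialization and enrichment, written as plain Python functions. The binary layout and the two lookup dictionaries are assumptions made for illustration; only the field values come from the slide.

```python
import struct
from datetime import datetime, timedelta, timezone

EVENT_FORMAT = ">qIHH"  # hypothetical layout: epoch millis, house_id, device_id, watts
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical reference data; a real pipeline would load these from a table.
ZIP_BY_HOUSE = {329280: 94110}
TYPE_BY_DEVICE = {23: "kitchen_appliance"}


def deserialize(payload: bytes):
    """Unpack bytes into (time, house_id, device_id, watts)."""
    millis, house_id, device_id, watts = struct.unpack(EVENT_FORMAT, payload)
    moment = EPOCH + timedelta(milliseconds=millis)
    stamp = moment.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis % 1000:03d}Z"
    return stamp, house_id, device_id, watts


def enrich(event):
    """Add zip and device_type, giving the six-field row shown on the slide."""
    stamp, house_id, device_id, watts = event
    return (stamp, house_id, ZIP_BY_HOUSE.get(house_id), device_id,
            TYPE_BY_DEVICE.get(device_id), watts)


# Build the sample message from the slide, then run it through both steps.
sample_time = datetime(2015, 7, 6, 16, 43, 40, 330000, tzinfo=timezone.utc)
sample_millis = (sample_time - EPOCH) // timedelta(milliseconds=1)
payload = struct.pack(EVENT_FORMAT, sample_millis, 329280, 23, 60)
row = enrich(deserialize(payload))
print(row)
```

In a Spark job, functions like these would typically be applied to each message with `map`; here they run on a single payload so the sketch stays self-contained.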

Persist and Prepare for Production

RDD.saveToMemSQL()

INSERT INTO memcity_table ...

time                    house_id  zip    device_id  device_type          watts
2015-07-06T16:43:40.33  329280    94110  23         ‘kitchen_appliance’  60
…                       …         …      …          …                    …

Go to Production

Compress development timelines

SELECT ... FROM memcity_table ...
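
The persist-then-query flow can be tried end to end with a small stand-in. The sketch below uses Python's built-in `sqlite3` in place of MemSQL purely so it runs anywhere; the column types, the two extra sample rows, and the aggregate query are assumptions, since the slides elide both SQL statements.

```python
import sqlite3

# sqlite3 stands in for MemSQL here so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memcity_table (
           time TEXT, house_id INTEGER, zip INTEGER,
           device_id INTEGER, device_type TEXT, watts INTEGER)"""
)

rows = [
    ("2015-07-06T16:43:40.33", 329280, 94110, 23, "kitchen_appliance", 60),  # row from the slide
    ("2015-07-06T16:43:41.10", 329281, 94110, 7, "lighting", 15),            # made-up sample
    ("2015-07-06T16:43:41.52", 410022, 94607, 23, "kitchen_appliance", 90),  # made-up sample
]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)

# One possible application query: total power draw per zip code.
totals = conn.execute(
    "SELECT zip, SUM(watts) FROM memcity_table GROUP BY zip ORDER BY zip"
).fetchall()
print(totals)  # prints "[(94110, 75), (94607, 90)]"
```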

We can use In-Memory technology to build interactive applications for Cities.

So We Can Optimize…

Urban planning

Efficient power consumption

Efficient transportation

Sustainable energy practices

Creating Real-Time Pipelines should be push button

MemSQL Streamliner for IoT Applications

One click deployment of integrated Apache Spark

Put Spark in the Fast Lane
• GUI pipeline setup
• Multiple data pipelines
• Real-time transformation

Eliminates batch ETL

Open source on GitHub

Simple Deployment Process

1. Deploy MemSQL

In-Memory | Distributed | Relational

2. Deploy Spark

Kafka Connects to Each Node

[Diagram on each step: Application, Cluster]

Streamliner Architecture

First of many integrated Apache Spark solutions

[Diagram: Other Real-Time Data Sources, Application, Apache Spark, STREAMLINER, Future Solution, Future Machine Learning Solution]

Streamliner ETL Detail

[Diagram: as above, with STREAMLINER expanded into Extract, Transform, Load stages; labels include Custom and Future Extractor, Custom and Future Transformer]

Extending Analytics with Lambda Architecture

Real-Time Analytics

Streaming

Analytic Applications, Not Excel Reports

Financial Services, Adtech, eCommerce, IoT, Consumer Internet, Energy, Federal

Lambda Architecture

New Real-Time Processing

Existing Batch Processing

Msg Queue

In-Memory Databases Rise Up

Multi-TB on commodity hardware

Store the “state of the model”

Easily build applications

Avoid direct disk access at all costs

Comprehensive Architecture

Real Time (Speed/Streaming Layer): Fast Updates. Rowstore: Tran[sactions], Analytics

Historical (Batch Layer): Fast Appends. Columnstore: Analytics

Execution engine that spans the data spectrum

Simplified Lambda Architectures with MemSQL

Layer     Traditional Lambda    MemSQL Lambda
Batch     Hadoop                MemSQL Column Store
Speed     Storm, Spark          Kafka > Spark > MemSQL
Serving   Cassandra, HBase      MemSQL
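
When the batch, speed, and serving layers share one SQL engine, a single query can span recent and historical data. The sketch below illustrates that idea only: two ordinary `sqlite3` tables stand in for a rowstore and a columnstore table, and the table names, columns, and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two ordinary SQLite tables stand in for a rowstore and a columnstore table.
conn.execute("CREATE TABLE events_recent (day TEXT, zip INTEGER, watts INTEGER)")
conn.execute("CREATE TABLE events_history (day TEXT, zip INTEGER, watts INTEGER)")

conn.executemany("INSERT INTO events_recent VALUES (?, ?, ?)",
                 [("2015-07-06", 94110, 60), ("2015-07-06", 94110, 15)])
conn.executemany("INSERT INTO events_history VALUES (?, ?, ?)",
                 [("2015-07-05", 94110, 120), ("2015-07-04", 94110, 80)])

# A single serving query covers both the speed and the batch layer.
query = """
    SELECT zip, SUM(watts) FROM (
        SELECT zip, watts FROM events_recent
        UNION ALL
        SELECT zip, watts FROM events_history
    ) AS all_events
    GROUP BY zip
"""
combined = conn.execute(query).fetchall()
print(combined)  # prints "[(94110, 275)]"
```

In a traditional Lambda stack the same result would need a merge step across separate batch and speed systems.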

Lambda Applies to Real-Time Data Pipelines

Inputs > Message Queue > Transformation > Database > Application

Kafka, Spark, and MemSQL Make it Simple

Inputs Application

Massive Ingest and Concurrent Analytics

Instant accuracy to the latest repin

Build real-time analytic applications

1 GB/sec totaling 72 TB/day

Real-time analytics

Using Real-Time for Personalization

[Diagram: Ad Servers, EC2, Real-time analytics, PostgreSQL (Legacy reports), Monitoring, S3 (replay), HDFS (Data Science), Vertica (Star Schema), MicroStrategy]

Reach overlap and ad optimization

Over 60,000 queries per second

Millisecond response times

300k events/sec

Reduced Latency from 30 minutes to Sub-Second

Real-time Analytics

Sample Pipeline: Analyzing Twitter Data in Real Time

[Diagram: Public API “Garden Hose”, Python, Spark Streamliner (Extract, Transform, Load), Apache Spark, Application]

Install MemSQL and Apache Spark in < 1min With MemSQL Ops and Streamliner

Run Kafka in Docker Container and Create a New Topic: TWITTER

Fill Out Extract, Transform and Load Details to Set Up Pipeline

Use Python Script to Load Tweets into Kafka Topic and Get Data Flowing

Connect to MemSQL Database and Run SQL Queries Instantly

Run Online Alter Table to Optimize Query Performance

Streamliner: Dynamic Resource Management

Without Streamliner

Pipeline 1: Driver (P1 only), Spark Worker with Executor (P1 only)

Pipeline 2: Driver (P2 only), Spark Worker with Executor (P2 only)

With Streamliner

All Pipelines: Streamliner Driver, Spark Workers with Executor (P1 or P2)

Building Real-Time Data Pipelines and Predictive Applications

Adding Real-Time Scoring to Predictive Applications

[Diagram: Industrial Equipment, Sensor Data (S1, S2, S3), Input, Streamliner, User Jar, SAS Generated PMML, P1, P2, P3]

Scoring Real-Time Data with Predictive Models

Sensor 1 Predictive Model 1
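
The pairing of each sensor with its own predictive model can be sketched as a simple lookup applied to every incoming reading. The two scoring functions below are hypothetical stand-ins, not models derived from PMML, and the sensor IDs and readings are made up.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical stand-ins for models that would be loaded from PMML.
# Each takes one sensor reading and returns a score between 0 and 1.
MODELS: Dict[str, Callable[[float], float]] = {
    "S1": lambda reading: min(1.0, reading / 100.0),
    "S2": lambda reading: 1.0 if reading > 75.0 else 0.0,
}


def score_stream(readings: Iterable[Tuple[str, float]]):
    """Yield (sensor_id, reading, score) for each reading that has a model."""
    for sensor_id, reading in readings:
        model = MODELS.get(sensor_id)
        if model is None:
            continue  # no model registered for this sensor
        yield sensor_id, reading, model(reading)


scored = list(score_stream([("S1", 50.0), ("S2", 80.0), ("S9", 10.0)]))
print(scored)  # prints "[('S1', 50.0, 0.5), ('S2', 80.0, 1.0)]"
```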

GET YOUR FREE COPY: memsql.com/oreilly
