Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Eric Frenkiel, MemSQL CEO and co-founder • August 11, 2015 • San Diego, CA

Transcript of Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Page 1: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Eric Frenkiel, MemSQL CEO and co-founder

August 11, 2015 • San Diego, CA

Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Page 2: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

What’s In Store

MemSQL and a fresh look at Lambda architectures

Building real-time data pipelines for immediate impact

One architecture for many applications

Page 3: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

MemSQL at a Glance

• Enable every company to be a real-time enterprise

• Founded 2011, based in San Francisco

• Founders are ex-Facebook, SQL Server engineers

• Deliver a database technology for modern architecture

Enterprise Focus

Page 4: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

The Real-Time Database for Transactions and Analytics 

In-Memory Distributed Relational

Data Center Software Cloud

Page 5: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Speed

Serving

Batch Fast Updates

Unified queries, full SQL

Fast Appends

A Fresh Look at Lambda Architectures
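
The "unified queries, full SQL" point can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in for a SQL database. The table names recent_events and historical_events and all rows are made up for illustration, and nothing here reflects how MemSQL implements its speed or batch layers.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one 'speed' and one 'batch' table.

    Table names and rows are hypothetical, chosen only for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE recent_events (device_id INTEGER, watts INTEGER)")
    conn.execute("CREATE TABLE historical_events (device_id INTEGER, watts INTEGER)")
    conn.executemany("INSERT INTO recent_events VALUES (?, ?)", [(23, 60), (24, 15)])
    conn.executemany(
        "INSERT INTO historical_events VALUES (?, ?)", [(23, 55), (23, 65), (24, 10)]
    )
    return conn


def unified_total_watts(conn):
    """Run one SQL query that spans both the recent and the historical data."""
    query = """
        SELECT device_id, SUM(watts) AS total_watts
        FROM (
            SELECT device_id, watts FROM recent_events
            UNION ALL
            SELECT device_id, watts FROM historical_events
        )
        GROUP BY device_id
        ORDER BY device_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(unified_total_watts(build_demo_db()))
```

The application issues a single query and does not need to merge results from two systems itself, which is the simplification the slide points at.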

Page 6: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Comprehensive Architecture

Transactions

Page 7: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Comprehensive Architecture

Real Time • Speed/Streaming Layer • Fast Updates

Rowstore

Transactions

Page 8: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Comprehensive Architecture

Real Time • Speed/Streaming Layer • Fast Updates

Rowstore

Analytics

Transactions

Page 9: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Comprehensive Architecture

Real Time • Speed/Streaming Layer • Fast Updates

Rowstore

Historical • Batch Layer • Fast Appends

Columnstore

Analytics

Transactions

Page 10: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Comprehensive Architecture

Real Time • Speed/Streaming Layer • Fast Updates

Rowstore

Historical • Batch Layer • Fast Appends

Columnstore

Analytics

Transactions

Execution engine that spans the data spectrum

Page 11: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Comprehensive Architecture

Real Time • Speed/Streaming Layer • Fast Updates

Rowstore

Historical • Batch Layer • Fast Appends

Columnstore

Analytics

Transactions

Page 12: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Building Real-Time Data Pipelines for Immediate Impact

Page 13: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

By 2020, HP predicts that over a trillion sensors will be online

“The Internet of Things Will Drastically Change Our Future” – Datafloq

Page 14: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Going Real-Time is the Next Phase for Big Data

More Devices

More Interconnectivity

More User Demand

…and companies are at risk of being left behind

Page 15: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Expensive • Not scalable • Batch only • SAN-burdened

1%

Page 16: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Success will be driven by real-time analytic applications.

Page 17: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Designing the Ideal Real-Time Pipeline

Message Queue Transformation Speed/Serving Layer

End-to-End Data Pipeline Under One Second

Page 18: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

A high-throughput distributed messaging system

Publish and subscribe to Kafka “topics”

Centralized data transport for the organization

Kafka
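
The publish/subscribe idea behind topics can be sketched in a few lines. The class below is a toy in-memory broker, not the Kafka client API: the names InMemoryBroker, publish, and poll, and the topic name sensor_readings, are hypothetical and chosen only for illustration. It models just two ideas, an append-only log per topic and an independent read position per consumer group.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy publish/subscribe broker illustrating topics (not the Kafka API)."""

    def __init__(self):
        # topic name -> append-only list of messages
        self._logs = defaultdict(list)
        # (topic, consumer group) -> index of the next message to read
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        """Append a message to a topic and return its offset in the log."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, topic, group):
        """Return all messages this consumer group has not yet seen."""
        start = self._offsets[(topic, group)]
        messages = self._logs[topic][start:]
        self._offsets[(topic, group)] = len(self._logs[topic])
        return messages


if __name__ == "__main__":
    broker = InMemoryBroker()
    broker.publish("sensor_readings", b"reading-1")
    broker.publish("sensor_readings", b"reading-2")
    # Two consumer groups read the same topic independently.
    print(broker.poll("sensor_readings", "transformer"))
    print(broker.poll("sensor_readings", "audit"))
```

Because each group tracks its own position, several downstream systems can read the same stream without interfering with one another, which is what makes a message queue usable as centralized data transport.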

Page 19: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

In-memory execution engine

High level operators for procedural and programmatic analytics

Faster than MapReduce

Spark

Page 20: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

In-memory, distributed database

Full transactions and complete durability

Enable real-time, performant applications

MemSQL

Page 21: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Use Spark and Operational Databases Together

                        Spark                Operational Databases
Interface               Programmatic         Declarative
Execution Environment   Job Scheduler        SQL Engine and Query Optimizer
Persistent Storage      Use another system   Built-in

Page 22: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Subscribing to Kafka

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

0111001010101111101111100000001010111100001110101100000010010010111…

Publish to Kafka Topic

0111001010101111101111100000001010111100001110101100000010010010111…

1110010101000101010001010100010111111010100011110101100011010101000…

0101111000011100101010111110001111011010111100000000101110101100000…

Event added to message queue
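
As a rough illustration of how an event tuple becomes the bytes shown on the slide, the sketch below packs the sample event into a binary payload. The wire format is an assumption made for this example (a one-byte length prefix, the UTF-8 timestamp, then three big-endian integers); the slides do not say which serialization was used. The field names time, house_id, device_id, and watts are inferred from the table on the later "Persist and Prepare for Production" slide.

```python
import struct


def serialize_event(time, house_id, device_id, watts):
    """Pack one sensor event into bytes using an assumed, illustrative format."""
    time_bytes = time.encode("utf-8")
    header = struct.pack("!B", len(time_bytes))             # 1-byte timestamp length
    body = struct.pack("!IHH", house_id, device_id, watts)  # uint32, uint16, uint16
    return header + time_bytes + body


def to_bit_string(payload):
    """Render a payload as a string of 0s and 1s, as drawn on the slide."""
    return "".join(f"{byte:08b}" for byte in payload)


if __name__ == "__main__":
    payload = serialize_event("2015-07-06T16:43:40.33Z", 329280, 23, 60)
    print(len(payload), "bytes")
    print(to_bit_string(payload)[:64] + "…")
```

The resulting bytes are what a producer would hand to the message queue as the message value.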

Page 23: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Enrich and Transform the Data

Spark polling Kafka for new messages

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)

Deserialization

Enrichment

0111001010101111101111100000001010111100001110101100000010010010111…
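
The two steps named on this slide, deserialization and enrichment, can be sketched as plain functions. This is not Spark code: in the talk the work runs inside Spark, while here ordinary Python functions show only the per-record logic. The binary layout (length-prefixed timestamp followed by three big-endian integers) and the lookup tables HOUSE_ZIP and DEVICE_TYPE are assumptions for illustration; their contents are taken from the sample values on the slide.

```python
import struct

# Hypothetical reference data used for enrichment (values from the slide's example).
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize_event(payload):
    """Decode bytes into (time, house_id, device_id, watts) under the assumed format."""
    time_len = payload[0]
    time = payload[1:1 + time_len].decode("utf-8")
    house_id, device_id, watts = struct.unpack("!IHH", payload[1 + time_len:])
    return (time, house_id, device_id, watts)


def enrich_event(event):
    """Add zip and device_type by looking up house_id and device_id."""
    time, house_id, device_id, watts = event
    return (
        time,
        house_id,
        HOUSE_ZIP.get(house_id),      # None if the house is unknown
        device_id,
        DEVICE_TYPE.get(device_id),   # None if the device is unknown
        watts,
    )


if __name__ == "__main__":
    raw = bytes([23]) + b"2015-07-06T16:43:40.33Z" + struct.pack("!IHH", 329280, 23, 60)
    print(enrich_event(deserialize_event(raw)))
```

In a streaming job, functions like these would typically be applied to every message in each micro-batch before the results are written to the database.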

Page 24: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Persist and Prepare for Production

RDD.saveToMemSQL()

INSERT INTO memcity_table ...

time                      house_id   zip     device_id   device_type           watts
2015-07-06T16:43:40.33Z   329280     94110   23          ‘kitchen_appliance’   60
…                         …          …       …           …                     …
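
The persistence step amounts to turning each enriched tuple into an INSERT against memcity_table. The sketch below shows that shape using Python's built-in sqlite3 module as a stand-in database so it can run anywhere. It does not use the MemSQL Spark connector or reproduce what RDD.saveToMemSQL() does internally, and the column types are assumptions. The helper names create_table and save_rows are hypothetical.

```python
import sqlite3

# Column names as shown in the slide's table.
COLUMNS = ("time", "house_id", "zip", "device_id", "device_type", "watts")


def create_table(conn):
    """Create memcity_table with assumed column types."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memcity_table ("
        "time TEXT, house_id INTEGER, zip INTEGER, "
        "device_id INTEGER, device_type TEXT, watts INTEGER)"
    )


def save_rows(conn, rows):
    """Insert enriched tuples with a parameterized INSERT; returns the SQL used."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO memcity_table ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    with conn:  # commit on success, roll back on error
        conn.executemany(sql, rows)
    return sql


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_table(conn)
    save_rows(conn, [("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60)])
    print(conn.execute("SELECT * FROM memcity_table").fetchall())
```

Batching many rows into one executemany call mirrors the general idea of writing a whole micro-batch at once instead of one row per round trip.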

Page 25: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Go to Production

Compress development timelines

SELECT ... FROM memcity_table ...
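
The slide elides the actual query, so the example below is only one plausible kind of SELECT an application might run over memcity_table: average power draw per device type. It again uses sqlite3 as a stand-in database. The first row comes from the slide, the other two rows are made-up sample data, and the function name average_watts_by_type is hypothetical.

```python
import sqlite3


def build_demo_table():
    """Create memcity_table in memory and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memcity_table ("
        "time TEXT, house_id INTEGER, zip INTEGER, "
        "device_id INTEGER, device_type TEXT, watts INTEGER)"
    )
    rows = [
        ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
        # The rows below are invented for this sketch.
        ("2015-07-06T16:43:41.10Z", 329280, 94110, 23, "kitchen_appliance", 80),
        ("2015-07-06T16:43:41.50Z", 329281, 94110, 7, "lighting", 10),
    ]
    conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def average_watts_by_type(conn):
    """Return (device_type, reading count, average watts) per device type."""
    query = """
        SELECT device_type, COUNT(*) AS readings, AVG(watts) AS avg_watts
        FROM memcity_table
        GROUP BY device_type
        ORDER BY device_type
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in average_watts_by_type(build_demo_table()):
        print(row)
```

Because the data lands in a SQL table, the application layer needs only standard queries like this one, which is the sense in which the slide talks about compressing development timelines.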

Page 26: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

One Architecture for Many Applications

Page 27: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Lambda Applies to Real-Time Data Pipelines

Message Queue

Batch

Inputs Database Transformation Application

Page 28: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Kafka, Spark, and MemSQL Make it Simple

Batch

Inputs Application

Page 29: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Monitoring real-time Xfinity programming and video health

Page 30: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Collect streaming data at scale (hundreds of MemSQL machines)

Proactively diagnose issues

Query ad hoc and in real time with full SQL

From 30 minutes to less than 1 second

Real-time Analytics

Page 31: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Real-Time Trend Analytics

Page 32: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Massive Ingest and Concurrent Analytics

Instant accuracy to the latest repin

Build real-time analytic applications

Real-time analytics

Page 33: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Watch the Pinterest Demo Video here: https://youtu.be/KXelkQFVz4E

Page 34: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Real-Time Segmentation

Page 35: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Using Real-Time for Personalization

Ad Servers EC2

Real-time analytics

PostgreSQL

Legacy reports • Monitoring • S3 (replay)

HDFS

Data Science

Vertica

Operational Data Store (ODS)

Star Schema MicroStrategy

Reach overlap and ad optimization

Over 60,000 queries per second

Millisecond response times

Page 36: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Page 37: Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Thank You!

Visit MemSQL at Booth #518

Real-Time Demos • T-Shirt Giveaway • Games