Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Eric Frenkiel, MemSQL CEO and co-founder
August 11, 2015 • San Diego, CA
2
What’s In Store
MemSQL and a fresh look at Lambda architectures
Building real-time data pipelines for immediate impact
One architecture for many applications
3
MemSQL at a Glance
• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture
Enterprise Focus
4
The Real-Time Database for Transactions and Analytics
In-Memory, Distributed, Relational
Data Center, Software, Cloud
5
A Fresh Look at Lambda Architectures
Speed: Fast Updates
Serving: Unified queries, full SQL
Batch: Fast Appends
6
Comprehensive Architecture
Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
Historical: Batch Layer, Fast Appends, Columnstore
Transactions and Analytics
Execution engine that spans the data spectrum
12
Building Real-Time Data Pipelines for Immediate Impact
By 2020, HP predicts that over a trillion sensors will be online
“The Internet of Things Will Drastically Change Our Future” – Datafloq
Going Real-Time is the Next Phase for Big Data
More Devices
More Interconnectivity
More User Demand
…and companies are at risk of being left behind
Expensive
Not scalable
Batch only
SAN-burdened
1%
Success will be driven by real-time analytic applications.
17
Designing the Ideal Real-Time Pipeline
Message Queue → Transformation → Speed/Serving Layer
End-to-End Data Pipeline Under One Second
18
Kafka
A high-throughput distributed messaging system
Publish and subscribe to Kafka “topics”
Centralized data transport for the organization
19
Spark
In-memory execution engine
High-level operators for procedural and programmatic analytics
Faster than MapReduce
20
MemSQL
In-memory, distributed database
Full transactions and complete durability
Enable real-time, performant applications
21
Use Spark and Operational Databases Together
                       Spark               Operational Databases
Interface              Programmatic        Declarative
Execution Environment  Job Scheduler       SQL Engine and Query Optimizer
Persistent Storage     Use another system  Built-in
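The programmatic versus declarative contrast can be illustrated with a small sketch. It computes the same per-device-type average two ways: programmatically, with plain Python standing in for Spark-style operators, and declaratively, with SQL run through Python's built-in sqlite3 module standing in for an operational database. The sample readings, the `readings` table, and the function names are hypothetical; neither Spark nor MemSQL is used here.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample readings: (device_type, watts).
readings = [
    ("kitchen_appliance", 60),
    ("kitchen_appliance", 100),
    ("thermostat", 5),
]

def average_programmatic(rows):
    """Programmatic style: spell out each step (group, then reduce)."""
    groups = defaultdict(list)
    for device_type, watts in rows:
        groups[device_type].append(watts)
    return {k: sum(v) / len(v) for k, v in groups.items()}

def average_declarative(rows):
    """Declarative style: state the result wanted; the SQL engine plans it."""
    conn = sqlite3.connect(":memory:")  # stand-in for an operational database
    conn.execute("CREATE TABLE readings (device_type TEXT, watts REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT device_type, AVG(watts) FROM readings GROUP BY device_type"
    )
    result = dict(cursor.fetchall())
    conn.close()
    return result

programmatic = average_programmatic(readings)
declarative = average_declarative(readings)
```

Both functions return the same mapping; the difference is who decides how the work is carried out, the programmer or the query optimizer.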
22
Subscribing to Kafka
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
0111001010101111101111100000001010111100001110101100000010010010111…
Publish to Kafka Topic
0111001010101111101111100000001010111100001110101100000010010010111…
1110010101000101010001010100010111111010100011110101100011010101000…
0101111000011100101010111110001111011010111100000000101110101100000…
Event added to message queue
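A minimal sketch of the publish step, assuming the event fields are (time, house_id, device_id, watts), names inferred from the table shown later in the deck. The JSON encoding and the in-memory `FakeTopic` class are illustrative stand-ins: the slides do not say which serialization format is used, and a real pipeline would hand the bytes to a Kafka producer client instead.

```python
import json
from collections import deque

class FakeTopic:
    """Hypothetical in-memory stand-in for a Kafka topic (not a Kafka API)."""

    def __init__(self):
        self._messages = deque()

    def publish(self, payload: bytes) -> None:
        # Append the serialized event to the end of the queue.
        self._messages.append(payload)

    def poll(self):
        # Return the oldest unread message, or None if the queue is empty.
        return self._messages.popleft() if self._messages else None

def serialize(event: tuple) -> bytes:
    """Encode a (time, house_id, device_id, watts) tuple as UTF-8 JSON bytes."""
    return json.dumps(list(event)).encode("utf-8")

topic = FakeTopic()
event = ("2015-07-06T16:43:40.33Z", 329280, 23, 60)
topic.publish(serialize(event))
received = topic.poll()
```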
23
Enrich and Transform the Data
Spark polling Kafka for new messages
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
Deserialization
Enrichment
0111001010101111101111100000001010111100001110101100000010010010111…
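The two steps on this slide can be sketched as follows, assuming the message payload is JSON-encoded bytes (the slides do not specify the format). The lookup dictionaries are hypothetical reference data, filled in only with the values that appear on the slide; in the deck this work is done by Spark, which is not used here.

```python
import json

# Hypothetical reference data, filled in only with the values shown on the slide.
ZIP_BY_HOUSE = {329280: 94110}
TYPE_BY_DEVICE = {23: "kitchen_appliance"}

def deserialize(payload: bytes) -> tuple:
    """Decode bytes back into a (time, house_id, device_id, watts) tuple."""
    time, house_id, device_id, watts = json.loads(payload.decode("utf-8"))
    return (time, house_id, device_id, watts)

def enrich(event: tuple) -> tuple:
    """Add zip and device_type by looking up house_id and device_id."""
    time, house_id, device_id, watts = event
    return (
        time,
        house_id,
        ZIP_BY_HOUSE.get(house_id),       # None if the house is unknown
        device_id,
        TYPE_BY_DEVICE.get(device_id),    # None if the device is unknown
        watts,
    )

payload = b'["2015-07-06T16:43:40.33Z", 329280, 23, 60]'
enriched = enrich(deserialize(payload))
```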
24
Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...
time                     house_id  zip    device_id  device_type          watts
2015-07-06T16:43:40.33Z  329280    94110  23         ‘kitchen_appliance’  60
…                        …         …      …          …                    …
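A rough sketch of the persist step, using Python's built-in sqlite3 module as a stand-in for MemSQL. The table name `memcity_table` comes from the slide; the column names are read off the slide's table header, and the column types are assumptions. This does not reproduce the `saveToMemSQL()` call itself, only the kind of INSERT it is shown issuing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in; the deck targets MemSQL
conn.execute(
    """
    CREATE TABLE memcity_table (
        time        TEXT,
        house_id    INTEGER,
        zip         INTEGER,
        device_id   INTEGER,
        device_type TEXT,
        watts       INTEGER
    )
    """
)

enriched_rows = [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
]

# Parameterized INSERT: one placeholder per column, one tuple per row.
conn.executemany(
    "INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", enriched_rows
)
conn.commit()

stored = conn.execute("SELECT * FROM memcity_table").fetchall()
```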
25
Go to Production
Compress development timelines
SELECT ... FROM memcity_table ...
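The query on the slide is elided, so the following is only one hypothetical example of what an application might ask: total watts per zip code. It again uses sqlite3 as a stand-in for MemSQL, with assumed column types; the second and third rows are invented for illustration and do not come from the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in; the deck targets MemSQL
conn.execute(
    "CREATE TABLE memcity_table "
    "(time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)
rows = [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
    # The rows below are invented sample data, not taken from the slides.
    ("2015-07-06T16:43:41.00Z", 329281, 94110, 7, "lighting", 40),
    ("2015-07-06T16:43:41.50Z", 400001, 94107, 9, "kitchen_appliance", 75),
]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)

# Aggregate power draw per zip code, highest first.
totals = conn.execute(
    "SELECT zip, SUM(watts) AS total_watts "
    "FROM memcity_table GROUP BY zip ORDER BY total_watts DESC"
).fetchall()
```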
26
One Architecture for Many Applications
27
Lambda Applies to Real-Time Data Pipelines
Inputs → Message Queue → Transformation → Database → Application
Batch
28
Kafka, Spark, and MemSQL Make it Simple
Batch
Inputs Application
Monitoring real-time Xfinity programming and video health
30
Collect streaming data at scale (hundreds of MemSQL machines)
Proactively diagnose issues
Query ad-hoc and in real-time with full SQL
From 30 minutes to less than 1 second
Real-time Analytics
Real-Time Trend Analytics
32
Massive Ingest and Concurrent Analytics
Instant accuracy to the latest repin
Build real-time analytic applications
Real-time analytics
Watch the Pinterest Demo Video here: https://youtu.be/KXelkQFVz4E
34
Real-Time Segmentation
35
Using Real-Time for Personalization
Ad Servers EC2
Real-time analytics
PostgreSQL
Legacy reports
Monitoring
S3 (replay)
HDFS
Data Science
Vertica
Operational Data Store (ODS)
Star Schema
MicroStrategy
Reach overlap and ad optimization
Over 60,000 queries per second
Millisecond response times
37
Thank You!
Visit MemSQL at Booth #518
Real-Time Demos, T-Shirt Giveaway, Games