Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL
Transcript of Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL
Streaming ETL in Kafka for Everyone with KSQL
Hojjat Jafarpour
Software Engineer, Confluent Inc.
Hojjat Jafarpour
▪ Software Engineer at Confluent
o Started the KSQL project at Confluent
▪ Previously at Tidemark, Quantcast, Informatica and NEC Labs
▪ PhD in Computer Science from UC Irvine
o Data management, pub/sub and streaming
@hojjat
Streaming ETL, with Apache Kafka and Confluent Platform
Kafka Connect : Stream data in and out of Kafka
[Figure: Kafka Connect diagram; only the label "Amazon S3" survives]
Single Message Transform (SMT)
▪ Modify events before storing in Kafka:
o Mask/drop sensitive information
o Set partitioning key
o Store lineage
▪ Modify events going out of Kafka:
o Route high priority events to faster data stores
o Direct events to different Elasticsearch indexes
o Cast data types to match destination
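To make the per-message idea concrete, here is a minimal conceptual sketch in Python of two of the transforms listed above (masking a sensitive field and setting the partitioning key) applied as a chain to one event. Real SMTs are Java classes configured on a Kafka Connect connector; the names `mask_fields`, `set_key_from_field`, and `apply_chain` and the sample event are hypothetical and exist only for illustration.

```python
# Conceptual sketch only: real SMTs run inside Kafka Connect (Java).
# A record is modeled as a (key, value) pair, where value is a dict.

def mask_fields(record, fields):
    """Blank out the listed value fields, leaving the others untouched."""
    key, value = record
    hidden = set(fields)
    return key, {k: (None if k in hidden else v) for k, v in value.items()}

def set_key_from_field(record, field):
    """Use one value field as the record key (the partitioning key)."""
    _, value = record
    return str(value[field]), value

def apply_chain(record, transforms):
    """Apply each single-message transform in order, one record at a time."""
    for transform in transforms:
        record = transform(record)
    return record

# Hypothetical event with no key and a sensitive field.
event = (None, {"userid": "u1", "card_number": "4111-1111", "amount": 42})
chain = [
    lambda r: mask_fields(r, ["card_number"]),
    lambda r: set_key_from_field(r, "userid"),
]
result = apply_chain(event, chain)
print(result)
```

Each transform sees exactly one message and has no memory of earlier ones, which is why joins, aggregations, and windowed filters fall outside what this mechanism can express.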
But I need to join… aggregate… filter…
KSQL from Confluent
A Developer Preview of KSQL
An Open Source Streaming SQL Engine for Apache Kafka™
KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent
▪ Enables stream processing with zero coding required
▪ The simplest way to process streams of data in real-time
▪ Powered by Kafka: scalable, distributed, battle-tested
▪ All you need is Kafka: no complex deployments of bespoke systems for stream processing
KSQL: the Simplest Way to Do Stream Processing
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
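As a rough illustration of what this query computes, the following Python sketch counts authorization attempts per card in 5-second tumbling windows and keeps only the groups with more than 3 attempts. It runs over a finite list, whereas KSQL updates its result continuously as events arrive. The function name `possible_fraud`, the `(timestamp_ms, card_number)` event shape, and the sample data are assumptions made for this example.

```python
from collections import Counter

def possible_fraud(attempts, size_ms=5000, threshold=3):
    """Count attempts per (card, tumbling window) and keep counts > threshold.

    attempts: iterable of (timestamp_ms, card_number) pairs (assumed shape).
    Each event falls into exactly one window, identified by its start time.
    """
    counts = Counter()
    for ts_ms, card in attempts:
        window_start = ts_ms - ts_ms % size_ms  # align to the window boundary
        counts[(card, window_start)] += 1
    # Equivalent of HAVING count(*) > 3
    return {group: n for group, n in counts.items() if n > threshold}

# Hypothetical data: card "A" is tried four times within the first 5 seconds.
attempts = [
    (0, "A"), (1000, "A"), (2000, "A"), (4999, "A"),
    (5000, "A"),            # lands in the next window
    (1000, "B"), (6000, "B"),
]
flagged = possible_fraud(attempts)
print(flagged)
```

The attempt at 5000 ms belongs to the second window, so it does not add to the first window's count.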
KSQL Concepts
▪ STREAM and TABLE as first-class citizens
o Interpretations of topic content
▪ STREAM - data in motion
▪ TABLE - collected state of a stream
o One record per key (per window)
o Current values (compacted topic) ← Not yet in KSQL
▪ STREAM – TABLE Joins
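The two interpretations of the same topic content can be sketched in a few lines of Python: the stream is every event in order, and the table is what remains when only the latest value per key is kept. The function name `materialize` and the sample events are hypothetical, and the sketch ignores windows and deletions.

```python
def materialize(stream):
    """Fold a stream of (key, value) events into a table: one row per key,
    holding the most recent value seen for that key."""
    table = {}
    for key, value in stream:
        table[key] = value  # a later event for the same key overwrites the row
    return table

# Hypothetical topic content: user -> region updates.
events = [("alice", "EU"), ("bob", "US"), ("alice", "ASIA")]

as_stream = list(events)      # STREAM view: all three events, in order
as_table = materialize(events)  # TABLE view: current state per key
print(len(as_stream), as_table)
```

Read as a stream, the topic holds three facts; read as a table, it holds two rows, and the earlier value for `alice` has been superseded.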
Window Aggregations
Three types supported (same as KStreams):
● TUMBLING: Fixed-size, non-overlapping, gap-less windows
• SELECT ip, COUNT(*) AS hits FROM clickstream
  WINDOW TUMBLING (SIZE 1 MINUTE) GROUP BY ip;
● HOPPING: Fixed-size, overlapping windows
• SELECT ip, SUM(bytes) AS bytes_per_ip_and_bucket FROM clickstream
  WINDOW HOPPING (SIZE 20 SECONDS, ADVANCE BY 5 SECONDS) GROUP BY ip;
● SESSION: Dynamically-sized, non-overlapping, data-driven windows
• SELECT ip, SUM(bytes) AS bytes_per_ip FROM clickstream
  WINDOW SESSION (20 SECONDS) GROUP BY ip;
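The difference between hopping and session windows is easiest to see by assigning timestamps to windows by hand. The Python sketch below does that for the sizes used above (20-second windows advancing by 5 seconds, and a 20-second session gap). The function names are hypothetical, timestamps are plain seconds, windows are assumed to be aligned to time 0, and treating a gap exactly equal to the limit as part of the same session is an assumption of this sketch.

```python
def hopping_window_starts(ts, size, advance):
    """Return the start times of every hopping window [start, start + size)
    that contains ts. Windows start at multiples of `advance`, from 0."""
    start = ts - ts % advance  # latest window that has already opened
    starts = []
    while start > ts - size and start >= 0:
        starts.append(start)
        start -= advance
    return sorted(starts)

def session_windows(timestamps, gap):
    """Group timestamps into sessions: a new session begins whenever the
    distance to the previous event exceeds `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1][1] = ts        # extend the current session
        else:
            sessions.append([ts, ts])   # open a new session
    return [tuple(s) for s in sessions]

# An event at t=27s falls into four overlapping 20s windows (20 / 5 = 4).
hops = hopping_window_starts(27, size=20, advance=5)

# Five events; the 32s silence between 18 and 50 splits them into two sessions.
sessions = session_windows([0, 5, 18, 50, 60], gap=20)
print(hops, sessions)
```

A tumbling window is the special case of a hopping window whose advance equals its size, so each event lands in exactly one window.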
Streaming ETL, powered by Apache Kafka and Confluent Platform
Simple Web Analytics Pipeline
● Pageview stream
● User table
● Materialized views
o Region visitor count
o Region visitor demography
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (kafka_topic='pageviews', value_format='JSON');
CREATE TABLE users (registertime BIGINT, gender VARCHAR, regionid VARCHAR, userid VARCHAR)
  WITH (kafka_topic='users', value_format='JSON');
Simple Web Analytics Pipeline
Region visitor count
CREATE STREAM joined_pageviews AS
SELECT users.userid AS userid, pageid, regionid, gender
FROM pageviews LEFT JOIN users ON pageviews.userid = users.userid;
CREATE TABLE region_visitor_count AS
SELECT regionid , COUNT(*) AS visit_count
FROM joined_pageviews
WINDOW TUMBLING (size 30 second)
GROUP BY regionid;
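The two statements above can be read as "enrich each pageview with the user's row, then count per region in 30-second windows". The following Python sketch mirrors that logic over small in-memory data. It is not how KSQL executes the query. The sample rows, the use of `viewtime` (in milliseconds) as the record timestamp, and the choice to skip rows with no region when counting are all assumptions of this sketch.

```python
from collections import Counter

def left_join(pageviews, users):
    """Stream-table LEFT JOIN: every pageview is kept; user columns are
    None when the users table has no row for that userid."""
    joined = []
    for view in pageviews:
        user = users.get(view["userid"], {})
        joined.append({
            "viewtime": view["viewtime"],  # carried along as the timestamp
            "userid": view["userid"],
            "pageid": view["pageid"],
            "regionid": user.get("regionid"),
            "gender": user.get("gender"),
        })
    return joined

def region_visitor_count(joined, size_ms=30_000):
    """Count joined rows per (regionid, 30-second tumbling window start)."""
    counts = Counter()
    for row in joined:
        if row["regionid"] is None:
            continue  # sketch's choice: rows without a region are not grouped
        window_start = row["viewtime"] - row["viewtime"] % size_ms
        counts[(row["regionid"], window_start)] += 1
    return dict(counts)

# Hypothetical table and stream contents.
users = {
    "u1": {"regionid": "r1", "gender": "F"},
    "u2": {"regionid": "r2", "gender": "M"},
}
pageviews = [
    {"viewtime": 1_000, "userid": "u1", "pageid": "p1"},
    {"viewtime": 2_000, "userid": "u2", "pageid": "p2"},
    {"viewtime": 29_000, "userid": "u1", "pageid": "p3"},
    {"viewtime": 31_000, "userid": "u1", "pageid": "p1"},
    {"viewtime": 32_000, "userid": "u9", "pageid": "p4"},  # unknown user
]
joined = left_join(pageviews, users)
visit_counts = region_visitor_count(joined)
print(visit_counts)
```

Adding `gender` to the grouping key in `region_visitor_count` would give the demography variant of the same view.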
Simple Web Analytics Pipeline
Region visitor demography
CREATE TABLE region_visitor_demo_count AS
SELECT regionid, gender, COUNT(*) AS visit_count
FROM joined_pageviews
WINDOW TUMBLING (size 30 second)
GROUP BY gender, regionid;
Streaming ETL, powered by Apache Kafka and Confluent Platform
Confluent Platform: Enterprise Streaming based on Apache Kafka™
[Figure: Confluent Platform architecture diagram]
Sources: Database Changes, Log Events, IoT Data, Web Events, …
Data Integration: CRM, Data Warehouse, Database, Hadoop, …
Real-time Applications: Monitoring, Analytics, Custom Apps, Transformations, …
Legend: Apache Open Source | Confluent Open Source | Confluent Enterprise
Confluent Platform
▪ Apache Kafka™: Core | Connect API | Streams API
▪ Data Compatibility: Schema Registry
▪ Monitoring & Administration: Confluent Control Center | Security
▪ Operations: Replicator | Auto Data Balancing
▪ Development and Connectivity: Clients | Connectors | REST Proxy | KSQL | CLI
Date to remember
• Kafka Summit 2018
• April 23-24 in London!
• More details: https://kafka-summit.org/
THANK YOU
@hojjat
Please stay in touch
Any questions?
https://github.com/confluentinc/ksql/
https://www.confluent.io/download/