Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL

Hojjat Jafarpour, Software Engineer, Confluent Inc.

Transcript of Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL

Hojjat Jafarpour

▪ Software Engineer at Confluent
o Started the KSQL project at Confluent
▪ Previously at Tidemark, Quantcast, Informatica and NEC Labs
▪ PhD in Computer Science from UC Irvine
o Data management, pub/sub and streaming

[email protected]

@hojjat

Streaming ETL, with Apache Kafka and Confluent Platform

Kafka Connect: Stream data in and out of Kafka

[Diagram; recoverable label: Amazon S3]

Single Message Transform (SMT)

▪ Modify events before storing in Kafka:
o Mask/drop sensitive information
o Set partitioning key
o Store lineage
▪ Modify events going out of Kafka:
o Route high priority events to faster data stores
o Direct events to different Elasticsearch indexes
o Cast data types to match destination
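One way to picture a single message transform is as a function that sees exactly one record at a time. The Python sketch below only illustrates that idea. It is not the Kafka Connect API, which is Java and configuration driven. The record layout, the field names (`card_number`, `user_id`) and the helper functions are all hypothetical.

```python
from typing import Any, Dict

Record = Dict[str, Any]  # hypothetical record shape: {"key": ..., "value": {...}}


def mask_field(record: Record, field: str) -> Record:
    """Return a copy of the record with one value field blanked out."""
    value = dict(record["value"])
    if field in value:
        value[field] = "****"
    return {**record, "value": value}


def set_key_from_field(record: Record, field: str) -> Record:
    """Return a copy of the record keyed by one of its value fields."""
    return {**record, "key": record["value"].get(field)}


def apply_transforms(record: Record) -> Record:
    """Apply the transforms in order to a single record (no state, no joins)."""
    record = mask_field(record, "card_number")
    record = set_key_from_field(record, "user_id")
    return record


event = {"key": None, "value": {"user_id": "u1", "card_number": "4111-1111", "amount": 12}}
transformed = apply_transforms(event)
print(transformed)
```

Because each call sees only one record and keeps no state, this style cannot express joins or aggregations across records.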

But I need to join…aggregate…filter

KSQL from Confluent

A Developer Preview of KSQL

An Open Source Streaming SQL Engine for Apache Kafka™

KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent

▪ Enables stream processing with zero coding required

▪ The simplest way to process streams of data in real-time

▪ Powered by Kafka: scalable, distributed, battle-tested

▪ All you need is Kafka: no complex deployments of bespoke systems for stream processing

KSQL: the Simplest Way to Do Stream Processing

CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
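To make the query's semantics concrete, here is a small Python sketch of what it computes: a count per card number per 5-second tumbling window, keeping only counts above 3. This is a batch simulation over an in-memory list, not how KSQL executes (KSQL updates results continuously as events arrive). The `(timestamp_ms, card_number)` event shape and the sample data are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_MS = 5_000  # 5-second tumbling window, as in the query
THRESHOLD = 3      # mirrors HAVING count(*) > 3


def possible_fraud(attempts: Iterable[Tuple[int, str]]) -> Dict[Tuple[str, int], int]:
    """Count attempts per (card_number, window_start) and keep counts above THRESHOLD.

    Each attempt is a hypothetical (timestamp_ms, card_number) pair.
    """
    counts: Counter = Counter()
    for ts_ms, card in attempts:
        # Tumbling windows do not overlap, so each timestamp maps to exactly one window.
        window_start = ts_ms - (ts_ms % WINDOW_MS)
        counts[(card, window_start)] += 1
    return {key: n for key, n in counts.items() if n > THRESHOLD}


attempts = [
    (1_000, "card-A"), (1_500, "card-A"), (2_000, "card-A"), (4_900, "card-A"),  # 4 in [0, 5000)
    (5_100, "card-A"),                                                            # next window
    (1_200, "card-B"), (3_300, "card-B"),                                         # only 2 attempts
]
flagged = possible_fraud(attempts)
print(flagged)
```

Only `card-A` in the window starting at 0 exceeds the threshold. The fifth `card-A` attempt falls into the next window and is counted separately.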

KSQL Concepts

▪ STREAM and TABLE as first-class citizens

o Interpretations of topic content

▪ STREAM - data in motion

▪ TABLE - collected state of a stream

o One record per key (per window)

o Current values (compacted topic) ← Not yet in KSQL

▪ STREAM – TABLE Joins
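The two interpretations of the same topic content can be sketched in a few lines of Python. This is a conceptual illustration only. The `(key, value)` record shape and the sample data are assumptions, and real KSQL tables are maintained incrementally rather than rebuilt from a list.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str]  # hypothetical (key, value) record read from a topic


def as_stream(events: Iterable[Event]) -> List[Event]:
    """STREAM view: every record is an independent fact; nothing is overwritten."""
    return list(events)


def as_table(events: Iterable[Event]) -> Dict[str, str]:
    """TABLE view: a later record for a key replaces the earlier one (one row per key)."""
    table: Dict[str, str] = {}
    for key, value in events:
        table[key] = value
    return table


topic = [("alice", "Paris"), ("bob", "Rome"), ("alice", "Berlin")]
stream_view = as_stream(topic)
table_view = as_table(topic)
print(len(stream_view), table_view)
```

The stream keeps all three records, while the table holds only the latest value for each key.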

Window Aggregations

Three types supported (same as KStreams):

● TUMBLING: Fixed-size, non-overlapping, gap-less windows
• SELECT ip, count(*) AS hits FROM clickstream WINDOW TUMBLING (size 1 minute) GROUP BY ip;
● HOPPING: Fixed-size, overlapping windows
• SELECT ip, SUM(bytes) AS bytes_per_ip_and_bucket FROM clickstream WINDOW HOPPING (size 20 second, advance by 5 second) GROUP BY ip;
● SESSION: Dynamically-sized, non-overlapping, data-driven windows
• SELECT ip, SUM(bytes) AS bytes_per_ip FROM clickstream WINDOW SESSION (20 second) GROUP BY ip;
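The difference between the three window types comes down to how an event timestamp is assigned to windows. The Python sketch below shows one plausible assignment rule for each, using seconds as the time unit. The function names are hypothetical. The sketch assumes windows are aligned to time zero, and that a session is extended when the gap to its last event is at most the inactivity gap.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # (start, end) in seconds


def tumbling_window(ts: int, size: int) -> Window:
    """Tumbling: each timestamp belongs to exactly one window [start, start + size)."""
    start = ts - ts % size
    return (start, start + size)


def hopping_windows(ts: int, size: int, advance: int) -> List[Window]:
    """Hopping: windows start every `advance` seconds, so one timestamp can fall in several."""
    windows = []
    start = ts - ts % advance  # latest window start at or before ts
    # Assumption: windows are aligned to time zero, so starts are never negative.
    while start > ts - size and start >= 0:
        windows.append((start, start + size))
        start -= advance
    return sorted(windows)


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Session: events closer together than `gap` share a session; returns (first, last) event times."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], ts)  # extend the current session
        else:
            sessions.append((ts, ts))             # inactivity gap exceeded: new session
    return sessions


print(tumbling_window(75, 60))
print(hopping_windows(27, 20, 5))
print(session_windows([0, 10, 25, 60, 70], 20))
```

With a 20-second size and a 5-second advance, the event at second 27 lands in four overlapping windows, which is why hopping aggregates count the same event more than once.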

Streaming ETL, powered by Apache Kafka and Confluent Platform

Simple Web Analytics Pipeline

● Pageview stream
● User table
● Materialized views
o Region visitor count
o Region visitor demography

CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (kafka_topic='pageviews', value_format='JSON');

CREATE TABLE users (registertime BIGINT, gender VARCHAR, regionid VARCHAR, userid VARCHAR)
  WITH (kafka_topic='users', value_format='JSON');

Simple Web Analytics Pipeline

Region visitor count

CREATE STREAM joined_pageviews AS
  SELECT users.userid AS userid, pageid, regionid, gender
  FROM pageviews LEFT JOIN users ON pageviews.userid = users.userid;

CREATE TABLE region_visitor_count AS
  SELECT regionid, COUNT(*) AS visit_count
  FROM joined_pageviews
  WINDOW TUMBLING (size 30 second)
  GROUP BY regionid;
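The two statements can be read as a lookup followed by a windowed count. The Python sketch below simulates that pipeline on in-memory data. It is an illustration, not KSQL's execution model. The sample users and pageviews are invented, the users table is treated as static, timestamps are in seconds, and skipping rows that have no region is a simplifying assumption of the sketch.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical contents of the users table, keyed by userid.
users: Dict[str, Dict[str, str]] = {
    "u1": {"regionid": "r1", "gender": "F"},
    "u2": {"regionid": "r2", "gender": "M"},
}

# Hypothetical pageview events: (viewtime_seconds, userid, pageid).
pageviews: List[Tuple[int, str, str]] = [
    (1, "u1", "p1"), (5, "u2", "p2"), (12, "u1", "p3"), (31, "u1", "p1"), (40, "u9", "p4"),
]


def left_join(views, user_table) -> List[Dict[str, Optional[object]]]:
    """LEFT JOIN: every pageview is emitted; user-side columns are None without a match."""
    joined = []
    for viewtime, userid, pageid in views:
        user = user_table.get(userid)
        joined.append({
            "viewtime": viewtime,
            # userid is taken from the users side, as in the query, so it is None when unmatched.
            "userid": userid if user else None,
            "pageid": pageid,
            "regionid": user["regionid"] if user else None,
            "gender": user["gender"] if user else None,
        })
    return joined


def region_visitor_count(joined, size: int = 30) -> Dict[Tuple[str, int], int]:
    """Count joined pageviews per (regionid, 30-second tumbling window start)."""
    counts: Counter = Counter()
    for row in joined:
        if row["regionid"] is None:
            continue  # assumption of this sketch: rows without a region are not counted
        window_start = row["viewtime"] - row["viewtime"] % size
        counts[(row["regionid"], window_start)] += 1
    return dict(counts)


joined_pageviews = left_join(pageviews, users)
counts = region_visitor_count(joined_pageviews)
print(counts)
```

In a running system the users table changes over time, so each pageview would be joined against the table's contents at the moment the event is processed rather than against a fixed dictionary.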

Simple Web Analytics Pipeline

Region visitor demography

CREATE TABLE region_visitor_demo_count AS
  SELECT regionid, gender, COUNT(*) AS visit_count
  FROM joined_pageviews
  WINDOW TUMBLING (size 30 second)
  GROUP BY gender, regionid;

Streaming ETL, powered by Apache Kafka and Confluent Platform

Confluent Platform: Enterprise Streaming based on Apache Kafka™

[Diagram: Confluent Platform]

Data sources: Database Changes | Log Events | IoT Data | Web Events | …
Data Integration: CRM | Data Warehouse | Database | Hadoop
Real-time Applications: Monitoring | Analytics | Custom Apps | Transformations

Legend: Apache Open Source | Confluent Open Source | Confluent Enterprise

Confluent Platform components:
▪ Apache Kafka™: Core | Connect API | Streams API
▪ Data Compatibility: Schema Registry
▪ Monitoring & Administration: Confluent Control Center | Security
▪ Operations: Replicator | Auto Data Balancing
▪ Development and Connectivity: Clients | Connectors | REST Proxy | KSQL | CLI

Date to remember

• Kafka Summit 2018
• April 23-24 in London!
• More details: https://kafka-summit.org/

THANK YOU

[email protected]

@hojjat

Please stay in touch

Any questions?

https://github.com/confluentinc/ksql/

https://www.confluent.io/download/