ETL is dead; long-live streams

103
ETL is dead; long-live streams Neha Narkhede, Co-founder & CTO, Confluent

Transcript of ETL is dead; long-live streams

Page 1: ETL is dead; long-live streams

ETL is dead; long-live streams

Neha Narkhede, Co-founder & CTO, Confluent

Page 2: ETL is dead; long-live streams

“Data and data systems have really

changed in the past decade

Page 3: ETL is dead; long-live streams

Old world: Two popular locations for data

Operational databases Relational data warehouse

DB

DB

DB

DB DWH

Page 4: ETL is dead; long-live streams

“Several recent data trends are driving a dramatic change in the ETL architecture

Page 5: ETL is dead; long-live streams

“#1: Single-server databases are replaced by a myriad of distributed data

platforms that operate at company-wide scale

Page 6: ETL is dead; long-live streams

“#2: Many more types of data sources beyond transactional data - logs, sensors,

metrics...

Page 7: ETL is dead; long-live streams

“#3: Stream data is increasingly ubiquitous; need for faster processing

than daily

Page 8: ETL is dead; long-live streams

“The end result? This is what data integration ends up looking like in

practice

Page 9: ETL is dead; long-live streams

App App App App

search

HadoopDWH

monitoring securityMQ MQ

cachecache

Page 10: ETL is dead; long-live streams

A giant mess!

App App App App

search

HadoopDWH

monitoring securityMQ MQ

cachecache

Page 11: ETL is dead; long-live streams

“We will see how transitioning to streams

cleans up this mess and works towards...

Page 12: ETL is dead; long-live streams

Streaming platform

DWH Hadoop

security

App App App App

search

NoSQL

monitoring

request-response

messaging ORstream processing

streaming data pipelines

changelogs

Page 13: ETL is dead; long-live streams

A short history of data integration

Page 14: ETL is dead; long-live streams

“Surfaced in the 1990s in retail

organizations for analyzing buyer trends

Page 15: ETL is dead; long-live streams

“Extract data from databases

Transform into destination warehouse schema

Load into a central data warehouse

Page 16: ETL is dead; long-live streams

“BUT … ETL tools have been around for a long time, data coverage in data

warehouses is still low! WHY?

Page 17: ETL is dead; long-live streams

Etl has drawbacks

Page 18: ETL is dead; long-live streams

“#1: The need for a global schema

Page 19: ETL is dead; long-live streams

“#2: Data cleansing and curation is manual and fundamentally error-prone

Page 20: ETL is dead; long-live streams

“#3: Operational cost of ETL is high; it is slow; time and resource intensive

Page 21: ETL is dead; long-live streams

“#4: ETL tools were built to narrowly focus on connecting databases and the data warehouse in a batch fashion

Page 22: ETL is dead; long-live streams

“Early take on real-time ETL =

Enterprise Application Integration (EAI)

Page 23: ETL is dead; long-live streams

“EAI: A different class of data integration technology for connecting applications in

real-time

Page 24: ETL is dead; long-live streams

“EAI employed Enterprise Service Buses

and MQs; weren’t scalable

Page 25: ETL is dead; long-live streams

ETL and EAI are outdated!

Page 26: ETL is dead; long-live streams

Old world: scale or timely data, pick one

real-

time

scale

batc

hEAI

ETL

real-time BUTnot scalable

scalable BUT batch

Page 27: ETL is dead; long-live streams

“Data integration and ETL in the modern world need a complete revamp

Page 28: ETL is dead; long-live streams

new world: streaming, real-time and scalable

real-

time

scale

EAI

ETL

StreamingPlatform

real-time BUTnot scalable

real-timeAND

scalable

scalable BUT batch

batc

h

Page 29: ETL is dead; long-live streams

“Modern streaming world has new set of requirements for data integration

Page 30: ETL is dead; long-live streams

“#1: Ability to process high-volume and high-diversity data

Page 31: ETL is dead; long-live streams

“#2 Real-time from the grounds up; a fundamental transition to event-centric thinking

Page 32: ETL is dead; long-live streams

Event-Centric Thinking

Streaming Platform

“A product was viewed”

HadoopWeb app

Page 33: ETL is dead; long-live streams

Event-Centric Thinking

Streaming Platform

“A product was viewed”

HadoopWeb app

mobile app

APIs

Page 34: ETL is dead; long-live streams

mobile app

web app

APIs

Streaming Platform

Hadoop

Security

Monitoring

Rec engine“A product was viewed”

Event-Centric Thinking

Page 35: ETL is dead; long-live streams

“Event-centric thinking, when applied at a company-wide scale, leads to this

simplification ...

Page 36: ETL is dead; long-live streams

Streaming platform

DWH Hadoop

App

App App App App

App

App

App

request-response

messaging ORstream processing

streaming data pipelines

changelogs

Page 37: ETL is dead; long-live streams

“#3: Enable forward-compatible data architecture; the ability to add more applications that need to process the

same data … differently

Page 38: ETL is dead; long-live streams

“To enable forward compatibility, redefine the T in ETL:

Clean data in; Clean data out

Page 39: ETL is dead; long-live streams

app logs app logs

app logs

app logs

#1: Extract as unstructured text

#2: Transform1 = data cleansing = “what is a product view”

#4: Transform2 = drop PII fields”

#3: Load into DWH

DWH

Page 40: ETL is dead; long-live streams

#1: Extract as unstructured text

#2: Transform1 = data cleansing = “what is a product view”

#4: Transform2 = drop PII fields”

DWH

#2: Transform1 = data cleansing = “what is a product view”

#4: Transform2 = drop PII fields”

Cassandra

#1: Extract as unstructured textagain

#3: Load cleansed data

#3: Load cleansed data

Page 41: ETL is dead; long-live streams

#1: Extract as structured product view events

#2: Transforms = drop PII fields”

#4:.1 Load product view stream

#4: Load filtered product View stream

DWH

Cassandra

StreamingPlatform

#4.2 Load filtered productview stream

Page 42: ETL is dead; long-live streams

“To enable forward compatibility, redefine the T in ETL:

Data transformations, not data cleansing!

Page 43: ETL is dead; long-live streams

#1: Extract once as structured product view events

#2: Transform once = drop PII fields” and enrich with product metadata #4.1: Load product

views stream

#4: Load filtered and

enriched product views stream

DWH

Cassandra

StreamingPlatform

#4.2: Load filtered and enriched product views stream

Page 44: ETL is dead; long-live streams

“Forward compatibility = Extract clean-data once; Transform many

different ways before Loading into respective destinations … as and when required

Page 45: ETL is dead; long-live streams

“In summary, needs of modern data integration solution?

Scale, diversity, latency and forward compatibility

Page 46: ETL is dead; long-live streams

Requirements for a modern streaming data integration solution

- Fault tolerance- Parallelism- Latency- Delivery semantics- Operations and

monitoring- Schema management

Page 47: ETL is dead; long-live streams

Data integration: platform vs tool

Central, reusable infrastructure for many use cases

One-off, non-reusable solution for a particular use case

Page 48: ETL is dead; long-live streams

New shiny future of etl: a streaming platform

NoSQL

RDBMS

Hadoop

DWH

Apps Apps Apps

Search Monitoring RT analytics

Page 49: ETL is dead; long-live streams

“Streaming platform serves as the central nervous system for a

company’s data in the following ways ...

Page 50: ETL is dead; long-live streams

“#1: Serves as the real-time, scalable messaging bus for applications; no

EAI

Page 51: ETL is dead; long-live streams

“#2: Serves as the source-of-truth pipeline for feeding all data processing

destinations; Hadoop, DWH, NoSQL systems and more

Page 52: ETL is dead; long-live streams

“#3: Serves as the building block for stateful stream processing

microservices

Page 53: ETL is dead; long-live streams

“Batch data integration

Streaming

Page 54: ETL is dead; long-live streams

“Batch ETL

Streaming

Page 55: ETL is dead; long-live streams

a short history of data integration

drawbacks of ETL

needs and requirements for a streaming platform

new, shiny future of ETL: a streaming platform

What does a streaming platform look like and how it enables Streaming ETL?

Page 56: ETL is dead; long-live streams

Apache kafka: a distributed streaming platform

Page 57: ETL is dead; long-live streams

57Confidential

Apache kafka 6 years ago

57

Page 58: ETL is dead; long-live streams

58Confidential

> 1,400,000,000,000 messages processed / day

58

Page 59: ETL is dead; long-live streams

Now Adopted at 1000s of companies worldwide

Page 60: ETL is dead; long-live streams

“What role does Kafka play in the new

shiny future for data integration?

Page 61: ETL is dead; long-live streams

“#1: Kafka is the de-facto storage of choice for stream data

Page 62: ETL is dead; long-live streams

The log

0 1 2 3 4 5 6 7

next write

reader 1 reader 2

Page 63: ETL is dead; long-live streams

The log & pub-sub

0 1 2 3 4 5 6 7

publisher

subscriber 1 subscriber 2

Page 64: ETL is dead; long-live streams

“#2: Kafka offers a scalable messaging backbone for application

integration

Page 65: ETL is dead; long-live streams

Kafka messaging APIs: scalable eai

app

Messaging APIsproduce(message) consume(message)

Page 66: ETL is dead; long-live streams

“#3: Kafka enables building streaming data pipelines (E & L in ETL)

Page 67: ETL is dead; long-live streams

Kafka’s Connect API: Streaming data ingestion

app

Messaging APIsMessaging APIs

Conn

ect A

PI

Conn

ect A

PI

app

source sink

Extract Load

Page 68: ETL is dead; long-live streams

“#4: Kafka is the basis for stream processing and transformations

Page 69: ETL is dead; long-live streams

Kafka’s streams API: stream processing (transforms)

Messaging API

Streams API

apps

apps

Conn

ect A

PI

Conn

ect A

PI

source sink

Extract Load

Transforms

Page 70: ETL is dead; long-live streams

Kafka’s connect API =

E and L in Streaming ETL

Page 71: ETL is dead; long-live streams

Connectors!

NoSQL

RDBMS

Hadoop

DWH

Search MonitoringRT analytics

Apps Apps Apps

Page 72: ETL is dead; long-live streams

How to keep data centers in-sync?

Page 73: ETL is dead; long-live streams

Sources and sinks

Conn

ect A

PI

Conn

ect A

PI

source sink

Extract Load

Page 74: ETL is dead; long-live streams

changelogs

Page 75: ETL is dead; long-live streams

Transforming changelogs

Page 76: ETL is dead; long-live streams

Kafka’s Connect API = Connectors Made Easy!

- Scalability: Leverages Kafka for scalability- Fault tolerance: Builds on Kafka’s fault tolerance model- Management and monitoring: One way of monitoring all

connectors- Schemas: Offers an option for preserving schemas

from source to sink

Page 77: ETL is dead; long-live streams

Kafka all the things!

Conn

ect A

PI

Page 78: ETL is dead; long-live streams

Kafka’s streams API =

The T in STREAMING ETL

Page 79: ETL is dead; long-live streams

“Stream processing =

transformations on stream data

Page 80: ETL is dead; long-live streams

2 visions for stream processing

Real-time Mapreduce Event-driven microservicesVS

Page 81: ETL is dead; long-live streams

2 visions for stream processing

Real-time Mapreduce Event-driven microservicesVS- Central cluster- Custom packaging,

deployment & monitoring

- Suitable for analytics-type use cases

- Embedded library in any Java app

- Just Kafka and your app

- Makes stream processing accessible to any use case

Page 82: ETL is dead; long-live streams

Vision 1: real-time mapreduce

Page 83: ETL is dead; long-live streams

Vision 2: event-driven microservices => Kafka’s streams API

Streams API

microservice

Transforms

Page 84: ETL is dead; long-live streams

“Kafka’s Streams API = Easiest way to do

stream processing using Kafka

Page 85: ETL is dead; long-live streams

“#1: Powerful and lightweight Java library; need just Kafka and your app

app

Page 86: ETL is dead; long-live streams

“#2: Convenient DSL with all sorts of operators: join(), map(), filter(), windowed

aggregates etc

Page 87: ETL is dead; long-live streams

Word count program using Kafka’s streams API

Page 88: ETL is dead; long-live streams

“#3: True event-at-a-time stream processing; no microbatching

Page 89: ETL is dead; long-live streams

“#4: Dataflow-style windowing based on event-time; handles late-arriving data

Page 90: ETL is dead; long-live streams

“#5: Out-of-the-box support for local state; supports fast stateful processing

Page 91: ETL is dead; long-live streams

External state

Page 92: ETL is dead; long-live streams

local state

Page 93: ETL is dead; long-live streams

Fault-tolerant local state

Page 94: ETL is dead; long-live streams

“#6: Kafka’s Streams API allows reprocessing; useful to upgrade apps or

do A/B testing

Page 95: ETL is dead; long-live streams

reprocessing

Page 96: ETL is dead; long-live streams

Real-time dashboard for security monitoring

Page 97: ETL is dead; long-live streams

Kafka’s streams api: simple is beautiful

Vision 1

Vision 2

Page 98: ETL is dead; long-live streams

Logs unify batch and stream processing

Page 99: ETL is dead; long-live streams

Streams API

app

sinksource

Conn

ect A

PI

Conn

ect A

PI

Transforms

LoadExtract

New shiny future of ETL: Kafka

Page 100: ETL is dead; long-live streams

A giant mess!

App App App App

search

HadoopDWH

monitoring securityMQ MQ

cachecache

Page 101: ETL is dead; long-live streams

All your data … everywhere … now

Streaming platform

DWH Hadoop

App

App App App App

App

App

App

request-response

messaging ORstream processing

streaming data pipelines

changelogs

Page 102: ETL is dead; long-live streams

VISION: All your data … everywhere … now

Streaming platform

DWH Hadoop

App

App App App App

App

App

App

request-response

messaging ORstream processing

streaming data pipelines

changelogs

Page 103: ETL is dead; long-live streams

Thank you!@nehanarkhede