From a Kafkaesque Story to the Promised Land


Description

LivePerson moved from an ETL-based data platform to a new data platform built on emerging technologies from the Open Source community: Hadoop, Kafka, Storm, Avro, and more. This presentation tells that story and focuses on Kafka.

Transcript of From a Kafkaesque Story to the Promised Land

Page 1: From a kafkaesque story to The Promised Land


From a Kafkaesque Story to the Promised Land

7/7/2013, Ran Silberman

Page 2: From a kafkaesque story to The Promised Land

Open Source paradigm

The Cathedral & the Bazaar by Eric S. Raymond, 1999: the struggle between top-down and bottom-up design

Page 3: From a kafkaesque story to The Promised Land

Challenges of a data platform [1]

• High throughput

• Horizontal scale to address growth

• High availability of data services

• No Data loss

• Satisfy Real-Time demands

• Enforce structural data with schemas

• Process Big Data and Enterprise Data

• Single Source of Truth (SSOT)

Page 4: From a kafkaesque story to The Promised Land

SLAs of the data platform

[Diagram: real-time servers feed a central Data Bus that serves two kinds of consumers]

• Offline customers (BI DWH)
  SLA: 1. 98% of data in < 1/2 hr; 2. 99.999% in < 4 hrs

• Real-time customers (real-time dashboards)
  SLA: 1. 98% of messages in < 500 msec; 2. no send > 2 sec

Page 5: From a kafkaesque story to The Promised Land

Legacy data flow in LivePerson

[Diagram] Real-time servers → ETL (Sessionize, Modeling, Schema View) → BI DWH (Oracle) → View Reports → Customers

Page 6: From a kafkaesque story to The Promised Land

1st phase - move to Hadoop

[Diagram] Real-time servers → HDFS (Hadoop), where the ETL steps (Sessionize, Modeling, Schema View) run; an MR job transfers the data to the BI DWH (Vertica) → View Reports → Customers

Page 7: From a kafkaesque story to The Promised Land

2. Move to Kafka

[Diagram] Real-time servers → Kafka (Topic-1) → HDFS (Hadoop); an MR job transfers the data to the BI DWH (Vertica) → View Reports → Customers

Page 8: From a kafkaesque story to The Promised Land

3. Integrate with new producers

[Diagram] Real-time servers and new real-time servers → Kafka (Topic-1, Topic-2) → HDFS (Hadoop); an MR job transfers the data to the BI DWH (Vertica) → View Reports → Customers

Page 9: From a kafkaesque story to The Promised Land

4. Add real-time BI

[Diagram] As before, with Kafka (Topic-1, Topic-2) now also feeding a Storm topology for real-time BI; the Hadoop path (MR job → BI DWH (Vertica) → View Reports → Customers) is unchanged

Page 10: From a kafkaesque story to The Promised Land

5. Standardize the data model using Avro

[Diagram] As before, with Camus loading the data from Kafka (Topic-1, Topic-2) into HDFS (Hadoop); the Storm topology and the MR job to the BI DWH (Vertica) remain

Page 11: From a kafkaesque story to The Promised Land

6. Define a Single Source of Truth (SSOT)

[Diagram] Same architecture as the previous step; HDFS (Hadoop) now serves as the single source of truth

Page 12: From a kafkaesque story to The Promised Land

Kafka [2] as Backbone for Data

• Central "Message Bus"

• Support multiple topics (MQ style)

• Write ahead to files

• Distributed & Highly Available

• Horizontal Scale

• High throughput (tens of MB/sec per server)

• Service is agnostic to consumers' state

• Retention policy

Page 13: From a kafkaesque story to The Promised Land

Kafka Architecture

Page 14: From a kafkaesque story to The Promised Land

Kafka Architecture cont.

[Diagram] Producer 1, Producer 2, and Producer 3 write to a cluster of three broker nodes (Node 1, Node 2, Node 3) coordinated by Zookeeper; multiple consumers read from the brokers

Page 15: From a kafkaesque story to The Promised Land

Kafka Architecture cont.

[Diagram] Producer 1 and Producer 2 write to Topic1 and Topic2, spread across four broker nodes (Node 1 to Node 4) coordinated by Zookeeper; Consumer 1, Consumer 2, and Consumer 3 form a consumer group (Group1)

Page 16: From a kafkaesque story to The Promised Land

Kafka replay messages

[Diagram] Each broker (Node 3, Node 4) keeps a log per partition between a min offset and a max offset; the consumer's current offset is stored in Zookeeper

FetchRequest fetchRequest = new FetchRequest(topic, partition, offset, size);

currentOffset: taken from Zookeeper
Earliest offset: -2
Latest offset: -1
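To make the replay mechanics concrete, here is a minimal sketch of replaying a topic from its earliest retained offset with the SimpleConsumer API. It assumes the 0.7/0.8-era Java API (kafka.javaapi.consumer.SimpleConsumer); the broker host, port, and topic name are illustrative.

import kafka.api.FetchRequest;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.javaapi.message.ByteBufferMessageSet;
import kafka.message.MessageAndOffset;

public class ReplayFromEarliest {
    public static void main(String[] args) {
        // SimpleConsumer(host, port, socketTimeoutMs, bufferSize)
        SimpleConsumer consumer = new SimpleConsumer("broker1", 9092, 30000, 64 * 1024);

        // Time -2 asks for the earliest offset the broker still retains, -1 for the latest
        long earliest = consumer.getOffsetsBefore("topic1", 0, -2L, 1)[0];

        // Fetch up to 1 MB starting from the earliest offset, i.e., replay from the beginning
        FetchRequest request = new FetchRequest("topic1", 0, earliest, 1024 * 1024);
        ByteBufferMessageSet messages = consumer.fetch(request);
        for (MessageAndOffset messageAndOffset : messages) {
            // process messageAndOffset.message(); save messageAndOffset.offset()
            // (e.g., in Zookeeper) to resume from this point next time
        }
        consumer.close();
    }
}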

Page 17: From a kafkaesque story to The Promised Land

Kafka API [3]

• Producer API

• Consumer API

  o High-level API: uses Zookeeper to access brokers and to save offsets

  o SimpleConsumer API: direct access to Kafka brokers

• Kafka-Spout, Camus, and KafkaHadoopConsumer all use SimpleConsumer

Page 18: From a kafkaesque story to The Promised Land

Kafka API [3]

• Producer

List<KeyedMessage<String, String>> messages = new ArrayList<KeyedMessage<String, String>>();
messages.add(new KeyedMessage<String, String>("topic1", null, msg1));
producer.send(messages);

• Consumer

List<KafkaStream<byte[], byte[]>> streams = consumer.createMessageStreams(topicCountMap).get("topic1");
for (MessageAndMetadata<byte[], byte[]> message : streams.get(0)) {
    // do something with message
}
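The slide omits the setup around these calls. Below is a minimal, self-contained sketch of both sides, assuming the Kafka 0.8 Java API; the broker address, Zookeeper address, group id, and topic name are illustrative.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.javaapi.producer.Producer;
import kafka.message.MessageAndMetadata;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KafkaApiExample {
    public static void main(String[] args) {
        // Producer: point it at the brokers and send a batch of messages
        Properties producerProps = new Properties();
        producerProps.put("metadata.broker.list", "broker1:9092");
        producerProps.put("serializer.class", "kafka.serializer.StringEncoder");
        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(producerProps));

        List<KeyedMessage<String, String>> messages = new ArrayList<KeyedMessage<String, String>>();
        messages.add(new KeyedMessage<String, String>("topic1", null, "msg1"));
        producer.send(messages);

        // High-level consumer: Zookeeper-based, one stream for topic1
        Properties consumerProps = new Properties();
        consumerProps.put("zookeeper.connect", "localhost:2181");
        consumerProps.put("group.id", "group1");
        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(consumerProps));

        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("topic1", 1));
        for (MessageAndMetadata<byte[], byte[]> message : streams.get("topic1").get(0)) {
            // do something with message.message()
        }
    }
}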

Page 19: From a kafkaesque story to The Promised Land

Kafka in Unit Testing

• Use the class KafkaServer

• Run an embedded server (a sketch follows below)
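A minimal sketch of what that can look like, assuming the 0.8-era API. KafkaServerStartable is a thin wrapper around KafkaServer that is easier to construct from Java; the port, log directory, and Zookeeper address are illustrative, and a Zookeeper instance must already be reachable.

import java.util.Properties;
import kafka.server.KafkaConfig;
import kafka.server.KafkaServerStartable;

public class EmbeddedKafka {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("broker.id", "0");
        props.put("port", "9092");
        props.put("log.dir", "/tmp/embedded-kafka-logs"); // scratch directory for this test broker
        props.put("zookeeper.connect", "localhost:2181");  // assumes a test Zookeeper is running

        KafkaServerStartable broker = new KafkaServerStartable(new KafkaConfig(props));
        broker.startup();
        // ... run test producers and consumers against localhost:9092 ...
        broker.shutdown();
        broker.awaitShutdown();
    }
}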

Page 20: From a kafkaesque story to The Promised Land

Introducing Avro [5]

• Schema representation using JSON

• Supported types

  o Primitive types: boolean, int, long, string, etc.

  o Complex types: Record, Enum, Union, Array, Map, Fixed

• Data is serialized using its schema (see the sketch below)

• Avro files include a file header containing the schema
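A minimal sketch of defining a schema in JSON and serializing one record with it, assuming Avro 1.7.x; the "Event" schema and its fields are illustrative, not LivePerson's actual data model.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
    // Schema represented in JSON: a record with two fields
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"sessionId\",\"type\":\"string\"},"
      + "{\"name\":\"timestamp\",\"type\":\"long\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericRecord event = new GenericData.Record(schema);
        event.put("sessionId", "102122");
        event.put("timestamp", 12346L);

        // Binary-encode the record using its schema
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(event, encoder);
        encoder.flush();
        System.out.println("Serialized " + out.size() + " bytes");
    }
}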

Page 21: From a kafkaesque story to The Promised Land

Add Avro protocol to the story

[Diagram] Producer side:
1. Register schema 1.0 in the Schema Repo
2. Create the message according to schema 1.0
3. Encode the message with schema 1.0
4. Add the schema revision (1.0) to the message header
5. Send the message to Kafka (Topic 1, Topic 2)

Consumer side (Camus/Storm):
1. Read the message from Kafka
2. Extract the header and obtain the schema version
3. Get schema 1.0 by version from the Schema Repo
4. Decode the message with schema 1.0

Kafka message layout: [header with schema revision 1.0 | Avro message]

Example event: {event1:{header:{sessionId:"102122", timestamp:"12346"}}}...
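A hedged sketch of that header framing: prepend a schema-revision marker to the Avro-encoded payload so consumers can look the schema up by version before decoding. The two-byte major/minor layout here is an assumption for illustration, not LivePerson's actual wire format.

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class VersionedMessage {
    // Producer side: wrap an Avro-encoded payload with a 2-byte schema revision (e.g., 1.0)
    static byte[] frame(byte majorVersion, byte minorVersion, byte[] avroPayload) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(majorVersion);
        out.write(minorVersion);
        out.write(avroPayload);
        return out.toByteArray();
    }

    // Consumer side: read the revision, then fetch the matching schema from the repo
    static String revisionOf(byte[] framed) {
        return framed[0] + "." + framed[1];
    }
}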

Page 22: From a kafkaesque story to The Promised Land

Kafka + Storm + Avro example

• Demonstrates Avro data passing from Kafka to Storm

• Explains Avro revision evolution

• Requires Kafka and Zookeeper installed

• Uses the Storm artifact and the Kafka-Spout artifact in Maven

• A Maven plugin generates Java classes from the Avro schema

• https://github.com/ransilberman/avro-kafka-storm (a bolt sketch follows below)
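To give a flavor of the Storm side, here is a minimal sketch of a bolt that decodes Avro bytes arriving from a Kafka spout. It assumes the 0.8-era backtype.storm API, that the spout emits the raw message bytes as the first tuple field, and the illustrative "Event"/"sessionId" schema from the Avro sketch above; see the repository for the real implementation.

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class AvroDecodeBolt extends BaseBasicBolt {
    private final String schemaJson;       // Schema is parsed lazily; the JSON string is serializable
    private transient Schema schema;

    public AvroDecodeBolt(String schemaJson) { this.schemaJson = schemaJson; }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        try {
            if (schema == null) schema = new Schema.Parser().parse(schemaJson);
            byte[] bytes = input.getBinary(0);  // raw Avro payload from the Kafka spout
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
            GenericRecord event = reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null));
            collector.emit(new Values(event.get("sessionId").toString()));
        } catch (Exception e) {
            throw new RuntimeException(e);  // fail the tuple so the spout can replay it
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sessionId"));
    }
}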

Page 23: From a kafkaesque story to The Promised Land

Resiliency

[Diagram: producer machine] The producer persists each message to a local file on disk and sends it to Kafka: messages go directly to the fast topic, while a Kafka bridge reads the local file and sends to the consistent topic. The real-time consumer (Storm) and the offline consumer (Hadoop) read from these topics. A sketch of the producer side follows below.
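A hedged sketch of the producer side of this pattern, under the assumption that the local file acts as a write-ahead journal that the bridge later forwards to the consistent topic; the file path, topic name, and error handling are illustrative, not LivePerson's actual implementation.

import java.io.FileWriter;
import java.io.IOException;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;

public class ResilientSender {
    private final FileWriter journal;               // local write-ahead file on the producer machine
    private final Producer<String, String> producer;

    public ResilientSender(Producer<String, String> producer) throws IOException {
        this.journal = new FileWriter("/var/spool/events.journal", true); // append mode
        this.producer = producer;
    }

    public void send(String event) throws IOException {
        journal.write(event + "\n");                // 1. persist the message to local disk
        journal.flush();
        try {
            // 2. best-effort send to the fast topic; the bridge replays the journal
            //    into the consistent topic, so a failure here loses nothing
            producer.send(new KeyedMessage<String, String>("fast-topic", event));
        } catch (RuntimeException ignored) {
            // Kafka unavailable; the bridge will deliver from the journal
        }
    }
}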

Page 24: From a kafkaesque story to The Promised Land

Challenges of Kafka

• Still not mature enough

• Not enough supporting tools (viewers, maintenance)

• Duplications may occur

• API not documented enough

• Open Source - support by community only

• Difficult to replay messages from a specific point in time

• Eventually Consistent...

Page 25: From a kafkaesque story to The Promised Land

Eventually Consistent

Because it is a distributed system:

• No guarantee of delivery order

• No way to tell to which broker a message is sent

• Kafka does not guarantee that there are no duplications

• ...but eventually, all messages will arrive!

[Illustration: an event crossing a desert from "Event generated" to "Event destination"]

Page 26: From a kafkaesque story to The Promised Land

Major Improvements in Kafka 0.8 [4]

• Partition replication

• Message send guarantees

• Consumer offsets are sequential numbers instead of byte positions (e.g., 1, 2, 3, ...)
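For the send guarantee, a small sketch of the producer-side knob, assuming the 0.8 API: request.required.acks controls how many broker acknowledgements a send waits for (0 = fire-and-forget, 1 = leader ack, -1 = all in-sync replicas). The broker list is illustrative.

import java.util.Properties;
import kafka.producer.ProducerConfig;

public class AckedProducerConfig {
    public static ProducerConfig create() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("request.required.acks", "1"); // wait for the partition leader to ack each send
        return new ProducerConfig(props);
    }
}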

Page 27: From a kafkaesque story to The Promised Land

Addressing Data Challenges

• High throughput

o Kafka, Hadoop

• Horizontal scale to address growth

o Kafka, Storm, Hadoop

• High availability of data services

o Kafka, Storm, Zookeeper

• No Data loss

o Highly Available services, No ETL

Page 28: From a kafkaesque story to The Promised Land

Addressing Data Challenges Cont.

• Satisfy Real-Time demands

o Storm

• Enforce structural data with schemas

o Avro

• Process Big Data and Enterprise Data

o Kafka, Hadoop

• Single Source of Truth (SSOT)

o Hadoop, No ETL

Page 29: From a kafkaesque story to The Promised Land

References

• [1] Satisfying New Requirements for Data Integration, by David Loshin

• [2] Apache Kafka

• [3] Kafka API

• [4] Kafka 0.8 Quick Start

• [5] Apache Avro

• [6] Storm

Page 30: From a kafkaesque story to The Promised Land

Thank you!
