Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka

Real-time Data Integration with Apache

Flink & Kafka @Bouygues Telecom

Mohamed Amine ABDESSEMED

Flink Forward, Berlin, October 2015

About Me

• Software engineer & solution architect @ Bouygues

Telecom

• My daily Toolbox

– Hadoop ecosystem

– Apache Flink

– Apache Kafka

– Elasticsearch

– Apache Camel

– And more.

• If I don’t see me coding, I am probably outside running

2

Outline

• Who we are

• Logged User eXperience

• Challenges

• Typical Data Flow pipeline on Hadoop

• Real-time Data Integration

• Apache Flink : The Elegant

• Data Integration use case

• What we loved using Flink

• What’s Next ?

3

BOUYGUES TELECOM

14M Clients

11,4M Mobile

subscriber

2,6M Fixed customer

A very Innovative company

Leader 4G/4G+/UHMD

First Android based TV BOX

4

Mobile . Fixed . TV . Internet . Cloud

BOUYGUES TELECOM

2135819137

14479

10595

BouyguesTelecom

Orange Free SFR

nPerf Mobile Data Networks Global score 2G/3G/4G (Q3-

2015)

Bouygues Telecom

Orange

Free

SFR

71% 72%

33%

53%

BouyguesTelecom

Orange Free SFR

Population covered in 4G

Bouygues Telecom Orange Free SFR

5

AT THE HUB OF OUR 14 MILLION CUSTOMERS' DIGITAL

LIVES

AND WE GIVE THEMGENUINEREASONS TO STAY LOYAL

LUX: Logged User eXperienceMobile QoE

• Produce Mobile QoE indicators from massive

network equipment’s event logs (4 Billions/day).

• Goals:

– QoE (User) instead of QoS (Machine).

– Real-time Diagnostic (<60sec. end-to-end latency)

– Business Intelligence

– Real-time alarming

– Reporting

7

LUX: Logged User eXperienceMobile QoE

log

s

log

s

log

s

log

s

log

s

log

s

log

s

log

s

8

Challenges

Challenges

1. Data movement

9

log

s

log

s

log

s

log

s

log

s

log

s

log

s

log

s

Challenges

1. Data movement

2. Data Processing

– Data is generally too raw to be used directly.

– How can we transform it ?

– How can we make the results available as soon as possible ?

10

Typical Data Flow pipeline on Hadoop

HADOOP

FTP

HTTP

SCOOP

FLUME

FS PUT

Client

1

Client

2

Client

...

Client

x

Impala

Hive

Hbase

TO

System 1

System 2

System ...

System x

Raw

Data

DFS

Enriched

Data

V1

DFS

Enriched

Data

V...

DFS

Enriched

Data

Vx

DFS

Data

source 1

Data

Source 2

Data

Source x

Data

Source ...

Batch Batch Batch

11

12

THE WORLD MOVES FAST

DATA MUST MOVE FASTER

Real-time Data Integration

• Inspired by Linkedin’s Kafka Data

Integration design pattern.

• Take all the data and it into a central log

repository for real-time subscription.

What if the data is too raw to

used, even binary encoded with

no visible business logic

information?13


Solution 01: Process Data before pushing it to Kafka

• Not viable:

– Data sources have limited computation resources dedicated to

log collection.

– Not scalable.

– Too hard to maintain.

We’ve to push it RAW.

14



• Solution 02: The consumers/Data subscribers have to

process Data before using it.

– Drawbacks:

• All the consumers must implement the same business logic, run it

against the same Data.

• Any changes/updates in the processing logic will require an update of

all the consumers.

Provide a useable Data format to each consumer/Data

subscriber.

15



Solution 02: The consumers/Data subscribers have to process Data before using it

Solution 03: Process Kafka’s raw Data and push it back in

decoded/enriched format for subscribers.

Benefits:

– The business logic will be implemented in one place.

– Resource efficient.

– Data subscribers can focus more on their own business logic.

– Simple handling of sources/clients evolution.

Challenges:

– Keep the Data moving real time.

– The Data processing pipeline must be very fast

16


Client 1

Client 2

Client...

Client x

HADOOP

Impala

Hive

Hbase

Kafka TOPIC RAW

Kafka TOPIC V1

Kafka TOPIC V….

Kafka TOPIC Vx

SYS

1TO

SYS

2

SYS

...

TO

Collector

App

Kafka

Producer

ARC

HIV

E

SYS

x

Enriched

Data

Vx

DFS

Raw

Data

DFS

Enriched

Data

V1

DFS

Enriched

Data

V...

DFS

Data

source 1

Data

Source 2

Data

Source x

Data

Source ...

Streaming

Streaming

Streaming

17

Real-time Data Integration with

Apache Kafka and Spark?

• Started a POC on Spark Streaming.

• Didn’t answer our needs:

– Poor back pressure handling

Jobs kept failing OOM on busy hours.

– Micro-batching & Latency.

– Many configuration parameters:

– The tested version used an HDFS WAL as fault tolerance

mechanism but this should be handled by Kafka.

18

Apache Flink : The Elegant

• True streaming, no more micro-batching !

• Nice back pressure handling.

• Fault tolerant, exactly once processing.

• High-Throughput.

• Scalable.

An open source platform for distributed stream and batch data processing.

• Rich functional APIs.

• (Almost) no constraints on serialization.

• Control of parallelism at all execution levels.

• Flexibility and ease of extension.

• And more nice stuffs.

19

IoT / Mobile Applications

20

Events occur on devices

Queue / Log

Events analyzed in a

data streaming

system

Stream Analysis

Events stored in a log

LUX: Logged User eXperience

LUX: Logged User eXperience

4 Billions

raw events/day

700 GB/day

(Raw Data / snappy

compressed)

100 Data

Sources

6 Main

Data Types

26 Kafka Topics

• 6 raw

• 20 enriched

CDH5 Cluster

2 Brokers Kafka

Cluster

20 DataNodes

750TB

22

23

Mobile CDR use case

Client 1

Client 2

Client...

Client x

HADOOP

CDR_BIN

CDR_DECODED

CDR_ENRICHED

CDR_ENRICHED_ELASTIC

Planck-

Collector

DECODED

CDR

HDFS

Elast icsearch

2 Weeks retention

Other IT Systems (

commercial, ...)

KPI

Views

HDFS

Each Machine

generates a binary file

every 5min or 2MB

Binary

Decoder

Common

CDR

Enricher

Lookup Live

Reference Data

Alarms/Live

Counters

Zabbix

Elast icsearch

Formater

15 min

Window

Counters

Historical

Data

ENRICHED

CDR

HDFS

BINARY

CDR

HDFS

REFERENCE

DATA

HDFS

K2ES

LOGSTASH

Kafka

Mirroring

Lookup Live

Reference Data

Network

equipment

Network

equipment

Network

equipment

Network

equipment

Network

equipment

Network

equipment

And it Rocks !

• We ran stress tests on our biggest raw Kafka topic:

– A day of Data.

– 2 Billions events (480Gib compressed).

– 10 Kafka partitions

– 10 Flink TaskManagers (Only 1GB Memory each)

Enrich Rate (in Tickets/second)Total Processing time (ms)

Kafka I/O Duration (ms)

24

And it Rocks !

• We ran stress tests on our biggest raw Kafka topic:

– A day of Data.

– 2 Billions events (480Gib compressed).

– 10 Kafka partitions

– 10 Flink TaskManagers (Only 1GB Memory each)

Total Processing time (ms)

Kafka I/O Duration (ms)

25

500 000 events/sec

1 day data

processed in 1h.

Enrich Rate (in Tickets/second)

Less than 200 ms

Processing Time

What we loved using

Flink/Notable features

• Development cost.

• Ease of testing & development.

–Works exactly the way you expect it to work.

–Local Execution mode.

• No more OOM.

• Efficient resource management.

• Excellent performance even with limited

resources.

26

What we loved using

Flink/Notable features

Viele danke Data-Artisans !

Merci Beaucoup

• True streaming from different sources including

Kafka

–Exactly-once, low-latency, High-throughput

stream processing.

• Yarn mode features:

–Yarn yarn.maximum-failed-containers

–Yarn detached-mode

27

What’s Next?

• Connect LUX to new sources.

• Use of JobManager High Availability

• Archive Data on HDFS using the new

filesystem sink.

• Index Elasticsearch Data using the new

Elasticsearch sink.

• Flink ML.

• Contributions to the Flink Project.

28

Questions

Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka

Technology

Transcript of Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka