Post on 08-Jan-2017
Real-time Data Integration with Apache
Flink & Kafka @Bouygues Telecom
Mohamed Amine ABDESSEMED
Flink Forward, Berlin, October 2015
About Me
• Software engineer & solution architect @ Bouygues
Telecom
• My daily Toolbox
– Hadoop ecosystem
– Apache Flink
– Apache Kafka
– Elasticsearch
– Apache Camel
– And more.
• If you don’t see me coding, I am probably outside running
Outline
• Who we are
• Logged User eXperience
• Challenges
• Typical Data Flow pipeline on Hadoop
• Real-time Data Integration
• Apache Flink: The Elegant
• Data Integration use case
• What we loved using Flink
• What’s Next?
BOUYGUES TELECOM
• 14M clients:
– 11.4M mobile subscribers
– 2.6M fixed customers
• A very innovative company:
– Leader in 4G/4G+/UHMD
– First Android-based TV box
Mobile . Fixed . TV . Internet . Cloud
BOUYGUES TELECOM
[Chart: nPerf Mobile Data Networks global score 2G/3G/4G (Q3-2015): Bouygues Telecom, Orange, Free, SFR]
[Chart: population covered in 4G: Bouygues Telecom, Orange, Free, SFR]
AT THE HUB OF OUR 14 MILLION CUSTOMERS' DIGITAL LIVES,
AND WE GIVE THEM GENUINE REASONS TO STAY LOYAL
LUX: Logged User eXperience (Mobile QoE)
• Produce mobile QoE indicators from massive network equipment event logs (4 billion/day).
• Goals:
– QoE (user) instead of QoS (machine).
– Real-time diagnostic (<60 sec. end-to-end latency).
– Business Intelligence.
– Real-time alarming.
– Reporting.
LUX: Logged User eXperience (Mobile QoE)
[Diagram: event logs flowing from many network equipment sources into the LUX platform]
Challenges
1. Data movement
[Diagram: log files produced by many distributed sources must be moved into the platform]
Challenges
1. Data movement
2. Data Processing
– Data is generally too raw to be used directly.
– How can we transform it?
– How can we make the results available as soon as possible?
Typical Data Flow pipeline on Hadoop
[Diagram: Data sources 1..x feed clients 1..x, which load data into Hadoop via FTP, HTTP, Sqoop, Flume, or FS PUT; successive batch jobs transform raw Data on DFS into enriched Data sets V1..Vx, served to downstream systems through Impala, Hive, and HBase]
THE WORLD MOVES FAST
DATA MUST MOVE FASTER
Real-time Data Integration
• Inspired by LinkedIn’s Kafka Data Integration design pattern.
• Take all the data and put it into a central log repository for real-time subscription.
What if the data is too raw to be used, even binary-encoded with no visible business logic information?
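The central-log pattern above can be sketched with a toy append-only log in which producers push raw records and each subscriber tracks its own read position, exactly as Kafka topics and consumer offsets work. Everything here is illustrative, not Bouygues' actual code.

```python
# Minimal sketch of a Kafka-style central log: append-only storage,
# independent subscribers, each advancing its own offset.

class Log:
    """Append-only log; consumers keep their own offsets."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        return self.records[offset:]

# Producers push *raw* events; subscribers consume at their own pace.
log = Log()
log.append(b"\x01\x02raw-binary-event")
log.append(b"\x03\x04raw-binary-event")

offset = 0                      # one subscriber's position in the log
batch = log.read_from(offset)   # fetch everything new
offset += len(batch)            # commit the new position
```

Because reads never mutate the log, any number of subscribers can consume the same raw data independently, which is what makes the "raw data in one place" approach workable.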
Real-time Data Integration
Solution 01: Process Data before pushing it to Kafka
• Not viable:
– Data sources have limited computation resources dedicated to
log collection.
– Not scalable.
– Too hard to maintain.
We have to push it RAW.
Real-time Data Integration
Solution 01: Process Data before pushing it to Kafka
• Solution 02: The consumers/Data subscribers have to
process Data before using it.
– Drawbacks:
• All the consumers must implement the same business logic, run it
against the same Data.
• Any changes/updates in the processing logic will require an update of
all the consumers.
Provide a usable Data format to each consumer/Data subscriber.
Real-time Data Integration
Solution 01: Process Data before pushing it to Kafka
Solution 02: The consumers/Data subscribers have to process Data before using it
Solution 03: Process Kafka’s raw Data and push it back in
decoded/enriched format for subscribers.
Benefits:
– The business logic will be implemented in one place.
– Resource efficient.
– Data subscribers can focus more on their own business logic.
– Simple handling of sources/clients evolution.
Challenges:
– Keep the Data moving in real time.
– The Data processing pipeline must be very fast.
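Solution 03 boils down to one job that consumes the raw topic, applies the shared business logic (decode, then enrich) exactly once in the pipeline, and republishes the result to a topic every subscriber can use directly. A minimal sketch of that shape, with hypothetical topic contents and lookup data:

```python
# Sketch of Solution 03: decode + enrich once, republish for everyone.
# The JSON payloads and the REFERENCE lookup table are illustrative
# stand-ins for the real binary format and live reference data.

import json

REFERENCE = {"cell-42": "Paris"}      # hypothetical reference data

def decode(raw: bytes) -> dict:
    # stand-in for the real binary decoder
    return json.loads(raw.decode("utf-8"))

def enrich(event: dict) -> dict:
    # the shared business logic lives here, in one place
    event["region"] = REFERENCE.get(event["cell"], "unknown")
    return event

raw_topic = [b'{"user": "a", "cell": "cell-42"}']
enriched_topic = [enrich(decode(r)) for r in raw_topic]
```

When the decoding rules change, only this one job is redeployed; none of the downstream subscribers need to know.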
Real-time Data Integration
[Diagram: Data sources feed clients 1..x; a Collector App with a Kafka Producer pushes raw Data to Kafka TOPIC RAW; streaming jobs decode/enrich it and push it back to Kafka TOPICs V1..Vx; subscriber systems 1..x and an archiver consume the topics, and Hadoop (Impala, Hive, HBase) stores the raw and enriched Data sets on DFS]
Real-time Data Integration with
Apache Kafka and Spark?
• Started a POC on Spark Streaming.
• It didn’t answer our needs:
– Poor back pressure handling: jobs kept failing with OOM during busy hours.
– Micro-batching & latency.
– Many configuration parameters to tune.
– The tested version used an HDFS WAL as its fault-tolerance mechanism, but this should be handled by Kafka.
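The OOM failures above are what missing back pressure looks like: a fast producer fills an unbounded buffer until memory runs out. With a bounded buffer, the producer simply blocks until the consumer catches up. A self-contained sketch of the idea (not Spark or Flink internals):

```python
# Back pressure in miniature: a bounded queue makes a fast producer
# wait for a slow consumer instead of exhausting memory.

import queue
import threading

buf = queue.Queue(maxsize=10)   # bounded: put() blocks when full

def producer():
    for i in range(100):
        buf.put(i)              # blocks here instead of growing unboundedly
    buf.put(None)               # end-of-stream marker

def consumer(out):
    while (item := buf.get()) is not None:
        out.append(item)

out = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(out,))
t1.start(); t2.start()
t1.join(); t2.join()
```

Flink gets this behavior naturally because its network buffers between operators are bounded, so pressure propagates upstream to the source.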
Apache Flink: The Elegant
An open source platform for distributed stream and batch data processing.
• True streaming, no more micro-batching!
• Nice back pressure handling.
• Fault-tolerant, exactly-once processing.
• High throughput.
• Scalable.
• Rich functional APIs.
• (Almost) no constraints on serialization.
• Control of parallelism at all execution levels.
• Flexibility and ease of extension.
• And more nice stuff.
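The "fault tolerant, exactly once" claim relies on a replayable source like Kafka: periodically checkpoint the source offset together with the operator state, and on failure restore both and replay from the saved offset. This is a conceptual sketch of that idea only, not Flink's actual checkpointing implementation (which uses distributed barrier snapshots):

```python
# Exactly-once via checkpointing, in miniature: snapshot (offset, state)
# atomically; after a crash, restore both and replay from the offset.

source = [3, 1, 4, 1, 5, 9, 2, 6]   # a replayable "topic"

def run(events, start_offset, state, checkpoint_every=3):
    checkpoint = (start_offset, state)
    for i, value in enumerate(events[start_offset:], start=start_offset):
        state += value                   # operator state: a running sum
        if (i + 1) % checkpoint_every == 0:
            checkpoint = (i + 1, state)  # offset + state saved together
    return state, checkpoint

# First run "crashes" after 5 events; the last checkpoint was at offset 3.
_, (offset, state) = run(source[:5], 0, 0)

# Recovery: resume from the checkpointed offset with checkpointed state.
total, _ = run(source, offset, state)    # every event counted exactly once
```

The key point is that offset and state are saved as one unit; saving them separately would re-count or drop the events in between.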
IoT / Mobile Applications
[Diagram: events occur on devices; events are stored in a queue/log; events are analyzed in a data streaming system (stream analysis)]
LUX: Logged User eXperience
• 4 billion raw events/day
• 700 GB/day (raw Data, snappy compressed)
• 100 Data sources
• 6 main Data types
• 26 Kafka topics: 6 raw, 20 enriched
• CDH5 cluster: 20 DataNodes, 750 TB
• 2-broker Kafka cluster
Mobile CDR use case
[Diagram: each network equipment machine generates a binary CDR file every 5 min or 2 MB; the Planck-Collector pushes these files to the CDR_BIN Kafka topic; a streaming pipeline runs a Binary Decoder (output: CDR_DECODED) and a Common CDR Enricher with lookups against live reference Data (output: CDR_ENRICHED); 15-min window counters feed alarms/live counters in Zabbix; an Elasticsearch formatter produces CDR_ENRICHED_ELASTIC, indexed via K2ES/Logstash into Elasticsearch with 2 weeks retention; binary, decoded, enriched, and reference Data are archived on HDFS as historical Data with KPI views; Kafka mirroring and other IT systems (commercial, ...) also consume the topics]
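The "15-min window counters" stage in the pipeline assigns each event to a tumbling window by timestamp and counts per window and key. A minimal sketch of the windowing arithmetic, with illustrative timestamps and event kinds:

```python
# Tumbling-window counting: each event falls into exactly one window
# of fixed length, determined purely by its timestamp.

from collections import Counter

WINDOW = 15 * 60  # 15 minutes, in seconds

def window_start(ts: int) -> int:
    """Start of the tumbling window containing timestamp ts."""
    return ts - (ts % WINDOW)

# (timestamp_seconds, event_kind) pairs; values are illustrative
events = [(10, "call"), (899, "call"), (900, "sms"), (1805, "call")]

counts = Counter((window_start(ts), kind) for ts, kind in events)
# window [0, 900): 2 calls; [900, 1800): 1 sms; [1800, 2700): 1 call
```

In the real job Flink's window operators handle this, including out-of-order events; the modular arithmetic for window assignment is the same.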
And it Rocks!
• We ran stress tests on our biggest raw Kafka topic:
– A day of Data.
– 2 billion events (480 GiB compressed).
– 10 Kafka partitions.
– 10 Flink TaskManagers (only 1 GB memory each).
[Charts: enrich rate (in tickets/second), total processing time (ms), Kafka I/O duration (ms)]
And it Rocks!
• Results:
– 500,000 events/sec.
– 1 day of data processed in 1 hour.
– Less than 200 ms processing time.
[Charts: enrich rate (in tickets/second), total processing time (ms), Kafka I/O duration (ms)]
What we loved using Flink / Notable features
• Development cost.
• Ease of testing & development:
– Works exactly the way you expect it to work.
– Local Execution mode.
• No more OOM.
• Efficient resource management.
• Excellent performance even with limited resources.
What we loved using Flink / Notable features
• True streaming from different sources, including Kafka:
– Exactly-once, low-latency, high-throughput stream processing.
• YARN mode features:
– yarn.maximum-failed-containers
– Detached mode.
Many thanks, data Artisans! Thank you very much.
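The two YARN features named above can be used roughly as sketched below; the key and flag names are from the Flink 0.9/0.10 era of this talk, so check the documentation for your version:

```yaml
# flink-conf.yaml: tolerate up to N failed YARN containers before
# the YARN session gives up entirely
yarn.maximum-failed-containers: 10

# Detached mode is a CLI flag rather than a config key, e.g.:
#   ./bin/yarn-session.sh -n 10 -d
```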
What’s Next?
• Connect LUX to new sources.
• Use of JobManager High Availability.
• Archive Data on HDFS using the new filesystem sink.
• Index Data into Elasticsearch using the new Elasticsearch sink.
• Flink ML.
• Contributions to the Flink Project.
Questions