Post on 08-Jan-2017
Real-time Data Integration with Apache
Flink & Kafka @Bouygues Telecom
Mohamed Amine ABDESSEMED
Flink Forward, Berlin, October 2015
About Me
• Software engineer & solution architect @ Bouygues
Telecom
• My daily Toolbox
– Hadoop ecosystem
– Apache Flink
– Apache Kafka
– Elasticsearch
– Apache Camel
– And more.
• If you don’t see me coding, I am probably outside running
Outline
• Who we are
• Logged User eXperience
• Challenges
• Typical Data Flow pipeline on Hadoop
• Real-time Data Integration
• Apache Flink: The Elegant
• Data Integration use case
• What we loved using Flink
• What’s Next?
BOUYGUES TELECOM
• 14M clients:
– 11.4M mobile subscribers
– 2.6M fixed customers
• A very innovative company:
– Leader in 4G/4G+/UHMD
– First Android-based TV box
Mobile . Fixed . TV . Internet . Cloud
BOUYGUES TELECOM
[Chart: nPerf Mobile Data Networks global score 2G/3G/4G (Q3-2015): Bouygues Telecom, Orange, Free, SFR]
[Chart: population covered in 4G: Bouygues Telecom, Orange, Free, SFR]
AT THE HUB OF OUR 14 MILLION CUSTOMERS' DIGITAL LIVES,
AND WE GIVE THEM GENUINE REASONS TO STAY LOYAL
LUX: Logged User eXperience (Mobile QoE)
• Produce mobile QoE indicators from massive network equipment event logs (4 billion/day).
• Goals:
– QoE (user) instead of QoS (machine).
– Real-time diagnostic (<60 sec. end-to-end latency).
– Business Intelligence.
– Real-time alarming.
– Reporting.
LUX: Logged User eXperience (Mobile QoE)
[Diagram: event logs flowing from many network equipment sources into the LUX platform]
Challenges
1. Data movement
[Diagram: log files produced by many distributed sources must be moved into the platform]
Challenges
1. Data movement
2. Data Processing
– Data is generally too raw to be used directly.
– How can we transform it?
– How can we make the results available as soon as possible?
Typical Data Flow pipeline on Hadoop
[Diagram: Data sources 1..x feed clients 1..x, which load data into Hadoop via FTP, HTTP, Sqoop, Flume, or FS PUT; successive batch jobs transform raw Data on DFS into enriched Data sets V1..Vx, served to downstream systems through Impala, Hive, and HBase]
THE WORLD MOVES FAST
DATA MUST MOVE FASTER
Real-time Data Integration
• Inspired by LinkedIn’s Kafka Data Integration design pattern.
• Take all the data and put it into a central log repository for real-time subscription.
What if the data is too raw to be used, even binary-encoded with no visible business logic information?
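The central-log pattern above can be sketched with a toy append-only log in which producers push raw records and each subscriber tracks its own read position, exactly as Kafka topics and consumer offsets work. Everything here is illustrative, not Bouygues' actual code.

```python
# Minimal sketch of a Kafka-style central log: append-only storage,
# independent subscribers, each advancing its own offset.

class Log:
    """Append-only log; consumers keep their own offsets."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        return self.records[offset:]

# Producers push *raw* events; subscribers consume at their own pace.
log = Log()
log.append(b"\x01\x02raw-binary-event")
log.append(b"\x03\x04raw-binary-event")

offset = 0                      # one subscriber's position in the log
batch = log.read_from(offset)   # fetch everything new
offset += len(batch)            # commit the new position
```

Because reads never mutate the log, any number of subscribers can consume the same raw data independently, which is what makes the "raw data in one place" approach workable.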
Real-time Data Integration
Solution 01: Process Data before pushing it to Kafka
• Not viable:
– Data sources have limited computation resources dedicated to
log collection.
– Not scalable.
– Too hard to maintain.
We have to push it RAW.
Real-time Data Integration
Solution 01: Process Data before pushing it to Kafka
• Solution 02: The consumers/Data subscribers have to
process Data before using it.
– Drawbacks:
• All the consumers must implement the same business logic, run it
against the same Data.
• Any changes/updates in the processing logic will require an update of
all the consumers.
Provide a usable Data format to each consumer/Data subscriber.
Real-time Data Integration
Solution 01: Process Data before pushing it to Kafka
Solution 02: The consumers/Data subscribers have to process Data before using it
Solution 03: Process Kafka’s raw Data and push it back in
decoded/enriched format for subscribers.
Benefits:
– The business logic will be implemented in one place.
– Resource efficient.
– Data subscribers can focus more on their own business logic.
– Simple handling of sources/clients evolution.
Challenges:
– Keep the Data moving in real time.
– The Data processing pipeline must be very fast.
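Solution 03 boils down to one job that consumes the raw topic, applies the shared business logic (decode, then enrich) exactly once in the pipeline, and republishes the result to a topic every subscriber can use directly. A minimal sketch of that shape, with hypothetical topic contents and lookup data:

```python
# Sketch of Solution 03: decode + enrich once, republish for everyone.
# The JSON payloads and the REFERENCE lookup table are illustrative
# stand-ins for the real binary format and live reference data.

import json

REFERENCE = {"cell-42": "Paris"}      # hypothetical reference data

def decode(raw: bytes) -> dict:
    # stand-in for the real binary decoder
    return json.loads(raw.decode("utf-8"))

def enrich(event: dict) -> dict:
    # the shared business logic lives here, in one place
    event["region"] = REFERENCE.get(event["cell"], "unknown")
    return event

raw_topic = [b'{"user": "a", "cell": "cell-42"}']
enriched_topic = [enrich(decode(r)) for r in raw_topic]
```

When the decoding rules change, only this one job is redeployed; none of the downstream subscribers need to know.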
Real-time Data Integration
[Diagram: Data sources feed clients 1..x; a Collector App with a Kafka Producer pushes raw Data to Kafka TOPIC RAW; streaming jobs decode/enrich it and push it back to Kafka TOPICs V1..Vx; subscriber systems 1..x and an archiver consume the topics, and Hadoop (Impala, Hive, HBase) stores the raw and enriched Data sets on DFS]
Real-time Data Integration with
Apache Kafka and Spark?
• Started a POC on Spark Streaming.
• It didn’t answer our needs:
– Poor back pressure handling: jobs kept failing with OOM during busy hours.
– Micro-batching & latency.
– Many configuration parameters to tune.
– The tested version used an HDFS WAL as its fault-tolerance mechanism, but this should be handled by Kafka.
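The OOM failures above are what missing back pressure looks like: a fast producer fills an unbounded buffer until memory runs out. With a bounded buffer, the producer simply blocks until the consumer catches up. A self-contained sketch of the idea (not Spark or Flink internals):

```python
# Back pressure in miniature: a bounded queue makes a fast producer
# wait for a slow consumer instead of exhausting memory.

import queue
import threading

buf = queue.Queue(maxsize=10)   # bounded: put() blocks when full

def producer():
    for i in range(100):
        buf.put(i)              # blocks here instead of growing unboundedly
    buf.put(None)               # end-of-stream marker

def consumer(out):
    while (item := buf.get()) is not None:
        out.append(item)

out = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(out,))
t1.start(); t2.start()
t1.join(); t2.join()
```

Flink gets this behavior naturally because its network buffers between operators are bounded, so pressure propagates upstream to the source.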
Apache Flink: The Elegant
An open source platform for distributed stream and batch data processing.
• True streaming, no more micro-batching!
• Nice back pressure handling.
• Fault-tolerant, exactly-once processing.
• High throughput.
• Scalable.
• Rich functional APIs.
• (Almost) no constraints on serialization.
• Control of parallelism at all execution levels.
• Flexibility and ease of extension.
• And more nice stuff.
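The "fault tolerant, exactly once" claim relies on a replayable source like Kafka: periodically checkpoint the source offset together with the operator state, and on failure restore both and replay from the saved offset. This is a conceptual sketch of that idea only, not Flink's actual checkpointing implementation (which uses distributed barrier snapshots):

```python
# Exactly-once via checkpointing, in miniature: snapshot (offset, state)
# atomically; after a crash, restore both and replay from the offset.

source = [3, 1, 4, 1, 5, 9, 2, 6]   # a replayable "topic"

def run(events, start_offset, state, checkpoint_every=3):
    checkpoint = (start_offset, state)
    for i, value in enumerate(events[start_offset:], start=start_offset):
        state += value                   # operator state: a running sum
        if (i + 1) % checkpoint_every == 0:
            checkpoint = (i + 1, state)  # offset + state saved together
    return state, checkpoint

# First run "crashes" after 5 events; the last checkpoint was at offset 3.
_, (offset, state) = run(source[:5], 0, 0)

# Recovery: resume from the checkpointed offset with checkpointed state.
total, _ = run(source, offset, state)    # every event counted exactly once
```

The key point is that offset and state are saved as one unit; saving them separately would re-count or drop the events in between.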
IoT / Mobile Applications
[Diagram: events occur on devices; events are stored in a queue/log; events are analyzed in a data streaming system (stream analysis)]
LUX: Logged User eXperience
• 4 billion raw events/day
• 700 GB/day (raw Data, snappy compressed)
• 100 Data sources
• 6 main Data types
• 26 Kafka topics: 6 raw, 20 enriched
• CDH5 cluster: 20 DataNodes, 750 TB
• 2-broker Kafka cluster
Mobile CDR use case
[Diagram: each network equipment machine generates a binary CDR file every 5 min or 2 MB; the Planck-Collector pushes these files to the CDR_BIN Kafka topic; a streaming pipeline runs a Binary Decoder (output: CDR_DECODED) and a Common CDR Enricher with lookups against live reference Data (output: CDR_ENRICHED); 15-min window counters feed alarms/live counters in Zabbix; an Elasticsearch formatter produces CDR_ENRICHED_ELASTIC, indexed via K2ES/Logstash into Elasticsearch with 2 weeks retention; binary, decoded, enriched, and reference Data are archived on HDFS as historical Data with KPI views; Kafka mirroring and other IT systems (commercial, ...) also consume the topics]
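The "15-min window counters" stage in the pipeline assigns each event to a tumbling window by timestamp and counts per window and key. A minimal sketch of the windowing arithmetic, with illustrative timestamps and event kinds:

```python
# Tumbling-window counting: each event falls into exactly one window
# of fixed length, determined purely by its timestamp.

from collections import Counter

WINDOW = 15 * 60  # 15 minutes, in seconds

def window_start(ts: int) -> int:
    """Start of the tumbling window containing timestamp ts."""
    return ts - (ts % WINDOW)

# (timestamp_seconds, event_kind) pairs; values are illustrative
events = [(10, "call"), (899, "call"), (900, "sms"), (1805, "call")]

counts = Counter((window_start(ts), kind) for ts, kind in events)
# window [0, 900): 2 calls; [900, 1800): 1 sms; [1800, 2700): 1 call
```

In the real job Flink's window operators handle this, including out-of-order events; the modular arithmetic for window assignment is the same.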
And it Rocks!
• We ran stress tests on our biggest raw Kafka topic:
– A day of Data.
– 2 billion events (480 GiB compressed).
– 10 Kafka partitions.
– 10 Flink TaskManagers (only 1 GB memory each).
[Charts: enrich rate (in tickets/second), total processing time (ms), Kafka I/O duration (ms)]
And it Rocks!
• Results:
– 500,000 events/sec.
– 1 day of data processed in 1 hour.
– Less than 200 ms processing time.
[Charts: enrich rate (in tickets/second), total processing time (ms), Kafka I/O duration (ms)]
What we loved using Flink / Notable features
• Development cost.
• Ease of testing & development:
– Works exactly the way you expect it to work.
– Local Execution mode.
• No more OOM.
• Efficient resource management.
• Excellent performance even with limited resources.
What we loved using Flink / Notable features
• True streaming from different sources, including Kafka:
– Exactly-once, low-latency, high-throughput stream processing.
• YARN mode features:
– yarn.maximum-failed-containers
– Detached mode.
Many thanks, data Artisans! Thank you very much.
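The two YARN features named above can be used roughly as sketched below; the key and flag names are from the Flink 0.9/0.10 era of this talk, so check the documentation for your version:

```yaml
# flink-conf.yaml: tolerate up to N failed YARN containers before
# the YARN session gives up entirely
yarn.maximum-failed-containers: 10

# Detached mode is a CLI flag rather than a config key, e.g.:
#   ./bin/yarn-session.sh -n 10 -d
```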
What’s Next?
• Connect LUX to new sources.
• Use of JobManager High Availability.
• Archive Data on HDFS using the new filesystem sink.
• Index Data into Elasticsearch using the new Elasticsearch sink.
• Flink ML.
• Contributions to the Flink Project.
Questions