Apache Kafka at trivago

2017-01-25, Munich, Germany
Clemens Valiente

Clemens Valiente
Senior Data Engineer, trivago Düsseldorf
Originally a mathematician. Studied at Uni Erlangen. At trivago for 5 years.
Email: [email protected]
de.linkedin.com/in/clemensvaliente

Collecting price information for BI

As a hotel price comparison engine, our most valuable information is hotel prices.

They are not only shown to our visitors to support their hotel booking decision, but also stored and later analyzed by Business Intelligence.

With over one million hotels and all major booking websites connected to our system, we have one of the most complete sources of information on hotel price development and trends.

The past: Data pipeline 2010 – 2015

[Diagram: prices flow from the Java backend (Java Software Engineering) into the BI Warehouse]


The past: Data pipeline 2010 – 2015: Facts & Figures

Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years

Restrictions
- Only single-night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day
- “Insert ignore”: the first price per key wins (see the sketch below)

Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline, in early 2015, on average around 100 million prices per day were written to BI
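The “insert ignore” rule is named after MySQL's INSERT IGNORE semantics: the first row per unique key is kept, later duplicates are silently dropped. As a minimal illustration of that first-price-wins behaviour (a sketch, not the actual trivago code; the key layout is assumed):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class FirstPriceWins {

    // Assumed key layout: hotel, booking website and arrival date
    // identify one price per day.
    private final ConcurrentMap<String, Long> pricesByKey = new ConcurrentHashMap<>();

    public void record(String hotel, String website, String arrivalDate, long priceCents) {
        String key = hotel + "|" + website + "|" + arrivalDate;
        // putIfAbsent keeps the first value per key and drops later ones,
        // mirroring INSERT IGNORE on a unique key.
        pricesByKey.putIfAbsent(key, priceCents);
    }
}
```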


Refactoring the pipeline: Requirements

• Scales with an arbitrary amount of data (future-proof)
• Reliable and resilient
• Low performance impact on the Java backend
• Long-term storage of raw input data
• Fast processing of filtered and aggregated data
• Open source
• We want to log everything:
  • More prices: length of stay, room type, breakfast info, room category, domain
  • With more information: net & gross price, city tax, resort fee, affiliate fee, VAT

Present data pipeline 2017 – ingestion

[Diagram: log ingestion across three data centres: San Francisco, Düsseldorf, Hong Kong]

Present data pipeline 2017 – processing

[Diagram: Camus copies messages from Kafka into the Hadoop cluster]

Present data pipeline 2017 – results after two years in production

• Very reliable, barely any downtime or service interruptions of the system
• Java team is very happy: less load on their system
• BI team is very happy: more data, more resources to process it
• Stakeholders are very happy:
  • Faster results
  • Better quality of results due to more data
  • More detailed results
  • => Shorter research phase, more and better stories
  • => Fewer requests & less workload for BI


Present data pipeline 2017 – facts & figures

Kafka cluster specifications
- Cluster of 5 machines in each data centre for logs
- An additional cluster of two machines in Düsseldorf for aggregation/stream processing

Data size (price log)
- Over 4 trillion messages collected so far
- 10 billion messages/day
- Over a hundred topics

Camus
- MapReduce application that writes prices to HDFS
- 15 mappers running in parallel
- Runs pretty much continuously, in 10-minute intervals
- To be replaced by Gobblin/Kafka Connect (see the sketch below)
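For the Kafka Connect replacement, an HDFS sink connector could take over Camus' job. A minimal sketch of such a connector config, along the lines of Confluent's HDFS sink; the topic name, namenode URL and sizes are placeholders, not our actual settings:

```
name=price-log-hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
# one task per former Camus mapper (assumed)
tasks.max=15
# hypothetical topic and namenode
topics=price-log
hdfs.url=hdfs://namenode:8020
# commit files at least every 10 minutes, matching the old Camus cadence
rotate.interval.ms=600000
flush.size=1000000
```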


Present data pipeline 2017 – use cases & status quo

Uses for price information
- Monitoring price parity in the hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors

Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.
- Every euro of revenue at some point was a message in Kafka

Status quo
- Our entire BI business logic runs on and through the Kafka–Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data

Ongoing Projects: Breaking up the Monolith

[Diagram: splitting the pipeline across data centres in Düsseldorf, Leipzig and Palma]

Key challenges and learnings

● Settle on a common message format (Avro/Protobuf, not CSV or JSON)

● A common message envelope is helpful (e.g. a header with timestamp and sender); see the envelope sketch below

● For stream processing, repeat your key in your message value

● Monitor your consumer offsets with an audit log, especially across data centres

● Turn off auto-creation of topics, but have a process in place for topic creation; see the topic-creation sketch below
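To make the envelope and key-repetition learnings concrete, here is a minimal Avro sketch in Java using Avro's SchemaBuilder. The record and field names are illustrative, not trivago's actual schema:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class MessageEnvelope {

    // Hypothetical envelope: a small header plus the payload. The Kafka
    // message key is repeated inside the value so stream processors and
    // HDFS consumers still see it without access to the record key.
    public static final Schema ENVELOPE = SchemaBuilder
            .record("MessageEnvelope").namespace("com.example.messaging")
            .fields()
            .requiredLong("timestamp")   // event time, set by the sender
            .requiredString("sender")    // producing service / data centre
            .requiredString("key")       // the Kafka message key, repeated
            .requiredBytes("payload")    // the serialized message body
            .endRecord();
}
```

And for topic creation with broker auto-creation disabled (auto.create.topics.enable=false), newer Kafka releases let a scripted process use the AdminClient; the partition and replication counts below are placeholders:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // placeholder bootstrap server
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions, replication factor 3: illustrative values only
            NewTopic topic = new NewTopic("price-log", 12, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```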


Thank you!

Questions and comments?

● Thanks to Jan Filipiak for the brainpower behind most of these projects

● Additional resources:

● https://github.com/trivago/gollum: an n:m message multiplexer written in Go

● https://github.com/trivago/triava: TriavaCache, a JSR-107 compliant cache