Kafka at trivago
Clemens Valiente
Senior Data Engineer, trivago Düsseldorf
Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years
Email: [email protected]
de.linkedin.com/in/clemensvaliente
Collecting price information for BI
As a hotel price comparison engine, our most valuable information is hotel prices.
They are not only shown to our visitors to support their hotel booking decision, but also stored and later analyzed by Business Intelligence.
With over one million hotels and all major booking websites connected to our system, we have one of the most complete sources of information on hotel price development and trends.
The past: Data pipeline 2010 – 2015
Facts & Figures
Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years
Restrictions
- Only single-night stays
- Only prices from European visitors
- Prices cached up to 30 minutes
- One price per hotel, website and arrival date per day
- “Insert ignore”: the first price per key wins
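The "insert ignore" rule can be illustrated with a small sketch. This is not the original pipeline code: the key layout (hotel, website, arrival date, collection day) follows the restriction listed above, while the function name, the ID values and the in-memory dict standing in for the BI store are hypothetical.

```python
from datetime import date

# Key: (hotel_id, website_id, arrival_date, collection_day).
# All identifiers and values here are illustrative assumptions.
PriceKey = tuple[int, int, date, date]


def insert_ignore(store: dict[PriceKey, float], key: PriceKey, price: float) -> bool:
    """Store the price only if the key is new; return True if it was stored."""
    if key in store:
        return False  # a price for this key already exists: the first one wins
    store[key] = price
    return True


store: dict[PriceKey, float] = {}
key = (4711, 12, date(2015, 3, 1), date(2015, 1, 15))

first = insert_ignore(store, key, 89.0)   # new key, so the price is stored
second = insert_ignore(store, key, 95.0)  # same key on the same day, so it is ignored
print(first, second, store[key])
```

The name matches MySQL's `INSERT IGNORE` statement, which skips rows that would violate a unique key. Whether the old pipeline actually used MySQL is not stated on the slide. One consequence of this rule is that later price changes for the same key on the same day are never recorded.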
Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline, in early 2015, on average around 100 million prices per day were written to BI
Refactoring the pipeline: Requirements
• Scales with an arbitrary amount of data (future proof)
• Reliable and resilient
• Low performance impact on Java backend
• Long-term storage of raw input data
• Fast processing of filtered and aggregated data
• Open source
• We want to log everything:
  • more prices: length of stay, room type, breakfast info, room category, domain
  • with more information: net & gross price, city tax, resort fee, affiliate fee, VAT
Present data pipeline 2017 – results after two years in production
• Very reliable, barely any downtime or service interruptions of the system
• Java team is very happy – less load on their system
• BI team is very happy – more data, more resources to process it
• Stakeholders are very happy
  • Faster results
  • Better quality of results due to more data
  • More detailed results
  • => Shorter research phase, more and better stories
  • => Fewer requests and less workload for BI
Present data pipeline 2017 – facts & figures
Kafka cluster specifications
- Cluster of 5 machines in each data centre for logs
- An additional cluster of two machines in Düsseldorf for aggregation/stream processing
Data size (price log)
- Over 4 trillion messages collected so far
- 10 billion messages/day
- Over a hundred topics
Camus
- MapReduce application that writes prices to HDFS
- 15 mappers running in parallel
- Pretty much continuously, in 10-minute intervals
- To be replaced by Gobblin/Kafka Connect
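To put the figures from the old and the new pipeline side by side, the following back-of-the-envelope calculation may help. All inputs are the numbers quoted on the slides. The derived rates are rough estimates only, and the per-mapper figure rests on the assumptions that every price-log message passes through Camus and that the load is spread evenly across runs and mappers.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slides
old_total_prices = 56e9       # collected 2010-2015, over five years
old_prices_per_day = 100e6    # average towards the end, early 2015
new_total_messages = 4e12     # "over 4 trillion" collected so far
new_messages_per_day = 10e9
camus_mappers = 15
camus_interval_minutes = 10

# Old pipeline: long-run daily average over five years
old_average_per_day = old_total_prices / (5 * 365)

# New pipeline compared with the end of the old one
growth_factor = new_messages_per_day / old_prices_per_day
messages_per_second = new_messages_per_day / SECONDS_PER_DAY
days_at_current_rate = new_total_messages / new_messages_per_day

# Assumption: all price-log messages go through Camus, spread evenly
runs_per_day = 24 * 60 // camus_interval_minutes
messages_per_run = new_messages_per_day / runs_per_day
messages_per_mapper_per_run = messages_per_run / camus_mappers

print(f"old average per day:         {old_average_per_day:,.0f}")
print(f"growth factor:               {growth_factor:,.0f}x")
print(f"messages per second:         {messages_per_second:,.0f}")
print(f"days at current rate:        {days_at_current_rate:,.0f}")
print(f"messages per mapper per run: {messages_per_mapper_per_run:,.0f}")
```

The "days at current rate" value only says how long the current daily volume would take to reach the collected total. The pipeline had been in production for about two years, so the daily volume was presumably lower earlier on.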
Present data pipeline 2017 – use cases & status quo
Uses for price information
- Monitoring price parity in hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivering price alerts to website visitors
Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.
- Every Euro of revenue at some point was a message in Kafka
Status quo
- Our entire BI business logic runs on and through the Kafka – Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data
Key challenges and learnings
● Settle on a common message format (Avro/Protobuf, not CSV or JSON)
● A common message envelope is helpful (e.g. a header with timestamp and sender)
● For stream processing, repeat your key in your message value
● Monitor your consumer offsets with an audit log, especially across data centres
● Turn off auto-creation of topics, but have a process in place for topic creation
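The first three points can be sketched together. The example below is an illustration only, not trivago's schema: the record name, field names and sender value are hypothetical. It shows an Avro-style record schema (Avro schemas are JSON documents, written here as a Python dict) with a header envelope, and a message whose value repeats the fields that make up the key. Actual Avro serialization needs an Avro library and is left out to keep the sketch standard-library only.

```python
import time

# Avro-style record schema with a common envelope ("header").
# Record and field names are illustrative assumptions.
PRICE_EVENT_SCHEMA = {
    "type": "record",
    "name": "PriceEvent",
    "fields": [
        {
            "name": "header",
            "type": {
                "type": "record",
                "name": "Header",
                "fields": [
                    {"name": "timestamp_ms", "type": "long"},
                    {"name": "sender", "type": "string"},
                ],
            },
        },
        # The key fields are deliberately repeated in the value.
        {"name": "hotel_id", "type": "long"},
        {"name": "website_id", "type": "long"},
        {"name": "price", "type": "double"},
    ],
}


def build_message(hotel_id, website_id, price, sender, now_ms=None):
    """Return a (key, value) pair for one price event."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    key = f"{hotel_id}:{website_id}"
    value = {
        "header": {"timestamp_ms": now_ms, "sender": sender},
        "hotel_id": hotel_id,      # repeated from the key
        "website_id": website_id,  # repeated from the key
        "price": price,
    }
    return key, value


key, value = build_message(
    4711, 12, 89.0, sender="price-backend", now_ms=1_500_000_000_000
)
value_field_names = {field["name"] for field in PRICE_EVENT_SCHEMA["fields"]}
print(key, value)
```

One plausible benefit of repeating the key in the value is that a processing step which only receives or stores the value still has the identifying fields at hand, for instance after the stream has been re-keyed.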
Thank you!
Questions and comments?
● Thanks to Jan Filipiak for his brainpower behind most projects
● Additional resources:
  ● https://github.com/trivago/gollum (an n:m message multiplexer written in Go)
  ● https://github.com/trivago/triava (TriavaCache, a JSR107-compliant cache)