Introduction to Apache Kafka

About me

- My name is Jonathan Winandy (@ahoy_jon).

- I am a data pipeline engineer:

- I worked on a “DataLake”!

- I use tools in the larger Java ecosystem, like Java, Scala, Clojure, Hadoop …

- And I am an “entrepreneur”.

> Introduction

I cofounded

We do health care oriented software engineering.

We provide:

- Coordination for health care professionals.
- “Big health care Data” pipelines.

> Introduction

So let’s talk about Streams

What is a Stream? It’s an abstract data structure with the following:

operations:
• append(bytes) -> void?
• readAt(int) -> null | bytes

rule 1: ∀p ∈ ℕ, for some definition of ‘==’, and for any two successive reads x := readAt(p), y := readAt(p):

x != null => x == y

Rule 1 implies: infinite cacheability once the data is available at a position.
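The definition above can be sketched in a few lines of Python. This is only a toy in-memory model, not how Kafka stores data: the names Stream, CachingReader and read_at are illustrative assumptions, and ‘==’ is taken to be plain byte equality.

```python
from typing import Dict, List, Optional


class Stream:
    """Toy in-memory stream: append-only, written positions never change."""

    def __init__(self) -> None:
        self._entries: List[bytes] = []

    def append(self, data: bytes) -> None:
        # Appending is the only mutation; existing positions are never rewritten.
        self._entries.append(bytes(data))

    def read_at(self, position: int) -> Optional[bytes]:
        # Returns None while nothing has been written at this position yet.
        if 0 <= position < len(self._entries):
            return self._entries[position]
        return None


class CachingReader:
    """Caches every non-null read forever, which rule 1 makes safe."""

    def __init__(self, stream: Stream) -> None:
        self._stream = stream
        self._cache: Dict[int, bytes] = {}

    def read_at(self, position: int) -> Optional[bytes]:
        if position in self._cache:
            return self._cache[position]
        value = self._stream.read_at(position)
        if value is not None:
            # Rule 1: a non-null value at a position never changes,
            # so this cache entry never needs to be invalidated.
            self._cache[position] = value
        return value
```

Only null results are left uncached, because a position that is empty now may be filled by a later append.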

> Theory

Streams are the simplest way to manage data.

And they are naturally compatible with the perception of information from a singular observer …

[Figure: stream positions 0 1 2 3 4 5 6]

> Theory

And we have known that since the 15th century …

So what happened ?

Memorial Journal Ledger

> History


> Dist systems

@aphyr : Jepsen II Linearizable Boogaloo

Distributed Systems Are HARD

https://www.youtube.com/watch?v=ggCffvKEJmQ

Peter Alvaro - Outwards from the Middle of the Maze

> Dist systems


=> Idempotence

> Summary

The need for a unified log arises ‘quickly’ in apps that manage state (or multiple states) when they need to do:

- Business Intelligence,
- Notifications,
- Advanced search (secondary indexing),
- ….

But there is a lot of legacy in projects and practices, and this technique has been regularly “forgotten*”.

> Basic anatomy of Kafka

Topic : _test

But to use that, we need a bit of coordination

Broker 1

Broker 2

Broker 3

ZK

Producer

1. Hello ZK, do you know where I can find some brokers?

2. Ahoy?

3. Want some data?

> Producer

Demo

Message acking for producer (“write concern”)

0 : Here a messa’

1 : If at least you, leader of this partition, received it and saved it, I am ok.

-1: Hey, I just sent you a message. I know it’s maybe too much to ask, but are you really sure you saved it? OK, and did all brokers in the “In Sync Replicas” for this partition save it too? I know I am … but this information is really important for our $$$$.

Trade-off: speed (0) versus durability (-1)
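The three acknowledgement levels can be illustrated with a toy model. This is a sketch under loud assumptions, not Kafka's protocol or client API: Replica and send are hypothetical names, replication is modelled as a synchronous list append, and the return value simply counts how many replica writes the producer waited for.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    """Hypothetical stand-in for a broker holding one copy of a partition."""
    log: List[bytes] = field(default_factory=list)


def send(message: bytes, leader: Replica,
         in_sync_followers: List[Replica], acks: int) -> int:
    """Toy model: returns the number of replica writes the producer waited for."""
    if acks not in (0, 1, -1):
        raise ValueError("acks must be 0, 1 or -1")
    # In this toy model the leader always receives the write; with acks=0 a
    # real producer would simply never learn whether that happened.
    leader.log.append(message)
    if acks == 0:
        return 0  # fastest: no confirmation at all
    if acks == 1:
        return 1  # confirmed by the leader only
    # acks == -1: also wait for every in-sync replica before confirming.
    # (With acks 0 or 1, followers would catch up later; not modelled here.)
    for follower in in_sync_followers:
        follower.log.append(message)
    return 1 + len(in_sync_followers)
```

The more writes the producer waits for, the slower the send and the more copies are known to exist when it returns.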

> Producer

Consumers flatmap that Log
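Read literally, a consumer is a function applied to each record of the log that yields zero, one or many results. A minimal sketch, assuming the log is just an iterable of bytes; flat_map_log and split_words are hypothetical names, not part of any Kafka client:

```python
from typing import Callable, Iterable, Iterator, List


def flat_map_log(log: Iterable[bytes],
                 handler: Callable[[bytes], List[str]]) -> Iterator[str]:
    """Applies handler to each record in order and flattens the results."""
    for record in log:
        # A record may produce no output, one output, or several.
        yield from handler(record)


def split_words(record: bytes) -> List[str]:
    """Example handler: one output per whitespace-separated word."""
    return record.decode("utf-8").split()
```

For example, list(flat_map_log([b"hello kafka", b""], split_words)) yields the two words of the first record and nothing for the empty one.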

Conclusion

State : A timeless way for failure

Questions ?

https://github.com/bulldog2011/luxun

Kafka has unique properties. PLEASE: don’t try to use something else instead.

We should talk about CAP

But CAP is about mutation

And consistency is a complicated subject


A quick note on Causality

If you don’t ensure causality for web apps, some strange behaviours may arise:

Sometimes, as a user, I cannot see my own “edits”.

Sometimes, as a customer, I cannot buy on the website after I check out my basket.

APP APP

“Which is faster, the data bus or the client?”

You don’t want to bet, especially under load.

Bonus: What is a CAS?

A Content Addressable Storage is a specific “key value store”:

operations:
- store(bytes) -> key
- get(key) -> null | bytes

rule 1: key = h(data), h being a cryptographic hash function like md5 or sha1.

rule 2: ∀data, get(store(data)) = data

Rules 1 and 2 imply: infinite cacheability and scalability.
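The two operations and two rules above fit in a short sketch. It is an in-memory toy, and the class name ContentAddressableStore is an assumption; sha1 is used only because the slide names it.

```python
import hashlib
from typing import Dict, Optional


class ContentAddressableStore:
    """Toy in-memory CAS: the key is derived from the content itself."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def store(self, data: bytes) -> str:
        # Rule 1: key = h(data), here the hex digest of sha1.
        key = hashlib.sha1(data).hexdigest()
        self._blobs[key] = bytes(data)
        return key

    def get(self, key: str) -> Optional[bytes]:
        # Returns None for a key that was never stored.
        return self._blobs.get(key)
```

Because the key depends only on the bytes, storing the same data twice returns the same key, and a value fetched for a key can be cached anywhere without invalidation.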

Examples of architectures

[Diagram. CLASSICAL: APP, APP, DB. WITH STREAMS: APP, APP, append, broadcast.]

Examples of architectures

[Diagram. CLASSICAL: APP, APP, APP, DB, DB, REPLICATION (BIN/LOG). WITH STREAMS: APP, APP, APP, append, broadcast.]

The broadcast mechanism is equivalent to a DB replication mechanism.
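One way to see the equivalence: if every app replays the same appended events in the same order, each one rebuilds the same state, which is what a database replica does with a replication log. A minimal sketch, assuming a shared Python list plays the role of the stream and that events are hypothetical (key, value) pairs; View and catch_up are illustrative names.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # hypothetical event shape: (key, value)


class View:
    """One app's local state, rebuilt by replaying the shared log in order."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.next_position = 0  # first log position not yet applied

    def catch_up(self, log: List[Event]) -> None:
        # Apply every event this view has not seen yet, in log order.
        while self.next_position < len(log):
            key, value = log[self.next_position]
            self.state[key] = value
            self.next_position += 1
```

Two views that have caught up to the same position hold identical state, whenever and however often each of them calls catch_up.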