Kafka Streams: The Stream Processing Engine of Apache Kafka


Kafka Streams: The New Smart Kid On The Block
The Stream Processing Engine of Apache Kafka

Eno Thereska, eno@confluent.io
@enothereska

Big Data London 2016
Slide contributions: Michael Noll


Apache Kafka and Kafka Streams API


Kafka started as a high-throughput messaging layer, or pub/sub. It has evolved into a streaming platform through the Kafka Streams library; today is about Kafka Streams. (Kafka Connect brings data into Kafka; we won't spend much time on it in this talk.)

What is Kafka Streams: a Unix analogy

$ cat < in.txt | grep apache | tr a-z A-Z > out.txt

Kafka Core

Kafka Connect

Kafka Streams
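In this analogy, Kafka Core (and Kafka Connect) plays the part of the files, pipes, and redirections, while Kafka Streams plays the part of grep and tr. As a loose sketch of the same pipeline in the Kafka Streams DSL (not from the slides; the topic names and the newer StreamsBuilder API are assumptions for illustration):

    // Inside your app's setup code; imports:
    // org.apache.kafka.streams.StreamsBuilder, org.apache.kafka.streams.kstream.KStream
    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> lines = builder.stream("in-topic");  // cat < in.txt
    lines.filter((key, line) -> line.contains("apache"))         // | grep apache
         .mapValues(line -> line.toUpperCase())                  // | tr a-z A-Z
         .to("out-topic");                                       // > out.txt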


When to use Kafka Streams

Mainstream application development
When running a cluster would suck
Microservices
Fast Data apps for small and big data
Large-scale continuous queries and transformations
Event-triggered processes
Reactive applications
The T in ETL

Use case examples:
Real-time monitoring and intelligence
Customer 360-degree view
Fraud detection
Location-based marketing
Fleet management


Some use cases in the wild & external articles:

Applying Kafka Streams for the internal message delivery pipeline at LINE Corp.
http://developers.linecorp.com/blog/?p=3960
(Kafka Streams in production at LINE, a social platform based in Japan with 220+ million users)

Microservices and reactive applications
https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams

User behavior analysis
https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html

Containerized Kafka Streams applications in Scala
https://www.madewithtea.com/processing-tweets-with-kafka-streams.html

Geo-spatial data analysis
http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/

Language classification with machine learning
https://dzone.com/articles/machine-learning-with-kafka-streams


Architecture comparison: use case example

Real-time dashboard for security monitoring: which of my data centers are under attack?


Architecture comparison: use case example

Before:
1. Capture business events in Kafka
2. Must process events with a separate cluster (e.g. Spark)
3. Must share latest results through separate systems (e.g. MySQL)
4. Other apps (dashboard frontend app, etc.) access the latest results by querying these DBs
Before: undue complexity, heavy footprint, many technologies, split ownership with conflicting priorities.

With Kafka Streams:
1. Capture business events in Kafka
2. Process events with standard Java apps (your app) that use Kafka Streams
3. Now other apps can directly query the latest results
With Kafka Streams: a simplified, app-centric architecture that puts app owners in control.

Conflicting priorities: infrastructure teams vs. product teams.
Complexity: a lot of moving pieces that are also complex individually.
Is all this part of the solution, or part of your problem?


Depending on your use case, you may even collapse the Streams app and the dashboard frontend app into a single app!

Simplify your architecture: for many use cases, you no longer need a separate processing cluster and/or external databases.

Empower and decouple your organization: before, the diversity of the required technology stack typically meant that different parts of the organization were responsible for different parts of a data pipeline (e.g. the infrastructure team that provides Spark as a shared, horizontal service to the whole company vs. the product teams in the LOB who are actually driving the business and who run on top of these shared services). With Confluent Platform and Kafka, many use cases can be implemented in a much more simplified, lightweight fashion, taking stream processing out of the Big Data niche and into mainstream application development. Here, you implement, test, deploy, run, and monitor stream processing applications just like any other applications within your company. This decoupling of teams and ownership enables faster, more agile development.

How do I install Kafka Streams?
There is, and there should be, no installation. Build Apps, Not Clusters!
It's a library. Add it to your app like any other library:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>0.10.0.1</version>
</dependency>


How do I package and deploy my apps? How do I ...?
Whatever works for you. Stick to what you and your company consider the best way.
Kafka Streams integrates well with what you already have.
Why? Because an app that uses Kafka Streams is a normal Java app.


FYI: Mesosphere is a Confluent partner. You can run the Confluent Platform on Mesosphere DC/OS.

Available APIs


Alright, let's still show the APIs before we continue with the more interesting bits and pieces.

API option 1: Kafka Streams DSL (declarative)

// Note: the generic type parameters were eaten by the HTML transcript;
// the types below are a plausible reconstruction.
KStream<byte[], Integer> input = builder.stream("numbers-topic");

// Stateless computation
KStream<byte[], Integer> doubled = input.mapValues(v -> v * 2);

// Stateful computation
KTable<Integer, Integer> sumOfOdds = input
    .filter((k, v) -> v % 2 != 0)
    .selectKey((k, v) -> 1)
    .groupByKey()
    .reduce((v1, v2) -> v1 + v2, "sum-of-odds");

The preferred API for most use cases.

The DSL particularly appeals to users who are already familiar with Spark or Flink, and to fans of Scala or functional programming.
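For context, here is a minimal sketch of how a DSL topology like the one above becomes a runnable application. The class name, topic names, and the newer StreamsBuilder API are assumptions for illustration (the 0.10-era deck used KStreamBuilder, but the shape is the same):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class DoublingApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "doubling-app"); // unique per application
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Integer().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, Integer> input = builder.stream("numbers-topic");
            input.mapValues(v -> v * 2).to("doubled-topic"); // stateless doubling, written back to Kafka

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start(); // runs until the JVM shuts down
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

Because it is just a main() method, you package and run it like any other Java application.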


API option 2: Processor API (imperative)

class PrintToConsoleProcessor<K, V> implements Processor<K, V> {

    @Override
    public void init(ProcessorContext context) {}

    @Override
    public void process(K key, V value) {
        System.out.println("Received record with " +
            "key=" + key + " and value=" + value);
    }

    @Override
    public void punctuate(long timestamp) {}

    @Override
    public void close() {}
}

Full flexibility, but more manual work.

The Processor API appeals to users who are familiar with Storm or Samza (still, check out the DSL!), and to those requiring functionality that is not yet available in the DSL.


My WordCount is better than your WordCount (?)

(Slide shows side-by-side WordCount snippets: Kafka vs. Spark.)

These isolated code snippets are nice (and actually quite similar), but they are not very meaningful. In practice, we also need to read data from somewhere, write data back to somewhere, and so on, yet we can see none of that here.


The purpose of the following comparison is to show concrete examples (rather than high-level talk) to point out some general differences between Kafka Streams and other, related stream processing technologies.

Here, we happen to compare Kafka Streams with Spark. But we could have also compared with Storm, Samza, Akka Streams, etc.


WordCount in Kafka
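The WordCount code on the slide did not survive the transcript. For reference, here is a sketch in the spirit of the canonical Kafka Streams WordCount; the topic names and the newer count()/Produced API are assumptions (the 0.10-era API differed slightly):

    import java.util.Arrays;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> textLines = builder.stream("TextLinesTopic");
    KTable<String, Long> wordCounts = textLines
        .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+"))) // split each line into words
        .groupBy((key, word) -> word) // re-key records by the word itself
        .count();                     // count occurrences per word
    wordCounts.toStream()
        .to("WordsWithCountsTopic", Produced.with(Serdes.String(), Serdes.Long()));

Note how reading from and writing to Kafka topics is part of the same few lines as the processing logic.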


Compared to: WordCount in Spark 2.0
Callouts 1-3: the runtime model leaks into the processing logic (here: interfacing from Spark with Kafka).


Compared to: WordCount in Spark 2.0
Callouts 4-5: the runtime model leaks into the processing logic (driver vs. executors).


Kafka Streams key concepts


Key concepts


Kafka Streams has a very similar data model to Kafka Core. (We won't need to go into further detail here.)

Key concepts


The processing logic of your application is defined through a processor topology, which consists of nodes (stream processors, i.e. stream processing steps) and edges (streams) that connect the nodes. If you use the DSL, all of this is hidden from you. If you use the Processor API, you can (and must) define the topology manually, as sketched below.
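As a sketch, manually wiring a topology with the Processor API might look like this. The node and topic names are made up for illustration, and the modern Topology class is assumed (the 0.10-era API used TopologyBuilder with the same shape):

    import org.apache.kafka.streams.Topology;

    Topology topology = new Topology();
    topology.addSource("Source", "input-topic");                            // node that reads a topic
    topology.addProcessor("Print", PrintToConsoleProcessor::new, "Source"); // node that processes each record
    topology.addSink("Sink", "output-topic", "Print");                      // node that writes results out

Each addX() call creates a node, and the parent-name arguments create the edges (streams) between them.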

Key concepts

Kafka Core vs. Kafka Streams


Kafka partitions data for storing and transporting it. Kafka Streams partitions data for processing it. In both cases, this partitioning enables data locality, elasticity, scalability, high performance, and fault tolerance.

That's all we want to mention at this point, because we would rather talk about something more interesting and more important: the stream-table duality.

Streams meet Tables


Streams meet Tables

http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
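A minimal illustration of the duality in DSL code (topic and variable names are hypothetical; the newer StreamsBuilder/count() API is assumed):

    StreamsBuilder builder = new StreamsBuilder();

    // A stream of facts: every click event is an independent record.
    KStream<String, String> clicks = builder.stream("user-clicks");

    // Aggregating the stream yields a table: the latest count per user.
    KTable<String, Long> clicksPerUser = clicks.groupByKey().count();

    // Turning the table back into a stream yields its changelog:
    // one record per update to the table.
    KStream<String, Long> changelog = clicksPerUser.toStream();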


Motivating example: continuously compute current users per geo-region

(Real-time dashboard: how many users younger than 30y, per region?)

Example records: alice -> (Asia, 25y), bob -> (Europe, 46y). Source topics: user-locations (mobile team) and user-prefs (web team).


Imagine that one of your products is a real-time dashboard that shows the current number of users in each region of the world. This dashboard is powered by a stream processing application that (1) builds detailed user profiles by joining data from a variety of Kafka topics and (2) computes statistics on these user profiles. The Kafka topics could include, for example, a user-preferences topic based on direct user input (e.g. birthday/age) plus a user-locations topic that receives automatic geo-location updates from a user's mobile device ("In Paris right now!"). Also, it is pretty likely that, in practice, the various information sources (topics) are managed by different teams: the user-locations topic could be managed by the mobile app team, and the user-prefs topic by the frontend team.
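A hedged sketch of what such an application could look like in the DSL. The topic names come from the slide, but the value formats, the "region,age" string encoding, and the serde handling are illustrative assumptions:

    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KTable;

    StreamsBuilder builder = new StreamsBuilder();

    // userId -> region (from the mobile team's topic)
    KTable<String, String> userLocations = builder.table("user-locations");
    // userId -> age (simplified view of the web team's prefs topic)
    KTable<String, Integer> userAges = builder.table("user-prefs");

    // 1) Join the two tables per user into a "region,age" record,
    // 2) keep only users younger than 30,
    // 3) re-key by region and count users per region.
    // (Serde configuration omitted for brevity; defaults assumed.)
    KTable<String, Long> youngUsersPerRegion = userLocations
        .join(userAges, (region, age) -> region + "," + age)
        .filter((userId, regionAge) -> Integer.parseInt(regionAge.split(",")[1]) < 30)
        .groupBy((userId, regionAge) -> KeyValue.pair(regionAge.split(",")[0], userId))
        .count();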

Now what you want to happen is that every time there is a new piece of information arriving in the upstream Kafka topics (say, a geo-location update for user Al