Apache Kafka - Martin Podval

  • Apache Kafka

    @MartinPodval, hpsv.cz

  • What is Apache Kafka?

    Messaging System
    Distributed
    Persistent and Replicable
    Very fast - low latency - and scalable
    Simple but highly configurable
    By LinkedIn, open sourced under apache.org

  • Data Streaming

    New kind of data ...
    User or application data (event) streams
    Monitoring - App, System
    App Logging
    High volume

  • Data Streaming Contd

    ... you want to process
    Using various components
    Into a target form
    Map, reduce, shuffle
    Real time or batch

  • HP Service Virtualization Use Cases

    Processing of client message streams

    Real-time performance modeling

    Log aggregation

  • How To Solve It?

    Producers and Consumers
    Distributed
    Decoupled
    Configurable
    Dynamic

  • Kafka Cluster

    Brokers = Instances, Nodes
    Topics
    Partitions
    Replicas

    ZK (ZooKeeper) Coordination

  • Kafka Topics

    Commit Log
    Immutable
    Ordered
    Sequential Offset
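
The commit-log idea on this slide can be sketched as a toy in-memory structure. This is only an illustration under stated assumptions: the class name `CommitLog` and its methods are hypothetical and are not Kafka's API. It shows the three properties listed above: records are only appended, they keep their order, and each one gets the next sequential offset.

```python
class CommitLog:
    """Toy append-only log (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._records = []

    def append(self, message):
        """Append a message and return the sequential offset assigned to it."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages (offset, message) pairs starting at offset."""
        end = min(offset + max_messages, len(self._records))
        return [(i, self._records[i]) for i in range(offset, end)]


log = CommitLog()
offsets = [log.append(m) for m in (b"a", b"b", b"c")]
from_one = log.read(1)
```

Because existing records never change, a reader only needs to remember one number, its offset, to resume where it stopped.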

  • Kafka Topics Contd

    Partitioned
    Independently: Stored, Produced, Consumed


    Replicated
    On a per-partition basis
    On different brokers

    Fault Tolerant
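
A rough sketch of what partitioning and per-partition replication could look like. Both functions are hypothetical helpers written for this example. The hashing scheme and the round-robin placement are assumptions chosen for simplicity, not the algorithm a real Kafka cluster uses.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key (bytes) to a partition.

    crc32 is used because it is stable across runs, unlike Python's
    built-in hash() for bytes.
    """
    return zlib.crc32(key) % num_partitions


def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on distinct brokers, round-robin.

    The first broker in each list plays the role of the leader.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


assignment = assign_replicas(3, ["broker-0", "broker-1", "broker-2"], 2)
partition = choose_partition(b"user-42", 3)
```

With this placement, losing any single broker still leaves one copy of every partition on another broker, which is the fault tolerance the slide refers to.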

  • What Can I Do?

    producer.write(topic_id, message);

    consumer.read(topic_id, offset);

  • I Want To Produce

    Java/Scala client
    Address of one or more brokers
    Choose a topic to produce to
    Highly configurable and tunable:

      partitioner
      number of acks (async=0, master=1, replicas=1+?)
      batching, buffer size, timeouts, retries, ...
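
To make the batching knob concrete, here is a minimal sketch of a producer that buffers messages and hands them over in batches. `BatchingProducer`, `send_batch`, and `batch_size` are hypothetical names for this example. A real client also flushes on timeouts and handles retries, which are left out here.

```python
class BatchingProducer:
    """Toy producer that sends messages in batches (hypothetical, not the Kafka client)."""

    def __init__(self, send_batch, batch_size=3):
        self._send_batch = send_batch  # callable that "sends" a list of messages
        self._batch_size = batch_size
        self._buffer = []

    def write(self, message):
        """Buffer the message; send the buffer once it holds batch_size messages."""
        self._buffer.append(message)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered, if anything."""
        if self._buffer:
            self._send_batch(list(self._buffer))
            self._buffer.clear()


sent = []  # stands in for the network: each element is one batch
producer = BatchingProducer(sent.append, batch_size=3)
for i in range(7):
    producer.write(f"msg-{i}")
producer.flush()
batch_sizes = [len(batch) for batch in sent]
```

Larger batches mean fewer requests and better throughput, at the cost of messages waiting longer in the buffer.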

  • I Want To Consume

    High Level API
    Groups abstraction

    To All, To One, To Some

    Stream API
    Stores positions to support fault tolerance
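
The groups abstraction can be illustrated with a small sketch. `assign_partitions` is a hypothetical helper and the round-robin split is an assumption made for this example. The point is only that partitions are divided among the members of one group, while separate groups each see everything.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of a single group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# "To One": both consumers share one group, so each partition (and each
# message in it) is handled by exactly one of them, like a queue.
queue_like = assign_partitions(partitions, ["c1", "c2"])

# "To All": every consumer is alone in its own group, so each of them
# receives all partitions, like publish/subscribe.
broadcast = {c: assign_partitions(partitions, [c])[c] for c in ["c1", "c2"]}
```

"To Some" falls between the two: several groups, some of which have more than one member.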

  • I Want To Consume Contd

    Low Level
    Java/Scala client
    Find a leader for a topic
    Calculate an offset
    Fetch messages

    Re-consume if needed

  • I Want To Consume Contd

    Delivery semantics:
      At most once
      At least once
      Exactly once
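
The difference between the first two semantics comes down to the order of two steps: storing the position and processing the message. The simulation below is a sketch under stated assumptions. `consume` is a hypothetical function, the "crash" is simulated by returning early, and exactly once is not shown, since it generally needs the position to be stored atomically together with the processing result (or an idempotent processing step).

```python
def consume(log, committed, commit_first, crash_at=None):
    """Process messages from `committed` onward; return (processed, new_committed).

    crash_at simulates a crash at that offset, between the two steps
    (commit and process) of handling one message.
    """
    processed = []
    offset = committed
    while offset < len(log):
        if commit_first:
            committed = offset + 1            # step 1: store the position
            if offset == crash_at:
                return processed, committed   # crash before processing
            processed.append(log[offset])     # step 2: process
        else:
            processed.append(log[offset])     # step 1: process
            if offset == crash_at:
                return processed, committed   # crash before storing the position
            committed = offset + 1            # step 2: store the position
        offset += 1
    return processed, committed


log = ["m0", "m1", "m2"]

# At most once: commit first, crash at m1, then restart from the stored position.
first, position = consume(log, 0, commit_first=True, crash_at=1)
second, _ = consume(log, position, commit_first=True)
at_most_once = first + second

# At least once: process first, crash at m1, then restart from the stored position.
first, position = consume(log, 0, commit_first=False, crash_at=1)
second, _ = consume(log, position, commit_first=False)
at_least_once = first + second
```

In the first run `m1` is lost, in the second it is processed twice.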

  • Kafka Internals - Disks

    Avoid: GC, random disk access


  • Kafka Internals - Disks Contd

    Disks are fast ...

    when properly used
    sequential access - read ahead, write behind
    rely on the operating system

    avoid heap, materialization and GC
    it's more like a file copy over the network

    It's easy with immutable topics
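
The "file copy over network" idea can be sketched with Python's standard `socket.sendfile`, which uses the operating system's `sendfile` call where available and falls back to plain `send` otherwise. This is only an illustration: the file name `segment.log`, the message framing, and the local socket pair standing in for a broker-to-consumer connection are all assumptions of this example.

```python
import os
import socket
import tempfile

messages = [b"first;", b"second;", b"third;"]

with tempfile.TemporaryDirectory() as tmp:
    # Append the messages sequentially to one file, like a log segment.
    path = os.path.join(tmp, "segment.log")
    with open(path, "wb") as segment:
        for message in messages:
            segment.write(message)

    # A consumer asks for everything after the first message: that is just
    # a byte range of the file, identified by a start position and a length.
    start = len(messages[0])
    count = len(messages[1]) + len(messages[2])

    server, client = socket.socketpair()
    with server, client, open(path, "rb") as segment:
        # The bytes go from the file to the socket without being turned
        # into message objects in this process.
        sent = server.sendfile(segment, offset=start, count=count)
        received = b""
        while len(received) < sent:
            received += client.recv(4096)
```

Since the stored bytes never change, the same range can be handed to any number of consumers as-is.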

  • Kafka Internals - Replication

    In Sync Replicas
    Replication factor on a per-partition basis
    One leader + 0..n replicas
    Replicas are consumers

    In sync if they are not too far behind the leader
    Batch sync
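
A minimal sketch of the "not too far behind" rule. The function name and the message-count threshold `max_lag` are assumptions made for illustration. A real broker may measure lag differently, for example by time.

```python
def in_sync_replicas(leader_end_offset, follower_offsets, max_lag):
    """Return the followers whose lag behind the leader is at most max_lag messages."""
    return sorted(
        broker
        for broker, offset in follower_offsets.items()
        if leader_end_offset - offset <= max_lag
    )


# The leader has written up to offset 100; followers fetch from it like consumers.
followers = {"b1": 98, "b2": 60, "b3": 100}
isr = in_sync_replicas(100, followers, max_lag=5)
```

Here `b2` is 40 messages behind, so it drops out of the in-sync set until it catches up.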

  • Kafka Internals - Replication Contd

    Tunable Trade-Offs
    Producer's write method:

      Not blocked, async
      Waits for master ACK
      Waits for all in-sync replicas

    Consumer pulls only committed messages
    Server's minimum in-sync replicas

  • Performance


    Scales with:
      client count, message size
      number of replicas, partitions or topics

    Depends on network and disk throughput

  • Performance Contd

    Our testing
    3 nodes, master + 2 replicas
    500 000 msg/s (100-byte messages)
    400 Mbit/s - 1.2 Gbit/s network throughput
    End-to-end latency 2-3 ms
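
A quick back-of-the-envelope check of these numbers. The message rate, message size, and replica count come from the slide. Reading the upper bound as "payload plus one copy per replica" is an assumption of this sketch, not something the slide states, and protocol overhead is ignored.

```python
messages_per_second = 500_000
message_size_bytes = 100
replicas = 2  # followers, in addition to the master

# Payload entering the cluster from producers.
payload_bits_per_second = messages_per_second * message_size_bytes * 8
payload_mbit = payload_bits_per_second / 1_000_000

# Assumption: each replica fetches one full copy of the payload from the master.
total_bits_per_second = payload_bits_per_second * (1 + replicas)
total_gbit = total_bits_per_second / 1_000_000_000
```

Under that assumption the payload alone is 400 Mbit/s and the payload plus two replica copies is 1.2 Gbit/s, which lines up with the reported range.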

    @see http://bit.ly/1FsIR9a

  • Ease of Use

    No installation, just run a Java/Scala program

    Streams in files & dirs
    Transparent ZooKeeper
    Ecosystem

  • Cons

    Beta version
    Dependency on ZooKeeper
    The way it is written in Scala
    No easy way to remove messages

  • Questions?