Big Data Logging Pipeline with Apache Spark and Kafka


Transcript of Big Data Logging Pipeline with Apache Spark and Kafka

Page 1: Big Data Logging Pipeline with Apache Spark and Kafka
Page 2: Big Data Logging Pipeline with Apache Spark and Kafka

Shipping YaaS logs with Apache Spark and Kafka

Dogukan Sonmez
Senior Software Engineer @ hybris Software
@dogukansonmez

Page 3: Big Data Logging Pipeline with Apache Spark and Kafka

Agenda

• Introduction to YaaS
• Architecture of the logging pipeline
• Technology behind the logging pipeline
• Challenges
• Recap
• Q&A

Page 4: Big Data Logging Pipeline with Apache Spark and Kafka

What is YaaS?

Page 5: Big Data Logging Pipeline with Apache Spark and Kafka

SAP hybris as a Service (YaaS)

A microservice-based business PaaS

Integrated with hybris and SAP solutions

Build. Publish. Fast.

Page 6: Big Data Logging Pipeline with Apache Spark and Kafka

yaas.io

Page 7: Big Data Logging Pipeline with Apache Spark and Kafka

Architecture of the logging pipeline

Page 8: Big Data Logging Pipeline with Apache Spark and Kafka

Architecture of the logging pipeline

Page 9: Big Data Logging Pipeline with Apache Spark and Kafka

Technology behind the logging pipeline

Apache Kafka: high-throughput messaging

• Broker
• Topic
• Partition
• Offset
• Replicated
• Distributed
• Scalable
• Fault tolerant
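These building blocks are Kafka's: log records are appended to partitioned, replicated topics spread across brokers, and consumers track their position with offsets. As a minimal sketch of the producing side, the snippet below ships log lines to Kafka with the standard Java producer client from Scala; the topic name "logs", the broker list, and keying by service name are assumptions for illustration, not the actual YaaS shipper.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogShipperSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092") // assumed broker list
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all") // wait for all in-sync replicas, in line with the reliability settings later

    val producer = new KafkaProducer[String, String](props)
    try {
      // Each service log line becomes one record on a hypothetical "logs" topic;
      // keying by service name keeps a service's logs in one partition (in order).
      producer.send(new ProducerRecord[String, String](
        "logs", "order-service", "2016-05-01T12:00:00Z INFO order created"))
    } finally {
      producer.close()
    }
  }
}
```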

Page 10: Big Data Logging Pipeline with Apache Spark and Kafka

Technology behind the logging pipeline

Apache Spark: micro-batching over RDDs

• Streaming
• DAG execution model
• Reliable
• Scalable
• Fast
• ML (MLlib)
• Graph (GraphX)
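Spark Streaming turns the Kafka topics into a sequence of micro-batches, each exposed as an RDD flowing through a DAG of transformations. Below is a minimal sketch of such a job, assuming the spark-streaming-kafka-0-10 direct stream; the batch interval, broker list, group id, and topic name are assumptions, and the exact integration used in the talk is not shown.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object LogStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("yaas-log-pipeline-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10s micro-batches (assumed interval)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092,broker2:9092",   // assumed brokers
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "log-indexer",                 // assumed consumer group
      "enable.auto.commit" -> (false: java.lang.Boolean)     // offsets handled by the job itself
    )

    // One direct stream over the hypothetical "logs" topic; each micro-batch is an RDD of records.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("logs"), kafkaParams))

    // Parse and count log lines per micro-batch; the real job indexes them into Elasticsearch.
    stream.map(_.value).foreachRDD { rdd =>
      println(s"batch size: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```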

Page 11: Big Data Logging Pipeline with Apache Spark and Kafka

Big Data pipeline challenges

Reliability of Kafka

BEFORE
• 3 brokers
• 3 ZooKeeper instances
• default.replication.factor=2
• mostly default configurations

AFTER
• 5 brokers
• 5 ZooKeeper instances
• unclean.leader.election.enable=false
• min.insync.replicas=2
• default.replication.factor=3
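For context on the AFTER settings: unclean.leader.election.enable=false prevents an out-of-sync replica from becoming leader (no silent data loss), and min.insync.replicas=2 with replication factor 3 means a write acknowledged with acks=all survives the loss of one broker. The broker-level settings live in server.properties; purely as an illustration, a per-topic view of the same policy could be expressed with the Kafka AdminClient as below (this API arrived after the talk, and the topic name and partition count are assumptions).

```scala
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object LogsTopicSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
    val admin = AdminClient.create(props)

    // 12 partitions (assumed) with replication factor 3, matching the AFTER slide;
    // min.insync.replicas=2 is set per topic here, while default.replication.factor
    // and unclean.leader.election.enable=false stay in the brokers' server.properties.
    val logsTopic = new NewTopic("logs", 12, 3.toShort)
      .configs(Map("min.insync.replicas" -> "2").asJava)

    admin.createTopics(Collections.singleton(logsTopic)).all().get()
    admin.close()
  }
}
```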

Page 12: Big Data Logging Pipeline with Apache Spark and Kafka

Big Data pipeline challenges

Spark Streaming checkpointing

BEFORE
• Spark checkpointing
• All RDDs serialized and stored in HDFS

AFTER
• Custom Kafka checkpointing
  (only the latest offsets stored in Kafka)
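Keeping only the latest offsets in Kafka lets the job restart from where it left off without serializing whole RDD lineages to HDFS. One common way to do this with the spark-streaming-kafka-0-10 integration is sketched below (the talk's custom checkpointing predates this API and may differ); `stream` is the direct stream from the earlier sketch.

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Commit offsets back to Kafka only after a micro-batch has been processed,
// so a restart resumes from the last successfully indexed batch.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... transform the batch and index it into Elasticsearch here ...

  // Store just the latest offsets in Kafka; no RDD state goes to HDFS.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```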

Page 13: Big Data Logging Pipeline with Apache Spark and Kafka

Big Data pipeline challenges

Indexing big data into Elasticsearch

BEFORE
• Default mapping
• index.refresh_interval=1s
• indices.memory.index_buffer_size=10%

AFTER
• Custom mapping with norms disabled
• Mapping using the simple analyzer
• index.refresh_interval=30s
• indices.memory.index_buffer_size=30%
• spark.streaming.kafka.maxRatePerPartition=10000
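A rough idea of what the AFTER mapping and settings could look like, expressed as the JSON body for index creation; the index name, field names, and the Elasticsearch 5.x syntax are assumptions, the talk's actual mapping is not shown.

```scala
// Sketch of index settings and a log mapping reflecting the AFTER column:
// slower refresh, norms disabled, simple analyzer on the message text.
// ES 2.x would use the "string" type with "norms": {"enabled": false} instead.
object LogIndexTemplateSketch {
  val createIndexBody: String =
    """{
      |  "settings": {
      |    "index.refresh_interval": "30s"
      |  },
      |  "mappings": {
      |    "log": {
      |      "properties": {
      |        "service":    { "type": "keyword" },
      |        "level":      { "type": "keyword" },
      |        "message":    { "type": "text", "analyzer": "simple", "norms": false },
      |        "@timestamp": { "type": "date" }
      |      }
      |    }
      |  }
      |}""".stripMargin

  // indices.memory.index_buffer_size is a node-level setting and belongs in
  // elasticsearch.yml rather than the index body:
  //   indices.memory.index_buffer_size: 30%
}
```

The last AFTER item is on the Spark side: spark.streaming.kafka.maxRatePerPartition is set on the streaming job's SparkConf and caps how many records each Kafka partition contributes per micro-batch, so the indexer is not overwhelmed.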

Page 14: Big Data Logging Pipeline with Apache Spark and Kafka

Recap

Page 15: Big Data Logging Pipeline with Apache Spark and Kafka

Recap

Page 16: Big Data Logging Pipeline with Apache Spark and Kafka

Q&A

Page 17: Big Data Logging Pipeline with Apache Spark and Kafka

https://hackingat.hybris.com