Big Data Logging Pipeline with Apache Spark and Kafka


Transcript of Big Data Logging Pipeline with Apache Spark and Kafka

Page 1: Big Data Logging Pipeline with Apache Spark and Kafka
Page 2: Big Data Logging Pipeline with Apache Spark and Kafka

Shipping YaaS logs with Apache Spark and Kafka

Dogukan Sonmez
Senior Software Engineer @ hybris Software
@dogukansonmez

Page 3: Big Data Logging Pipeline with Apache Spark and Kafka

Agenda

• Introduction to YaaS
• Architecture of the logging pipeline
• Technology behind the logging pipeline
• Challenges
• Recap
• Q&A

Page 4: Big Data Logging Pipeline with Apache Spark and Kafka

What is YaaS?

Page 5: Big Data Logging Pipeline with Apache Spark and Kafka

SAP hybris as a Service (YaaS)

A microservice-based business PaaS

Integrated with hybris and SAP solutions

Build. Publish. Fast.

Page 6: Big Data Logging Pipeline with Apache Spark and Kafka

yaas.io

Page 7: Big Data Logging Pipeline with Apache Spark and Kafka

Architecture of the logging pipeline

Page 8: Big Data Logging Pipeline with Apache Spark and Kafka

Architecture of the logging pipeline

Page 9: Big Data Logging Pipeline with Apache Spark and Kafka

Technology behind the logging pipeline

Apache Kafka: high-throughput messaging

• Broker
• Topic
• Partition
• Offset
• Replicated
• Distributed
• Scalable
• Fault tolerant
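These building blocks are Kafka's: log records are appended to partitioned, replicated topics spread across brokers, and consumers track their position with offsets. As a minimal sketch of the producing side, the snippet below ships log lines to Kafka with the standard Java producer client from Scala; the topic name "logs", the broker list, and keying by service name are assumptions for illustration, not the actual YaaS shipper.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogShipperSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092") // assumed broker list
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all") // wait for all in-sync replicas, in line with the reliability settings later

    val producer = new KafkaProducer[String, String](props)
    try {
      // Each service log line becomes one record on a hypothetical "logs" topic;
      // keying by service name keeps a service's logs in one partition (in order).
      producer.send(new ProducerRecord[String, String](
        "logs", "order-service", "2016-05-01T12:00:00Z INFO order created"))
    } finally {
      producer.close()
    }
  }
}
```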

Page 10: Big Data Logging Pipeline with Apache Spark and Kafka

Technology behind the logging pipeline

Apache Spark: micro-batching over RDDs

• Streaming
• DAG execution model
• Reliable
• Scalable
• Fast
• ML (MLlib)
• Graph (GraphX)
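Spark Streaming turns the Kafka topics into a sequence of micro-batches, each exposed as an RDD flowing through a DAG of transformations. Below is a minimal sketch of such a job, assuming the spark-streaming-kafka-0-10 direct stream; the batch interval, broker list, group id, and topic name are assumptions, and the exact integration used in the talk is not shown.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object LogStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("yaas-log-pipeline-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10s micro-batches (assumed interval)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092,broker2:9092",   // assumed brokers
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "log-indexer",                 // assumed consumer group
      "enable.auto.commit" -> (false: java.lang.Boolean)     // offsets handled by the job itself
    )

    // One direct stream over the hypothetical "logs" topic; each micro-batch is an RDD of records.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("logs"), kafkaParams))

    // Parse and count log lines per micro-batch; the real job indexes them into Elasticsearch.
    stream.map(_.value).foreachRDD { rdd =>
      println(s"batch size: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```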

Page 11: Big Data Logging Pipeline with Apache Spark and Kafka

Big Data pipeline challenges

Reliability of Kafka

BEFORE
• 3 brokers
• 3 ZooKeeper instances
• default.replication.factor=2
• mostly default configurations

AFTER
• 5 brokers
• 5 ZooKeeper instances
• unclean.leader.election.enable=false
• min.insync.replicas=2
• default.replication.factor=3
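For context on the AFTER settings: unclean.leader.election.enable=false prevents an out-of-sync replica from becoming leader (no silent data loss), and min.insync.replicas=2 with replication factor 3 means a write acknowledged with acks=all survives the loss of one broker. The broker-level settings live in server.properties; purely as an illustration, a per-topic view of the same policy could be expressed with the Kafka AdminClient as below (this API arrived after the talk, and the topic name and partition count are assumptions).

```scala
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object LogsTopicSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
    val admin = AdminClient.create(props)

    // 12 partitions (assumed) with replication factor 3, matching the AFTER slide;
    // min.insync.replicas=2 is set per topic here, while default.replication.factor
    // and unclean.leader.election.enable=false stay in the brokers' server.properties.
    val logsTopic = new NewTopic("logs", 12, 3.toShort)
      .configs(Map("min.insync.replicas" -> "2").asJava)

    admin.createTopics(Collections.singleton(logsTopic)).all().get()
    admin.close()
  }
}
```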

Page 12: Big Data Logging Pipeline with Apache Spark and Kafka

Big Data pipeline challenges

Spark Streaming checkpointing

BEFORE
• Spark checkpointing
• All RDDs serialized and stored in HDFS

AFTER
• Custom Kafka checkpointing
  (only the latest offsets stored in Kafka)
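Keeping only the latest offsets in Kafka lets the job restart from where it left off without serializing whole RDD lineages to HDFS. One common way to do this with the spark-streaming-kafka-0-10 integration is sketched below (the talk's custom checkpointing predates this API and may differ); `stream` is the direct stream from the earlier sketch.

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Commit offsets back to Kafka only after a micro-batch has been processed,
// so a restart resumes from the last successfully indexed batch.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... transform the batch and index it into Elasticsearch here ...

  // Store just the latest offsets in Kafka; no RDD state goes to HDFS.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```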

Page 13: Big Data Logging Pipeline with Apache Spark and Kafka

Big Data pipeline challenges

Indexing big data into Elasticsearch

BEFORE
• Default mapping
• index.refresh_interval=1s
• indices.memory.index_buffer_size=10%

AFTER
• Custom mapping with norms disabled
• Mapping using the simple analyzer
• index.refresh_interval=30s
• indices.memory.index_buffer_size=30%
• spark.streaming.kafka.maxRatePerPartition=10000
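A rough idea of what the AFTER mapping and settings could look like, expressed as the JSON body for index creation; the index name, field names, and the Elasticsearch 5.x syntax are assumptions, the talk's actual mapping is not shown.

```scala
// Sketch of index settings and a log mapping reflecting the AFTER column:
// slower refresh, norms disabled, simple analyzer on the message text.
// ES 2.x would use the "string" type with "norms": {"enabled": false} instead.
object LogIndexTemplateSketch {
  val createIndexBody: String =
    """{
      |  "settings": {
      |    "index.refresh_interval": "30s"
      |  },
      |  "mappings": {
      |    "log": {
      |      "properties": {
      |        "service":    { "type": "keyword" },
      |        "level":      { "type": "keyword" },
      |        "message":    { "type": "text", "analyzer": "simple", "norms": false },
      |        "@timestamp": { "type": "date" }
      |      }
      |    }
      |  }
      |}""".stripMargin

  // indices.memory.index_buffer_size is a node-level setting and belongs in
  // elasticsearch.yml rather than the index body:
  //   indices.memory.index_buffer_size: 30%
}
```

The last AFTER item is on the Spark side: spark.streaming.kafka.maxRatePerPartition is set on the streaming job's SparkConf and caps how many records each Kafka partition contributes per micro-batch, so the indexer is not overwhelmed.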

Page 14: Big Data Logging Pipeline with Apache Spark and Kafka

Recap

Page 15: Big Data Logging Pipeline with Apache Spark and Kafka

Recap

Page 16: Big Data Logging Pipeline with Apache Spark and Kafka

Q&A

Page 17: Big Data Logging Pipeline with Apache Spark and Kafka

https://hackingat.hybris.com