Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)


Streaming Data Ecosystems

Spark Streaming + Kafka Best Practices

Brandon O'Brien, @hakczar, Expedia, Inc.

Tell our story, to share learnings.

Or: A Case Study in Operationalizing Spark Streaming

This was our use case; yours may be different.

Context/Disclaimer

Our use case: build a resilient, scalable data pipeline with streaming reference-data lookups, a 24-hour stream self-join and some aggregation. Values accuracy over speed.

Spark Streaming 1.5-1.6, Kafka 0.9

Standalone Cluster (not YARN or Mesos)

No Hadoop

Message velocity: thousands of messages per second. Batch window: 10s

Data sources: Kafka (primary), Redis (joins + ref data) & S3 (ref data)

This is our use case; yours may be different.

Demo: Spark in Action

Live system to reason about.

Game & Scoreboard Architecture


Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Spark Streaming & Standalone Cluster Overview

• RDD: partitioned, replicated collection of data objects
• Driver: JVM that creates the Spark program and negotiates for resources. Handles scheduling of tasks but does not do the heavy lifting. Can become a bottleneck.
• Executor: slave to the driver, executes tasks on RDD partitions. Function serialization.
• Lazy execution: transformations & actions
• Cluster types: Standalone, YARN, Mesos

Spark Streaming & Standalone Cluster Overview: Standalone Cluster

[Diagram: each node runs Master, Worker, Executor and Driver processes, alongside a ZooKeeper cluster]

Not necessarily the only way to set it up. Saves IP space.
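To make the moving parts above concrete, here is a minimal sketch (not from the talk) of a Spark 1.5/1.6 Streaming application submitted to a standalone cluster. The master URL, app name and class name are hypothetical; the 10-second batch window matches the use case described earlier.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingAppSketch {
  def main(args: Array[String]): Unit = {
    // Driver-side setup: the driver builds the program and negotiates for
    // resources; executors on the workers do the heavy lifting.
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077") // standalone cluster master (hypothetical host)
      .setAppName("streaming-pipeline")      // hypothetical app name

    // 10-second batch window, as in the use case above
    val ssc = new StreamingContext(conf, Seconds(10))

    // ... define Kafka input DStreams and transformations here (see later sections) ...

    ssc.start()
    ssc.awaitTermination()
  }
}
```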

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Design Patterns for Performance

• Delegate all IO/CPU to the Executors
• Avoid unnecessary shuffles (join, groupBy, repartition)
• Externalize streaming joins & reference data lookups for large/volatile ref data sets (see the sketch below):
  - JVM static hashmap
  - External cache (e.g. Redis)
  - Static LRU cache (amortize lookups)
  - RocksDB
• Hygienic function closures

OK, we built the app in the Spark framework for scalability and made it fast.
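As referenced in the list above, a common shape for externalized reference-data lookups is a mapPartitions call that opens its connection on the executor and consults a JVM-static LRU cache first. This is only a sketch under assumptions not in the slides: a Redis store reachable from the executors via the Jedis client, a hypothetical host/port and "ref:" key layout, and a simple bounded LinkedHashMap as the LRU.

```scala
import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.Jedis

object RefDataEnrichment {
  // JVM-static bounded LRU cache, shared by all tasks in one executor JVM
  private val MaxEntries = 10000
  private val cache: java.util.Map[String, String] =
    java.util.Collections.synchronizedMap(
      new java.util.LinkedHashMap[String, String](MaxEntries, 0.75f, true) {
        override protected def removeEldestEntry(
            eldest: java.util.Map.Entry[String, String]): Boolean = size() > MaxEntries
      })

  def enrich(events: DStream[(String, String)]): DStream[(String, String, String)] =
    events.mapPartitions { iter =>
      // Connection is created on the executor, never serialized from the driver
      val redis = new Jedis("redis-host", 6379) // hypothetical host/port
      val out = iter.map { case (key, value) =>
        var ref = cache.get(key)
        if (ref == null) {
          ref = redis.get("ref:" + key) // hypothetical key layout; lookup instead of a streaming join
          if (ref != null) cache.put(key, ref)
        }
        (key, value, ref)
      }.toList // materialize before closing the connection
      redis.close()
      out.iterator
    }
}
```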

We're done, right?

We're done, right? Just need to QA the data.

70% missing data

Pause, check on game player.

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Guaranteed Message Processing & Direct Kafka Integration

• Guaranteed Message Processing = at-least-once processing + idempotence
• Kafka Receiver:
  - Consumes messages faster than Spark can process
  - Checkpoints before processing is finished
  - Inefficient CPU utilization
• Direct Kafka Integration (see the sketch below):
  - Control over checkpointing & transactionality
  - Better distribution of resource consumption
  - 1:1 Kafka topic-partition to Spark RDD-partition
  - Use Kafka as the WAL
  - Statelessness, fail-fast

Spark is hiding the fact that it can't keep up with the stream. Crash + restart + bad checkpoint = missing messages. Config can ameliorate this; it's an artifact of the absence of a WAL/HDFS. Multiple data loss scenarios. Direct Kafka Integration = statelessness.
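A minimal sketch of the direct approach with the Kafka integration shipped for Spark 1.5/1.6 (spark-streaming-kafka, 0.8-style direct API). Broker list, topic name and the offset store are assumptions; the point is that offsets are only recorded after the batch has been processed idempotently, giving at-least-once processing without a receiver.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectKafkaSketch {
  def wireStream(ssc: StreamingContext): Unit = {
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092") // hypothetical brokers
    val topics = Set("events")                                                   // hypothetical topic

    // Direct stream: no receiver, 1:1 Kafka topic-partition to Spark RDD-partition
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // Offsets for this batch, recoverable because Kafka itself acts as the WAL
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Process the batch idempotently (writes must tolerate replays)
      rdd.foreachPartition { partition =>
        partition.foreach { case (_, value) =>
          // ... write/aggregate value ...
        }
      }

      // Only after processing succeeds: record the offset ranges (own store or
      // checkpoint), so a crash replays rather than skips messages.
      offsetRanges.foreach { range =>
        println(s"processed ${range.topic}-${range.partition}: ${range.fromOffset} -> ${range.untilOffset}")
      }
    }
  }
}
```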

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Operational Monitoring & Alerting

• Driver heartbeat
• Batch processing time
• Message count
• Kafka lag (latest offsets vs. last processed)
• Driver start events
• StatsD + Graphite + Seyren
• http://localhost:4040/metrics/json/

Simple, at a glance: batch processing time < batch interval. A strong checkpointing strategy (direct) plus fail-fast/idempotent code, then driver heartbeat + Kafka lag = confidence.
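One way to get these metrics out is a StreamingListener on the driver that pushes batch processing time and record counts to StatsD over UDP, which Graphite and Seyren then consume. A minimal sketch; the host, port and metric names are assumptions, not from the talk.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Emits StatsD lines ("name:value|type") from the driver after every batch.
class StatsdBatchListener(statsdHost: String, statsdPort: Int) extends StreamingListener {
  private val socket = new DatagramSocket()
  private val addr = InetAddress.getByName(statsdHost)

  private def send(metric: String): Unit = {
    val bytes = metric.getBytes("UTF-8")
    socket.send(new DatagramPacket(bytes, bytes.length, addr, statsdPort))
  }

  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // Batch processing time (should stay below the batch interval)
    info.processingDelay.foreach(ms => send(s"streaming.batch.processing_ms:$ms|g"))
    // Message count per batch
    send(s"streaming.batch.num_records:${info.numRecords}|c")
  }
}

// Usage (hypothetical StatsD endpoint):
//   ssc.addStreamingListener(new StatsdBatchListener("statsd-host", 8125))
```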

Data loss fixed

After a few days, we notice...

Data loss fixed. So we're done, right?

After a few days, we notice...

Cluster & app continuously crashing

I thought resiliency was the promise of Spark. Resilient Distributed Datasets.

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

The app was crashing, but why?

Spark Cluster & App Stability

Spark slave memory utilization


Spark Cluster & App Stability

• Slave memory overhead; the OOM killer
• Crashes + Kafka Receiver = missing data
• Supervised driver: --supervise for spark-submit; driver restart logging
• Cluster resource overprovisioning
• Standby Masters for failover
• Auto-cleanup of work directories: spark.worker.cleanup.enabled=true (see the config sketch below)

Crashes while using the Kafka Receiver = missing data (no WAL). Is Spark so flaky? Spark was being attacked by the operating system, and doing surprisingly well given the circumstances, especially with the direct Kafka integration and checkpointing. Goal: have enough resiliency, redundancy, idempotence and checkpointing to survive multiple failures without causing problems.
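A sketch of the resilience-related configuration the list above refers to. The memory figure is illustrative, and worker-side settings (standby masters, work-directory cleanup) normally live in spark-env.sh / spark-defaults.conf on the cluster rather than in application code; the comment shows the supervised-driver flag for spark-submit.

```scala
import org.apache.spark.SparkConf

object ResilienceConfSketch {
  val conf = new SparkConf()
    .setAppName("streaming-pipeline") // hypothetical app name
    // Leave the OS and worker daemons headroom so the OOM killer does not
    // take out executors (the size here is hypothetical)
    .set("spark.executor.memory", "8g")
    // Auto-clean old application work directories on the workers
    .set("spark.worker.cleanup.enabled", "true")

  // Supervised driver: the standalone master restarts the driver if it dies.
  //   spark-submit --deploy-mode cluster --supervise --class com.example.StreamingApp app.jar
}
```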

We're done, right?

We're done, right? Finally, yes.

Party Time

TL;DR

• Use Direct Kafka Integration + transactionality
• Cache reference data for speed
• Avoid shuffles & driver bottlenecks
• Supervised driver
• Clean up worker temp directories
• Beware of function closures
• Cluster resource over-provisioning
• Spark slave memory headroom
• Monitoring on driver heartbeat & Kafka lag
• Standby masters

Spark Streaming + Kafka Best Practices

Brandon O'Brien, @hakczar, Expedia, Inc.

Thanks!

Links

• Operationalizing Spark Streaming: https://techblog.expedia.com/2016/12/29/operationalizing-spark-streaming-part-1/
• Direct Kafka Integration: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
• App metrics: http://localhost:4040/metrics/json/
• MetricsSystem: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
• sparkConf.set("spark.worker.cleanup.enabled", "true")