Spark Streaming & Kafka: The Future of Stream Processing


8/20/15
Jack Gudenkauf, VP Big Data
https://twitter.com/_JG

scala> sc.parallelize(List("Kafka Spark Vertica"), 3).mapPartitions(iter => { iter.toList.map(x => print(x)) }.iterator).collect; println()


Playtika
Founded in 2010
Social Casino global category leader
10 games, 13 platforms, 1000+ employees


Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O'Reilly)

Spark + Kafka: Future of Stream Processing

Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
Exponential data growth from mobile, web, social
Connected devices: 9B in 2012 to 50B by 2020
Over 1 trillion sensors by 2020
Datacenter IP traffic growing at a CAGR of 25%
How can we harness this data in real time?
Value can quickly degrade, so capture value immediately
From reactive analysis to direct operational impact
Unlocks new competitive advantages
Requires a completely new approach...

The Narrative:
Vast quantities of streaming data are being generated, and more will be generated thanks to phenomena such as the Internet of Things. The motivation for real-time stream processing is to turn all this data into valuable insights and actions as soon as the data is generated. Instant processing of the data also opens the door to new use cases that were not possible before.

NOTE: Feel free to remove the image of The Flash if it feels unprofessional or overly cheesy.


From Volume and Variety to Velocity

Big Data has evolved, and the Hadoop ecosystem evolves with it.

Past: Big Data = Volume + Variety
Batch Processing
Time to insight of hours

Present: Big Data = Volume + Variety + Velocity
Batch + Stream Processing
Time to insight of seconds

The Narrative:
As you can see from the previous slides, lots of streaming data will be generated, and making this data actionable in real time is very valuable across industries. Our very own Hadoop is all you need. Previously, Hadoop was associated just with big unstructured data; that was Hadoop's selling point. But now Hadoop can also handle real-time data (in addition to big unstructured data), so think Hadoop when you think real-time streaming.

Purpose of the slide: The goal is to associate Hadoop with real-time, to get people to think Hadoop when they think of real-time streaming data.


Key Components of Streaming Architectures

Data Ingestion & Transportation Service (Kafka, Flume)
Real-Time Stream Processing Engine
Real-Time Data Serving
Data Management & Integration
System Management
Security


Canonical Stream Processing Architecture

Diagram: Data Sources -> Kafka (Data Ingest) -> App 1, App 2, ... -> Kafka -> Flume -> HDFS / HBase


Spark: Easy and Fast Big Data
Easy to Develop
Rich APIs in Java, Scala, Python
Interactive shell
Fast to Run
General execution graphs
In-memory storage
2-5x less code
Up to 10x faster on disk, 100x in memory

Spark Architecture

Diagram: the Driver sends Tasks to multiple Workers (each holding Data in RAM) and collects Results back from them.

RDDs
RDD = Resilient Distributed Dataset
Immutable representation of data
Operations on one RDD create a new one
Memory caching layer that stores data in a distributed, fault-tolerant cache
Created by parallel transformations on data in stable storage
Lazy materialization

Two observations:
Can fall back to disk when the data set does not fit in memory
Provides fault tolerance through the concept of lineage
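A minimal spark-shell sketch of these properties (the sample strings and the ERROR filter are made up for illustration): transformations build new RDDs lazily, cache() keeps the result in memory, and the action is what triggers computation.

val lines = sc.parallelize(Seq("kafka", "spark", "ERROR: broker down", "vertica"))
val errors = lines.filter(_.startsWith("ERROR"))  // transformation: builds a new RDD lazily
errors.cache()                                    // keep the materialized partitions in memory
println(errors.count())                           // action: triggers the actual computation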

Spark Streaming
Extension of Apache Spark's Core API, for stream processing.

The framework provides:
Fault Tolerance
Scalability
High Throughput

Purpose of this slide:
Make sure to associate Spark Streaming with Apache Spark, so folks know it is a part of THE Apache Spark that everyone is talking about. List some of the key properties and attributes that make Spark Streaming a good platform for stream processing.

Note: If required, we can mention low latency as well.


Spark Streaming
Incoming data represented as Discretized Streams (DStreams)
Stream is broken down into micro-batches
Each micro-batch is an RDD, so code can be shared between batch and streaming

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Micro-batch Architecture
Diagram: the tweets DStream is transformed by flatMap into the hashTags DStream and saved, once per micro-batch (batch @ t, t+1, t+2). The stream is composed of small (1-10s) batch computations.

Use DStreams for Windowing Functions
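A minimal sketch of a windowed computation over the hashTags DStream from the previous slide (the window and slide durations are made-up values): reduceByKeyAndWindow counts tags over the last 60 seconds, recomputed every 10 seconds.

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._

val tagCounts = hashTags
  .map(tag => (tag, 1))                                              // key each tag with a count of 1
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
tagCounts.print()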

Spark Streaming
Runs as a Spark job
YARN or standalone for scheduling; YARN has KDC integration
Use the same code for real-time Spark Streaming and for batch Spark jobs
Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ
Easy to write Receivers for custom messaging systems

Sharing Code between Batch and Streaming

def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}

A library function that filters ERRORs.

Streaming generates RDDs periodically
Any code that operates on RDDs can therefore be used in streaming as well

Sharing Code between Batch and Streaming

Spark:
val lines = sc.textFile(...)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)

Spark Streaming:
val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
val filtered = dStream.transform((rdd: RDD[String], time: Time) => filterErrors(rdd))
filtered.saveAsTextFiles(...)

Reliability
Received data automatically persisted to an HDFS Write Ahead Log (WAL) to prevent data loss
Set spark.streaming.receiver.writeAheadLog.enable=true in the Spark conf
When the Application Master dies, the application is restarted by YARN
Received, acked but unprocessed data is replayed from the WAL (data that made it into blocks)
Reliable Receivers can replay data from the original source, if required
Un-acked data is replayed from the source
The Kafka and Flume receivers bundled with Spark are examples
Reliable Receivers + WAL = no data loss on driver or receiver failure!
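A minimal sketch of turning this on when building the StreamingContext (the app name, batch interval, and checkpoint path are made-up values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-wal-demo")                               // hypothetical app name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // persist received blocks to the WAL
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///user/demo/checkpoints")                   // hypothetical path; needed for recovery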

Reliable Kafka DStream
Stores received data in the Write Ahead Log on HDFS for replay: no data loss!
Stable and supported!
Uses a reliable receiver to pull data from Kafka
Application-controlled parallelism: create as many receivers as you want to parallelize
Remember each receiver is a task that holds one executor hostage; no processing happens on that executor
Tricky to do this efficiently, and so is controlling ordering (everything needs to be done explicitly)
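A minimal sketch of the receiver-based approach, assuming an existing StreamingContext ssc (the ZooKeeper quorum, consumer group, topic name, and receiver count are made-up values). Each createStream call starts one receiver, and the resulting streams are unioned for downstream processing.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val numReceivers = 3
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "demo-group", Map("events" -> 1),
    StorageLevel.MEMORY_AND_DISK_SER)                              // one receiver per stream
}
val messages = ssc.union(kafkaStreams)                             // one logical DStream of (key, message) pairs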

Reliable Kafka DStream - Issues
Kafka can replay messages if processing failed for some reason, so the WAL is overkill and causes an unnecessary performance hit
In addition, the reliable stream causes a lot of network traffic due to unneeded HDFS writes etc.
Receivers hold executors hostage which could otherwise be used for processing
How can we solve these issues?

Direct Kafka DStream
No long-running receiver = no executor hogging!
Communicates with Kafka via the low-level API
1 Spark partition per Kafka partition
At the end of every batch, each partition covers the first message after the last batch up to the current latest message in the Kafka partition
If a max rate is configured, then rate x batch interval is downloaded & processed
Checkpoint contains the starting and ending offsets of the current RDD
Recovering from a checkpoint is simple: last offset + 1 is the first offset of the next batch
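A minimal sketch of creating a direct stream, again assuming an existing StreamingContext ssc (the broker list and topic name are made-up values). There is no receiver, and each batch RDD gets one partition per Kafka partition.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))                                 // hypothetical topic name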

Direct Kafka DStream
(Almost) exactly-once processing
At the end of each interval, the RDD can provide information about its starting and ending offsets
These offsets can be persisted, so even on failure you can recover from there
Edge cases are possible and can cause duplicates
Failure in the middle of HDFS writes -> duplicates!
Failure after processing but before offsets get persisted -> duplicates! (More likely!)
Writes to Kafka can also cause duplicates, and so can reads from Kafka
Fix: your app should really be resilient to duplicates
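A minimal sketch of reading the offsets off each batch of the direct stream created above, so they can be persisted alongside the results (the println stands in for whatever offset store the application actually uses):

import org.apache.spark.streaming.kafka.HasOffsetRanges

directStream.foreachRDD { rdd =>
  // the batch RDD carries the Kafka offset ranges it was built from
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // process rdd here, then persist the offsets atomically with the results
  offsetRanges.foreach(o => println(s"${o.topic} ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}"))
}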

Spark Streaming Use Cases
Real-time dashboards: show approximate results in real time, and reconcile periodically with the source of truth using Spark
Joins of multiple streams, with time-based or count-based windows, to combine multiple sources of input into composite data (see the sketch below)
Re-use RDDs created by Streaming in other Spark jobs
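A minimal sketch of a windowed join, assuming two keyed DStreams have already been derived from the streams created earlier (the key extraction and window length are made up):

import org.apache.spark.streaming.Seconds

// hypothetical: key both streams by their first comma-separated field
val impressions = messages.map { case (_, line) => (line.split(",")(0), line) }
val clicks = directStream.map { case (_, line) => (line.split(",")(0), line) }

val joined = impressions.window(Seconds(60)).join(clicks.window(Seconds(60)))
joined.print()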

What is coming?
Better monitoring and alerting: batch-level and task-level monitoring
SQL on Streaming: run SQL-like queries on top of Streaming (medium to long term)
Python! Limited support already available, but more detailed support coming
ML: more real-time ML algorithms

Current Spark project status
400+ contributors and 50+ companies contributing
Includes: Databricks, Cloudera, Intel, Huawei, Yahoo! etc.
Dozens of production deployments
Spark Streaming survived the Netflix Chaos Monkey: production ready!
Included in CDH!

More Info
CDH Docs: http://www.cloudera.com/content/cloudera-content/cloud