Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)

Spark and Spark Streaming @ Netflix Kedar Sadekar & Monal Daxini

Mission •  Enable rapid pace of innovation for

Algorithm Engineers

•  Business Value – More A/B tests

Experiments

Users with plays Feature selection Large sample size

Turn back time Multiple ideas

Use Cases

Feature Selection

Feature Generation

Model Training

Metric Evaluation

Use Cases

More Users

Faster Iteration

InteractiveTurn back time

Solution – Netflix BDAS

Netflix BDAS

Notebooks• Zeppelin / iPython

• Inline Graphs / REPL

Prana Sidecar• Netflix Ecosystem

Berkeley BDAS• Faster Compute• Scale Users• Access to S3 / Hive

Netflix BDAS - Features •  Simplicity

–  Individual cluster •  Prana - Netflix ecosystem

–  Automatic configuration –  Classloader isolation –  Discovery & healthcheck

Netflix BDAS - Features •  Ad-hoc experimentation •  Time machine functionality •  Access to Hive data and micro services from single place

–  Access to multiple AWS buckets (S3)

Netflix BDAS – Sample Deployment

Wins •  8X the number of users

•  5x - 9x faster

•  Interactive

•  Turn back time

Learnings •  Easy to bring down an online system •  Almost killed 1000’s of ETL jobs

- hive metastore update •  Too many systems and configuration •  Playing catch up with libraries and tools

- Hadoop, iPython, Zeppelin •  Scala / Spark learning curve •  Debugging

- files open, no resources etc.

Increased Adoption Adoption increasing amongst teams •  Multiple Algorithmic Eng. teams •  Personalization Infrastructure •  Marketing •  Security •  A/B Test Engineering

Looking Ahead •  Spark-R / Dataframes support •  Multi-tenancy

–  Job specific configurations •  Debuggability •  Newer notebooks •  Spark Streaming

–  Lambda Architecture –  Real-time algorithms (trending now)

Netflix Streaming Event Data Pipeline

Monal Daxini

Netflix Event Data Pipeline

Event Streams

Stream processing

Publish

Collect

Process

Events @ Cloud Scale

450 Billion Events per Day

8 Million (17 GB) per sec peak

S3 EMR

Event Producer

Fronting Kafka

Consumer Kafka

At least Once Processing

Stream Consumers

Mantis

450 Billion events / day

S3 EMR

Event Producer

Fronting Kafka

Consumer Kafka

Stream Consumers

Mantis

S3 EMR

Event Producer

Fronting Kafka

Consumer Kafka

Stream Consumers

Mantis

S3 EMR

Event Producer

Fronting Kafka

Consumer Kafka

Stream Consumers

Mantis

Backpressure

Direct API for Kafka

Improve Cloud Multi-tenancy

What’s Missing?

Backpressure + JobScheduler JobGenerator DStreamGraph

Driver

00 : 00 00 : 01 00 : 02 00 : 03 00 : 04 00 : 05 00 : 06 00 : 07 00 : 08 00 : 09 00 : 10

Unbounded …

Ops Queue

Backpressure

SPARK-7398 – Add backpressure to Spark streaming

SPARK-6691 – Add dynamic rate limiter to Spark streaming

Backpressure + Backpressure implementation slated for

Spark 1.5 release

Spark 1.3

Kafka Integration

Spark 1.2 Receiver based Kafka Integration

2x Faster

Prefetch messages

Connection reuse (pooling)

Enhance

Cloud Multi-tenancy ì

Cloud Scheduler

Mesos Framework

Mesos Slave

Docker

Executor

Task Task

Docker

Executor

Task Task

Mesos Slave

Docker

Spark Driver

Improve

Measuring Spark Streaming Latencies

Spark Streaming Cloud Multi-tenancy

Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)

Data & Analytics

Transcript of Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)

Spark Architecture · Spark Architecture Spark Shuffle ... Spark Shuffle Spark DataFrames . ... – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext

Data Models for Participating Programs - Utahcharm.health.utah.gov/pdf/Data_Models_for_Participating_Programs.pdfStephen W. Clyde, Ph.D. Monal Daxini ... EHDI Recommended action(s)

Spark streaming , Spark SQL

Spark Infrastructure - Australian Securities Exchange · Spark Infrastructure represents Spark Infrastructure Trust and its consolidated entities. Spark Infrastructure RE Limited

Spark SQL | Apache Spark

Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel

Intro to Spark and Spark SQL

Digital Alarm Clock - retawprojects.comretawprojects.com/uploads/FPGA_Alarmclock.pdf · Digital Alarm Clock EHD FPGA Project By: 200801205 Pavan Daxini 200801206 Parth Rao 200801207

Data processing in Apache Spark•Next week`s lecture is about higher level Spark –Scripting and Prototyping in Spark •Spark SQL •DataFrames –Spark Streaming Pelle Jakovits

What is SPARK? - UHgabriel/courses/cosc6339_s17/BDA_11_Spark.pdf · What is SPARK? •In-Memory Cluster ... –Hadoop, –Mesos, •Spark ... Spark Essentials •Spark program has

Spark SQL and DataFrames Spark GraphX Spark Mlib Spark ...Spark GraphX! Spark Mlib! Spark Streaming Lightning-fast cluster computing. Chaining transformations 2. ... Covert RDD to

Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Spark Your Legacy (Spark Summit 2016)

Budapest Spark Meetup - Apache Spark @enbrite.ly

Bosch Industrial Spark Plugs: The Right Spark Plug for ... · Bosch Industrial Spark Plugs: The Right Spark Plug ... series spark plugs dissipates ... Introducing the new Double Ir

Spark Plugs - · PDF fileSpark Plugs Catalogue: Issue 3.2. AC-DELCO ... Spark Plug Specifications ACDelco Regular Spark Plugs Spark Plug ... SPARK PLUGS. An ideal plug for vehicles

Spark Plug Thread Repair Spark Plug Spark Plug Sockets for ...

Spark and Spark SQL - Amir H. Payberah · Spark and Spark SQL Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 1 / 71

BDX 2016- Monal daxini @ Netflix

Spark Concepts - Spark SQL, Graphx, Streaming