Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve?...

43
Using Spark @ Conviva Spark Summit 2013

Transcript of Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve?...

Page 1: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Using Spark @ Conviva

Spark Summit 2013

Page 2: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Summary

Who are we?

What is the problem we needed to solve?

How was Spark essential to the solution?

What can Spark enable us to do in the future?

Page 3: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Some context…

Quality has a huge impact on user engagement in online video

Page 4: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

What we do

Optimize end user online video experience

Provide historical and real-time metrics for online video providers and distributers

Maximize quality through Adaptive Bit Rate (ABR) Content Distribution Network (CDN) switching

Enable Multi-CDN infrastructure with fine grained traffic shaping control

Page 5: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

About four billion streams per month

Mostly premium content providers (e.g., HBO, ESPN, BSkyB, CNTV) but also User Generated Video sites (e.g., Ustream)

Live events (e.g., NCAA March Madness, FIFA World Cup, MLB), short VoDs (e.g., MSNBC), and long VoDs (e.g., HBO, Hulu)

Various streaming protocols (e.g., Flash, SmoothStreaming, HLS, HDS), and devices (e.g., PC, iOS devices, Roku, XBOX, …)

Traffic from all major CDNs, including ISP CDNs (e.g., Verizon, AT&T)

What traffic do we see?

Page 6: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

The Truth

Video delivery over the internet is hard There is a big disconnect between viewers and publishers

Page 7: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

The Viewer’s Perspective

ISP & Home Net

Video Player

Start of the viewer’s perspective

?

Page 8: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Video Source

Encoders & Video Servers

CMS and Hosting Content Delivery

Networks (CDN)

The Provider’s Perspective

?

Page 9: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Video Source

Encoders & Video Servers

CMS and Hosting Content Delivery

Networks (CDN)

ISP & Home Net

Video Player

Closing the loop

Page 10: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

The Truth

Video delivery over the internet is hard CDN variability makes it nearly impossible to deliver high quality

everywhere with just one CDN

SIGCOMM ‘12

Page 11: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

The Truth

Video delivery over the internet is hard CDN variability makes it nearly impossible to deliver high quality all

the time with just one CDN

SIGCOMM ‘12

Page 12: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

The Idea

Where there is heterogeneity, there is room for optimization

For each viewer we want to decide what CDN to stream from

But it’s difficult to model the internet, and things can rapidly change over time

So we will make this decision based on the real-time data that we collect

Page 13: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

CDN1 (buffering ratio)

CDN2 (buffering ratio)

The Idea

Avg. buff ratio of users in Region[1] streaming

from CDN1

Avg. buff ratio of users in Region[1] streaming

from CDN2

For each CDN, partition clients by City

For each partition compute Buffering Ratio

Seatt

le

San

Fran

cisc

o

New

Yor

k

Lond

on

Hon

g Ko

ng

Page 14: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

The Idea

For each partition select best CDN and send clients to this CDN

CDN1 (buffering ratio)

CDN2 (buffering ratio)

Seatt

le

San

Fran

cisc

o

New

Yor

k

Lond

on

Hon

g Ko

ng

Page 15: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

The Idea

For each partition select best CDN and send clients to this CDN

CDN1 (buffering ratio)

CDN2 (buffering ratio)

CDN2 >> CDN1

Seatt

le

San

Fran

cisc

o

New

Yor

k

Lond

on

Hon

g Ko

ng

Page 16: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

The Idea

For each partition select best CDN and send clients to this CDN

CDN1 (buffering ratio)

CDN2 (buffering ratio)

Seatt

le

San

Fran

cisc

o

New

Yor

k

Lond

on

Hon

g Ko

ng

Page 17: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Best CDN (buffering ratio)

The Idea

CDN1 (buffering ratio)

CDN2 (buffering ratio)

For each partition select best CDN and send clients to this CDN

Seatt

le

San

Fran

cisc

o

New

Yor

k

Lond

on

Hon

g Ko

ng

Page 18: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

The Idea

Best CDN (buffering ratio)

CDN1 (buffering ratio)

CDN2 (buffering ratio)

What if there are changes in performance? Se

attle

San

Fran

cisc

o

New

Yor

k

Lond

on

Hon

g Ko

ng

Page 19: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

The Idea

Best CDN (buffering ratio)

CDN1 (buffering ratio)

CDN2 (buffering ratio)

Use online algorithm respond to changes in the network.

Seatt

le

San

Fran

cisc

o

New

Yor

k

Lond

on

Hon

g Ko

ng

Page 20: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Content owners (CMS & Origin)

CDN 1

CDN 2

CDN 3

Clients

Coordinator

Conti

nuou

s m

easu

rem

ents

Busi

ness

Pol

icie

s

control

How?

Coordinator implementing an optimization algorithm that dynamically selects a CDN for each client based on Individual client Aggregate statistics Content owner policies

All based on real-time data

Page 21: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

What processing framework do we use?

Twitter Storm Fault tolerance model affects data accuracy Non-deterministic streaming model

Roll our own Too complex No need to reinvent the wheel

Spark Easily integrates with existing Hadoop architecture Flexible, simple data model Writing map() is generally easier than writing update()

Page 22: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Spark Usage for Production

Decision Maker

Decision Maker

Decision Maker

Compute performance metrics in spark cluster

Relay performance information to decision makers

Spark ClusterSparkNode

Spark Node

Spark Node

Spark Node Spark

Node

Storage Layer

Clients

Page 23: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Results

500 700 900 1100 1300 1500 1700 1900 2100 2300 250075%

80%

85%

90%

95%

100%

Non-buffering views

Average Bitrate (Kbps)

Page 24: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Spark’s Role

Spark development was incredibly rapid, aided both by its excellent programming interface and highly active community

Expressive: Develop complex on-line ML decision based algorithm in ~1000

lines of code Easy to prototype various algorithms

It has made scalability a far more manageable problem

After initial teething problems, we have been running Spark in a production environment reliably for several months.

Page 25: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Problems we faced

Silent crashes…

Often difficult to debug, requiring some tribal knowledge

Difficult configuration parameters, with sometimes inexplicable results

Fundamental understanding of underlying data model was essential to writing effective, stable spark programs

Page 26: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Enforcing constraints on optimization

Imagine swapping clients until an optimal solution is reached

Constrained Best CDN (buffering ratio)

CDN1 (buffering ratio)

CDN2 (buffering ratio)

50%

50%

Page 27: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Enforcing constraints on top of optimization

Solution is found after clients have already joined.

Therefore we need to parameterize solution to clients already seen for online use.

Need to compute an LP on real time data

Spark Supported it 20 LPs Each with 4000 decisions variables and 350 constraints 5 seconds.

Page 28: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Tuning

Can’t select a CDN based solely on one metric. Select utility functions that best predict engagement

Confidence in a decision, or evaluation will depend on how much data we have collected Need to tune time window Select different attributes for separation

Page 29: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Tuning

Need to validate algorithm changes quickly

Simulation of algorithm offline, is essential

Page 30: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Spark Usage for Simulation

SparkNode

Spark Node

Spark Node

Spark Node

Spark Node

HDFS with Production traces

Load production traces with randomized initial decisions

Generate decision table (with artificial delay)

Produce simulated decision set

Evaluate decisions against actual traces to estimate expected quality improvement

Page 31: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

In Summary

Spark was able to support our initial requirement of fast fault tolerant performance computation for an on-line decision maker

New complexities like LP calculation ‘just worked’ in the existing architecture

Spark has become an essential tool in our software stack

Page 32: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Thank you!

Page 33: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Questions?

Page 34: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

What’s Next?

Our pilot Spark application is a success, so what’s next?

Many Spark applications in our roadmap

Unification of real-time, near-real-time, and offline stacks using Spark and Streaming Spark

Page 35: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Current Dual-Stack Architecture

Video Player Agents

Gateway

Real-time video session states

Real-time video metrics

Historical video session states

Historical video metrics

Real-time stack Offline stack

1-2 seconds

ephemeral

In-house streaming system

5min, hourly, daily, monthly

persistent

Run on Hadoop

Video events Control messages

Page 36: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

It looks good, except

The devil is in the details

Page 37: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

What Cause Us Headaches?

Streaming model is complicated

Harder to reason about than batch model: cause more bugs

Non-deterministic: harder to debug

Maintain two code bases is costly

Write code twice

Out-of-sync implementation causes discrepancy

Page 38: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Streaming Spark to The Rescue

Streaming Spark is a “streaming” system with a batch model under the hood

Short summary: Spark and Streaming Spark provide unified batch model, unified API, and unified fault tolerance model

For application, write code once, and run in both real-time and offline mode with different batch sizes

Can provide exactly-once semanticsLines of code for custom stack Lines of code for unified stack

18K real time stack, 10K Hadoop application code, 10k shared code, all in Java

5K shared Spark application code in Scala + Java

Page 39: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Every Design Decision Has A Tradeoff

Streaming Spark does a good job optimizing for small batches

However, we still see a 2x performance degradation.

More work needed on API, e.g., update config file for every batch

Better input fault tolerance especially if input is from Kafka

Dynamic batch size depending on load

Real-time Stack Performance for custom stack

Real-time Stack Performance for unified stack

200K concurrent sessions per c3.2xlarge node

100K concurrent session per c3.2xlarge node

Page 40: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

An Alternative Solution From Twitter

Twitter Storm, Trident, and SummingBird

Storm is the popular streaming framework

Trident provides batch processing and exactly-once semantics

SummingBird provides write-code-once capability

Streaming Spark is designed to do batch processing from scratch.

Twitter systems are probably more battle-tested

Page 41: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Conclusion

Simplicity rules and that’s why we like the simple batch model

Streaming Spark and Spark enable us to unify real-time, near-real-time, and offline stacks under the simple batch model

Extra resource are required, but can be justified by lower code maintenance cost and faster dev iterations

Page 42: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Thanks!

We have many projects planned on Spark platform.

Stop by our booth if you are interested.

Page 43: Using Spark @ Conviva Spark Summit 2013. Summary Who are we? What is the problem we needed to solve? How was Spark essential to the solution? What can.

Questions?