Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using...

61
Real Time Recommendations using Spark Streaming Elliot Chow

Transcript of Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using...

Page 1: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Real Time Recommendations using Spark Streaming

Elliot Chow

Page 2: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Why?

- React more quickly to changes in interest- Time-of-day effects- Real-world events

Page 3: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...
Page 4: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Data Systems

Stream Processing

Recommendation Systems

UI

Feedback Loop

Page 5: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Trends Data

- What people browse: impressions- What people watch: plays

Page 6: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Trends Data - Impressions

Appearance of a video in the viewport

Page 7: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Trends Data - Plays

Member plays a video

Page 8: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Why Spark Streaming?

- Existing Spark infrastructure- Experience with Spark- Batch and Streaming

Page 9: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Components

Page 10: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Join CassandraAggregate S3Transform

Filter

FilterConsume Impressions

Consume Plays

Design

Page 11: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Join CassandraAggregateS3

Transform

Filter

FilterConsume Impressions

Consume Plays

Design

Page 12: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Join Key

“Request Id” - a unique identifier of the source of a play or impression

Page 13: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Design

JoinCassandraAggregate S3Transform

Filter

FilterConsume Impressions

Consume Plays

Page 14: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Output

Video Epoch Plays Impressions

Stranger Things 1 (00:00-00:30) 4 5

Stranger Things 1 (00:00-00:30) 3 6

House Of Cards 2 (00:30-01:00) 8 10

Marseille 2 (00:30-01:00) 3 3

Page 15: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

- Instead of raw counts, output sets of request ids- Count = cardinality of the set of request ids

- Idempotent counting

Output

Page 16: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Join CassandraAggregateS3Transform

Filter

FilterConsume Impressions

Consume Plays

Design

Page 17: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Programming with Spark Streaming

Page 18: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins

Page 19: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Time

- Time to browse and select a video- Batched logging from client application- Delays in data sources

Page 20: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

- Window both plays and impressions by epoch duration- Join the two windows together- Slide by epoch duration

Streaming Joins - Attempt I

t

Plays

Impressions

Page 21: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt I

- Easy to implement- Tight coupling with processing time- Does not mesh well with absolute time windows- Failure can mean loss of all data for the entire window

Epoch 1

Window Start00:15

Epoch 2

Window End00:45

00:00 00:30 01:00

Page 22: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

- Join using mapWithState- Join key is the mapWithState key - State is the plays and impressions sharing the same join key- Use timeouts to expire unjoined data

Page 23: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

R1, I1

Plays & Impressions

MapWithStateRDD

Page 24: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

R1 => { I1 }R1, I1

Plays & Impressions

MapWithStateRDD

Page 25: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

R1 => { I1 }R2, I8

Plays & Impressions

MapWithStateRDD

Page 26: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

R1 => { I1 }R2 => { I8 }R2, I8

Plays & Impressions

MapWithStateRDD

Page 27: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

R1 => { I1 }R2 => { I8 }R1, P1

Plays & Impressions

MapWithStateRDD

Page 28: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

R1 => { I1, P1 }R2 => { I8 }

R1, P1R1, I1R1, P1

Plays & Impressions

MapWithStateRDD

Page 29: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

R1 => { I1, P1 }R2 => { I8 }R3, I5

Plays & Impressions

MapWithStateRDD

Page 30: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

R1 => { I1, P1 }R2 => { I8 }R3, I5R3 => { I5 }

Plays & Impressions

MapWithStateRDD

Page 31: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

R1 => { I1, P1 }R1, I6

R3 => { I5 }

Plays & Impressions

MapWithStateRDD

Page 32: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

R1 => { I1, P1, I6 }R1, I6

R3 => { I5 }

R1, I6

Plays & Impressions

MapWithStateRDD

Page 33: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

R1 => { I1, P1, I6 }...

R3 => { I5 }

Plays & Impressions

MapWithStateRDD

Page 34: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt II

- Make progress every batch- Too much “uninteresting” data- High memory usage- Large checkpoints

Page 35: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - An Observation

t

Plays

Impressions

Page 36: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - An Observation

t

Plays

Impressions

Page 37: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - An Observation

t

Plays

Impressions

Join incoming batch of plays to windowed impressions, and vice versa

Page 38: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - An Observation

t

Plays

Impressions

Slide by batch interval...

Page 39: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - An Observation

t

Plays

Impressions

Slide by batch interval again...

Page 40: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Streaming Joins - Attempt III

- Counts are updated every batch- Uses Spark’s windowing - No checkpoints

Page 41: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

mapWithState

Page 42: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

mapWithState

- Can be used for more than sessionization

Page 43: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

mapWithState

- Can be used for more than sessionization- Be aware of cache evictions

- Lots of state may need to be recomputed

Page 44: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

mapWithState

val input: DStream[(VideoId, RequestId)] = // ...

val spec: StateSpec[VideoId, RequestId, Set[RequestId], (VideoId, Set[RequestId])] = // ...

val output: DStream[(VideoId, Set[RequestId])] = { input. mapWithState(spec)}

Page 45: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

mapWithStateval input: DStream[(VideoId, RequestId)] = // ...

val spec: StateSpec[VideoId, RequestId, Set[RequestId], (VideoId, Set[RequestId])] = // ...

val output: DStream[(VideoId, Set[RequestId])] = { input. mapWithState(spec). groupByKey. mapValues(_.maxBy(_.size))}

Page 46: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

mapWithStateval input: DStream[(VideoId, RequestId)] = // ...

val spec: StateSpec[VideoId, Iterable[RequestId], Set[RequestId], (VideoId, Set[RequestId])] = // ...

val output: DStream[(VideoId, Set[RequestId])] = { input. groupByKey. mapWithState(spec)}

Page 47: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

mapWithStateval input: DStream[(VideoId, RequestId)] = // ...

val spec: StateSpec[VideoId, RequestId, Set[RequestId], Unit] = // ...

val output: DStream[(VideoId, Set[RequestId])] = { input. mapWithState(spec). stateSnapshots}

Page 48: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Productionizing Spark Streaming

Page 49: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Metrics

- Monitoring system health- Aid in diagnosis of issues- Needs to be performant and accurate

Page 50: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Metrics - Option I

- Use “traditional” stream processing metrics - Events/second, bytes/second, …

- Batching can make numbers hard to interpret- Susceptible to recomputation

Page 51: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Metrics - Option II

- Spark Accumulators- Used internally by Spark- Susceptible to recomputation- Unclear when to report the metric

- Can make use of SparkListener & StreamingListener

Page 52: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Metrics - Option III

- Explicit counts on RDDs- Counts will be accurate- Additional latency- Use caching to prevent duplicate work*

Page 53: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Metrics

- Processing time < Batch interval- Time the different parts of the job

- Spark is lazy - may require forcing evaluation- Use Spark UI metrics

Page 54: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Error Handling

- What exceptions cause the streaming job to crash?

Page 55: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Error Handling

- What exceptions cause the streaming job to crash?- Most seem to be caught to keep the job running

- Exception handling is application-specific- Stop-gap: track the elapsed time since the batch started

Page 56: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Future Work

Page 57: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Future Work

- Red/Black deployment with zero data-loss

Page 58: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Future Work

- Red/Black deployment with zero data-loss- Auto-scaling

Page 59: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Future Work

- Red/Black deployment with zero data-loss- Auto-scaling- Improved back pressure per topic

Page 60: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Future Work

- Red/Black deployment with zero data-loss- Auto-scaling- Improved back pressure per topic- Updating broadcast variables

Page 61: Real Time Recommendations using Spark Streaming · 2017-02-02 · Real Time Recommendations using Spark Streaming Elliot Chow. Why? - React more quickly to changes in interest ...

Questions?

We’re hiring!

[email protected]