Netflix viewing data architecture evolution - EBJUG Nov 2014

Post on 02-Jul-2015

description

Netflix's architecture for viewing data has evolved as streaming usage has grown. Each generation was designed for the next order of magnitude, and was informed by learnings from the previous. From SQL to NoSQL, from data center to cloud, from proprietary to open source, look inside to learn how this system has evolved. (slides from a talk given at the East Bay Java Users Group MeetUp in Nov 2014)

Transcript of Netflix viewing data architecture evolution - EBJUG Nov 2014

Who am I?

Philip Fisher-Ogden

• Director of Engineering @ Netflix

• Playback Services (making “click play” work)

• 6 years @ Netflix, from 10 servers to 10,000s

Story

Netflix streaming – 2007 to present

Device Growth

2007: 1 device

2008: 10s of devices

2009: 10s of devices

2010: 100s of devices

2011+: 1000+ devices

Experience Evolution

Subscribers & Viewing

53M global subscribers

50 countries

>2 billion hours viewed per month

Internet Traffic

Improved Personalization

Better Experience

Viewing

Virtuous Cycle

Viewing Data

Who, What, When, Where, How Long

Real time data use cases

What have I watched?

Real time data use cases

Where was I at?

Real time data use cases

What else am I watching?

Session Analytics

Architecture Evolution

Guiding Lights

• “Design for ~10X, but plan to rewrite before ~100X”

– Jeff Dean from Google

Guiding Lights

• “Architecture should match the problem - don't over-engineer from the start; evolve as you grow”

– @randyshoup

Guiding Lights

• “If you don't end up regretting your early technology decisions, you probably over-engineered”

– @randyshoup

Architecture Patterns

• Service oriented

• Command Query Responsibility Segregation

• Event Sourcing

• Polyglot Persistence

Service Oriented

• Encapsulated domain

– Models, Logic, Persistence

• Service Interface

• Monolith -> Microservices

– Evolutionary Design

CQRS

• Separate Commands (updates) from Queries (reads)

• Different conceptual model for write vs. read
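A minimal sketch of the two bullets above, not the Netflix implementation. The class names (`UpdatePosition`, `WriteModel`, `ReadModel`) and the in-memory dictionaries are assumptions made for illustration: commands go to one model, queries to another, and the read side keeps its own shape (positions grouped per customer).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdatePosition:
    """Command: a request to change state (hypothetical name and fields)."""
    customer_id: str
    movie_id: str
    position_sec: int


class ReadModel:
    """Handles queries; shaped for reads (all positions grouped per customer)."""

    def __init__(self):
        self._by_customer = {}

    def project(self, customer_id, movie_id, position_sec):
        # Called by the write side to keep the read-optimized view current.
        self._by_customer.setdefault(customer_id, {})[movie_id] = position_sec

    def positions_for(self, customer_id):
        # Return a copy so callers cannot mutate the view.
        return dict(self._by_customer.get(customer_id, {}))


class WriteModel:
    """Handles commands; stores one row per (customer, movie)."""

    def __init__(self, read_model):
        self._rows = {}
        self._read_model = read_model

    def handle(self, cmd: UpdatePosition) -> None:
        self._rows[(cmd.customer_id, cmd.movie_id)] = cmd.position_sec
        # Project the change into the read side's own representation.
        self._read_model.project(cmd.customer_id, cmd.movie_id, cmd.position_sec)


reads = ReadModel()
writes = WriteModel(reads)
writes.handle(UpdatePosition("c1", "m1", 120))
writes.handle(UpdatePosition("c1", "m2", 30))
writes.handle(UpdatePosition("c1", "m1", 300))
```

Here the projection happens synchronously inside `handle`; in a larger system the read side is often updated asynchronously, which trades freshness for write latency.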

Event Sourcing

• Persist immutable events, not updatable state

• Replay events to determine state

• Optimize via snapshots, materialized views
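The three bullets above can be sketched as a fold over an append-only log. This is an illustrative sketch only: the event types (`start`, `heartbeat`, `stop`), the field names, and the shape of the state dictionary are assumptions, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybackEvent:
    """Immutable event; the type and field names are illustrative."""
    event_type: str      # "start", "heartbeat", or "stop"
    timestamp: int       # seconds since epoch
    position_sec: int    # playback position when the event was emitted


def apply(state: dict, event: PlaybackEvent) -> dict:
    """Return a new state with one event folded in; the input is not mutated."""
    return {
        "active": event.event_type != "stop",
        "position_sec": event.position_sec,
        "last_update": event.timestamp,
    }


def replay(events, snapshot=None):
    """Rebuild state by folding events, optionally starting from a snapshot."""
    state = snapshot or {"active": False, "position_sec": 0, "last_update": 0}
    for event in sorted(events, key=lambda e: e.timestamp):
        state = apply(state, event)
    return state


log = [
    PlaybackEvent("start", 100, 0),
    PlaybackEvent("heartbeat", 160, 60),
    PlaybackEvent("stop", 220, 118),
]
full_state = replay(log)                          # replay everything
snapshot = replay(log[:2])                        # state saved part-way through
resumed = replay(log[2:], snapshot=snapshot)      # replay only the tail
```

Replaying from the snapshot gives the same state as replaying the whole log, which is what makes snapshots a safe optimization.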

Polyglot Persistence

• Different persistence technology for different use cases

• Flexibility vs. Complexity cost trade-off

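Generic Architecture

[Diagram: Start and Stop requests flow through three stages: Collect, Process, Provide. Process is labeled with Event Stream, Stream State, and Session Summary. Provide serves Active Sessions, Last Position, Viewing History, and a Data Feed.]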

Conceptual Data Model

View (1:1 with ViewRecordKey)

ViewRecordKey

• CustomerID

• Movie

• Device

• Start Timestamp

Conceptual Data Model

ActiveSession (one per View, 1:1)

• ViewRecordKey

• SessionDetails

SessionDetails

• Source of Play

• Start Position

• Latest Position

• Latest Duration

• Last Update Timestamp

Conceptual Data Model

EventLog (zero or more per View, 1:0..*)

• ViewRecordKey

• EventType

• EventTimestamp

• EventDetails

Conceptual Data Model

[Diagram: Summarize. A View (ViewRecordKey: CustomerID, Movie, Device, Start Timestamp) with several EventLog entries (ViewRecordKey, EventType, EventTimestamp, EventDetails), summarized into SessionDetails (Source of Play, Start Position, Latest Position, Latest Duration, Last Update Timestamp).]

Conceptual Data Model

[Diagram: the View (ViewRecordKey, EventLog entries, SessionDetails) shown alongside two further records, ViewingRecord and Position.]

ViewingRecord

• ViewRecordKey

• Duration

• Position

• Last Modified Timestamp

Position

• CustomerID

• Movie

• Latest Position

Conceptual Data Model

Viewing History (keyed by CustomerID)

• Collection of ViewingRecord entries (ViewRecordKey, Duration, Position, Last Modified Timestamp)

Latest Positions (keyed by CustomerID)

• Collection of Position entries (CustomerID, Movie, Latest Position)
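The conceptual model can be written down as plain record types. This is a sketch under assumptions: the field types (strings and integer timestamps) are guesses, and the helper `latest_positions`, which derives one Position per customer and movie from ViewingRecords, is a hypothetical illustration of how the two collections might relate rather than something stated in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewRecordKey:
    customer_id: str
    movie: str
    device: str
    start_timestamp: int


@dataclass
class SessionDetails:
    source_of_play: str
    start_position: int
    latest_position: int
    latest_duration: int
    last_update_timestamp: int


@dataclass
class ActiveSession:            # one per view (1:1)
    key: ViewRecordKey
    details: SessionDetails


@dataclass(frozen=True)
class EventLog:                 # zero or more per view (1:0..*)
    key: ViewRecordKey
    event_type: str
    event_timestamp: int
    event_details: str


@dataclass
class ViewingRecord:
    key: ViewRecordKey
    duration: int
    position: int
    last_modified_timestamp: int


@dataclass
class Position:
    customer_id: str
    movie: str
    latest_position: int


def latest_positions(history):
    """Keep the most recently modified ViewingRecord per (customer, movie)."""
    newest = {}
    for rec in history:
        k = (rec.key.customer_id, rec.key.movie)
        if k not in newest or rec.last_modified_timestamp > newest[k].last_modified_timestamp:
            newest[k] = rec
    return [Position(c, m, r.position) for (c, m), r in newest.items()]
```

Making `ViewRecordKey` frozen lets it serve as a dictionary key, which matches its role as the lookup key in the command and query tables.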

Command Use Cases

Action | Operation | Key | DataSet
Start | Insert | ViewRecordKey | ActiveSession, ViewingRecord
Continue (heartbeat) | Update | ViewRecordKey | ActiveSession
Log | Insert | ViewRecordKey | EventLog
Stop | Update | ViewRecordKey | ActiveSession, ViewingRecord
Snapshot | Insert/Update | CustomerID | ViewingHistory, Positions

Query Use Cases

Query | Operation | Key | DataSet
Currently watching? | Select/Read | ViewRecordKey | ActiveSession
Current position? | Select/Read | ViewRecordKey | ActiveSession
Current position? | Select/Read | CustomerID | Positions
All positions? | Select/Read | CustomerID | Positions
All history? | Select/Read | CustomerID | ViewingHistory

Architecture Evolution

• Different generations

• Pain points & learnings

• Re-architecture motivations

Real Time Data

[Timeline: 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, Future. Rows: SQL, NoSQL, Caching (redis, memcached).]

Real Time Data – gen 1

[Timeline: 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, Future. Rows: SQL, NoSQL, Caching (redis, memcached).]

Real Time Data – gen 1

[Diagram: Start and Stop requests write Sessions, Logs / Events, and History / Position data to SQL.]

Real Time Data – gen 1 pain points

• Scalability

– DB scaled up not out

• Event Data Analytics

– ad hoc

• Fixed schema

Real Time Data – gen 2

[Timeline: 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, Future. Rows: SQL, NoSQL, Caching (redis, memcached).]

Real Time Data – gen 2 motivations

• Scalability

– Scale out not up

• Flexible schema

– Key/value attributes

• Service oriented

Real Time Data – gen 2

[Diagram: Start and Stop requests go to the Viewing Service, which stores data in NoSQL across 50 data partitions.]

Real Time Data – gen 2 pain points

• Scale out

– Resharding was painful

• Performance

– Hot spots

• Disaster Recovery

– SimpleDB had no backups
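To see why resharding can hurt, consider a sketch that assumes keys are mapped to partitions by hash modulo the partition count. The slides do not say how gen 2 assigned keys to its 50 partitions, so the scheme, the function names, and the target of 60 partitions are all assumptions for illustration.

```python
import hashlib


def partition_for(customer_id: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash and a modulo."""
    digest = hashlib.md5(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def fraction_moved(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose partition changes when the count changes."""
    moved = sum(
        1 for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    )
    return moved / len(keys)


customers = [f"customer-{i}" for i in range(10_000)]
moved = fraction_moved(customers, old_count=50, new_count=60)
print(f"{moved:.0%} of keys change partition going from 50 to 60")
```

Under this scheme a key stays put only when its hash gives the same remainder for both counts, which for 50 and 60 happens about one time in six, so most of the data has to be copied. Schemes such as consistent hashing exist to reduce that movement.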

Real Time Data – gen 3

[Timeline: 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, Future. Rows: SQL, NoSQL, Caching (redis, memcached).]

Real Time Data – gen 3 landscape

• Cassandra 0.6

• Before SSDs in AWS

• Netflix in 1 AWS region

Real Time Data – gen 3 motivations

• Order of magnitude increase in requests

• Scalability

– Actually scale out rather than up

Real Time Data – gen 3

Viewing Service

• Stateful Tier (nodes 0, 1, … n-2, n-1)

– Active Sessions

– Latest Positions

– View Summary

• Stateless Tier (fallback)

Data stores

• Sessions

• Viewing History

• Memcached

Real Time Data – gen 3 writes

[Diagram sequence, one step per slide]

• Start and Stop requests arrive at the Viewing Service Stateful Tier (nodes 0, 1, … n-2, n-1)

• The Stateful Tier holds Active Sessions, Latest Positions, and View Summary

• Each request updates that state

• State is snapshotted to Sessions

• Viewing History and Memcached are updated

• The Stateless Tier (fallback) works against Sessions, Viewing History, and Memcached

Real Time Data – gen 3 reads

• What have I watched? Served by the Stateless Tier from Viewing History and Memcached

• Where was I at? Served by the Stateful Tier from Latest Positions; the Stateless Tier (fallback) uses Viewing History and Memcached

• What else am I watching? Served by the Stateful Tier from Active Sessions
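The "Where was I at?" read path, with its fallback, might look roughly like the sketch below. Everything here is hypothetical: the `TierUnavailable` exception, the callable standing in for the stateful tier, and the plain dictionaries standing in for Memcached and Viewing History. It only illustrates the order of lookups.

```python
class TierUnavailable(Exception):
    """Raised by the hypothetical stateful tier when a node cannot answer."""


def read_latest_position(customer_id, movie, stateful, cache, history_store):
    """Try the stateful tier, then fall back to cache, then the durable store."""
    try:
        return stateful(customer_id, movie)
    except TierUnavailable:
        pass  # fall through to the stateless path

    key = (customer_id, movie)
    if key in cache:
        return cache[key]

    position = history_store.get(key, 0)   # 0 means "never watched"
    cache[key] = position                   # populate the cache for next time
    return position


def healthy_tier(customer_id, movie):
    return 42


def failing_tier(customer_id, movie):
    raise TierUnavailable()


cache = {}
store = {("c1", "m1"): 300}
from_stateful = read_latest_position("c1", "m1", healthy_tier, cache, store)
from_fallback = read_latest_position("c1", "m1", failing_tier, cache, store)
```

The fallback answer can be older than what the stateful tier would have returned, so the design trades freshness for availability when a stateful node is down.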

Architecture Patterns - Discuss

• Service oriented

• Command Query Responsibility Segregation

• Event Sourcing

• Polyglot Persistence

gen 3 - Requests Scale

Operation | Scale
Create (start streaming) | 1,000s per second
Update (heartbeat, close) | 100,000s per second
Append (session events/logs) | 10,000s per second
Read viewing history | 10,000s per second
Read latest position | 100,000s per second

gen 3 – Cluster Scale

Cluster | Scale
Cassandra Viewing History | ~100 hi1.4xl nodes, ~48 TB total space used
Viewing Service Stateful Tier | ~1700 r3.2xl nodes, 50 GB heap memory per node
Memcached | ~450 r3.2xl/xl nodes, ~8 TB memory used

Real Time Data – gen 3 pain points

• Stateful tier

– Hot spots

– Multi-region complexity

• Monolithic service

• Read-modify-write is poorly suited to memcached
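The slides do not spell out why read-modify-write fits a cache poorly; one common reason is the lost update when two writers interleave. The sketch below uses a small in-memory stand-in (`VersionedCache`, a hypothetical class, not a memcached client) whose `gets` and `cas` loosely mirror the memcached commands of the same names.

```python
class VersionedCache:
    """In-memory stand-in for a cache that supports compare-and-set."""

    def __init__(self):
        self._data = {}   # key -> (version, value)

    def gets(self, key):
        """Return (value, version); version 0 means the key is absent."""
        version, value = self._data.get(key, (0, None))
        return value, version

    def set(self, key, value):
        """Unconditional write: last writer wins."""
        version, _ = self._data.get(key, (0, None))
        self._data[key] = (version + 1, value)

    def cas(self, key, value, expected_version):
        """Write only if nobody else has written since expected_version."""
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False
        self._data[key] = (version + 1, value)
        return True


cache = VersionedCache()
cache.set("history:c1", ["m1"])

# Two writers read the same list, then each appends and writes back.
a_value, a_version = cache.gets("history:c1")
b_value, b_version = cache.gets("history:c1")
cache.set("history:c1", a_value + ["m2"])
cache.set("history:c1", b_value + ["m3"])      # overwrites writer A's append
lost_update_result, _ = cache.gets("history:c1")

# With compare-and-set, the stale writer is rejected and must retry.
cache.set("history:c1", ["m1"])
a_value, a_version = cache.gets("history:c1")
b_value, b_version = cache.gets("history:c1")
a_ok = cache.cas("history:c1", a_value + ["m2"], a_version)
b_ok = cache.cas("history:c1", b_value + ["m3"], b_version)
```

Compare-and-set avoids the silent overwrite, but every rejected write costs another read and another round trip, and the whole value is shipped both ways each time.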

Real Time Data – gen 3 learnings

• Distributed stateful systems are hard

– Go stateless, use C*/memcached/redis…

• Decompose into microservices

Real Time Data – gen 4

[Diagram: Viewing Service with Stateful Tier (nodes 0, 1, … n-2, n-1; Active Sessions, Latest Positions, View Summary) and Stateless Tier (fallback), plus Viewing History, Sessions, and Memcached.]

Real Time Data – gen 4

(Work in progress)

Microservices: Components as Services

Collector

Processor

Provider

Events

Queries

Microservices: Decoupled Communication

Collector, Processor, Provider

Events, Materialized Views

Signals

Request Processing Design

• Dimensions:

– Response required?

– Latency target?

– Where?

• In process

• Remote process

Request Processing Design

Task type | Response | Latency | Where
Low-latency tasks | Required | Low | In-process
Medium-latency async tasks | Not required | Medium | In-process
High-latency async tasks | Not required | High | Remote process

Start Streaming Example

[Diagram: Start and Stop requests pass through low-latency tasks, medium-latency async tasks, and high-latency async tasks, backed by Viewing History and Sessions.]

Start Streaming

• Check Active Sessions within Account Limits

• Persist session

• Enqueue “Save to Viewing History”

• Within limit, respond OK.

• Save to Viewing History (asynchronous)
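The Start Streaming steps (check the limit, persist the session, enqueue the history save, respond OK, save asynchronously) can be sketched with an in-process queue and a background worker. The names, the `max_streams=2` limit, and the dictionaries standing in for the Sessions and Viewing History stores are assumptions for illustration only.

```python
import queue
import threading


class LimitExceeded(Exception):
    """Raised when an account already has the maximum number of streams."""


def start_streaming(account_id, session_id, sessions, history_queue, max_streams=2):
    """Do only the low-latency work inline; defer the history write."""
    active = sessions.setdefault(account_id, set())
    if len(active) >= max_streams:               # check active sessions against the limit
        raise LimitExceeded(account_id)
    active.add(session_id)                       # persist the session
    history_queue.put((account_id, session_id))  # enqueue "save to viewing history"
    return "OK"                                  # respond without waiting for the save


def history_worker(history_queue, viewing_history):
    """Background consumer that saves queued entries to viewing history."""
    while True:
        item = history_queue.get()
        if item is None:                         # sentinel: stop the worker
            history_queue.task_done()
            break
        account_id, session_id = item
        viewing_history.setdefault(account_id, []).append(session_id)
        history_queue.task_done()


sessions, viewing_history = {}, {}
pending = queue.Queue()
worker = threading.Thread(
    target=history_worker, args=(pending, viewing_history), daemon=True
)
worker.start()

first = start_streaming("acct", "s1", sessions, pending)
second = start_streaming("acct", "s2", sessions, pending)
```

The caller gets its response as soon as the session is recorded; the history write happens later on the worker thread, which is the medium-latency, in-process, no-response-required case from the request processing table.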

Session Interactions

Start

Stop

Collectors* | Processors*

Viewing History

Session Events

Positions

Session Summary Example

• End playback

• Summarize session

Session Summary

[Diagram: Start and Stop requests reach the Collector and Processor through low-latency blocking tasks and medium-latency async tasks; the Session Summarizer runs as a high-latency async task.]

Session Summarizer steps over Session Events

• Retrieve by Session Key

• Order

• Summarize

• Write to Viewing History and Positions
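The summarizer steps (Retrieve by Session Key, Order, Summarize, then write to Viewing History and Positions) can be sketched as follows. The event dictionaries, the tuple used as a session key, and the choice to compute duration as elapsed wall-clock time between the first and last event are all assumptions, not details from the talk.

```python
from collections import defaultdict


def summarize_session(event_log, session_key):
    """Retrieve one session's events, order them, and reduce to a summary."""
    events = [e for e in event_log if e["session_key"] == session_key]  # retrieve by key
    if not events:
        return None
    events.sort(key=lambda e: e["timestamp"])                           # order
    first, last = events[0], events[-1]
    return {                                                             # summarize
        "session_key": session_key,
        "start_position": first["position"],
        "latest_position": last["position"],
        "duration": last["timestamp"] - first["timestamp"],
        "last_update_timestamp": last["timestamp"],
    }


def publish(summary, viewing_history, positions):
    """Write the summary to the two read-side views."""
    customer_id, movie = summary["session_key"][:2]
    viewing_history[customer_id].append(summary)
    positions[(customer_id, movie)] = summary["latest_position"]


# Hypothetical key layout: (customer_id, movie, device, start_timestamp).
key = ("c1", "m1", "tv", 100)
event_log = [
    {"session_key": key, "timestamp": 220, "position": 118},
    {"session_key": key, "timestamp": 100, "position": 0},
    {"session_key": ("c2", "m9", "tv", 50), "timestamp": 60, "position": 5},
    {"session_key": key, "timestamp": 160, "position": 60},
]
viewing_history, positions = defaultdict(list), {}
summary = summarize_session(event_log, key)
publish(summary, viewing_history, positions)
```

Sorting before summarizing matters because events can arrive out of order, as in the sample log above.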

Architecture Patterns - Discuss

• Service oriented

• Command Query Responsibility Segregation

• Event Sourcing

• Polyglot Persistence

Takeaways

• Architectural Patterns

• Evolutionary Design

– Evolve as you grow

• Re-architect for order of magnitude shifts

Questions?

@philip_pfo

Feedback?

@philip_pfo

Thanks!

@philip_pfo