Monitorama 2015 Monitoring OpenConnect CDN

Post on 28-Jul-2015


Monitoring OpenConnect CDN

Sergey Fedorov, Netflix, Monitorama 2015

What is OpenConnect

36.5% of US downstream traffic *

* 2015 Sandvine report

OpenConnect Cache Appliance

● Space/power optimized
● 10/40 Gbps network interface
● FreeBSD OS
● NGINX server
● BIRD routing proxy

Gizmodo, “This box can hold an entire Netflix”
http://gizmodo.com/this-box-can-hold-an-entire-netflix-1592590450

Network

Transit

Internet Exchange

ISP embedded


Intelligent clients

Control Plane

end-user content request router

● client location
● network conditions
● server utilization
● content distribution


Who we are

Sergey Fedorov, Stefan Praszalowicz

Monitoring challenge

Testing in prod*

● Network changes
● Firmware deployments
● App pushes
● Updating content
● ...


Caches

Clients

Control Plane

Microservices

Network

Capacity

Config

Content

Telemetry (Atlas)

Logs (Elasticsearch)

Data sources

METRICS

Something breaks all the time

Big problems start small

Context matters


Small SRE team

Elastic

How we do it

Data sources: Netflix, Clients, Caches, Network, Config, ...

Data processing: stream processors, pollers

State processing: FSM

Orchestration

FSM walkthrough (animation frames)

threshold=75%

Diagram labels: MAINTENANCE, start fixing, end fixing

Input actions shown, in order (action, from):
● ok, cpu (4 frames)
● silence, config
● ok, cpu
● silence, config
● break, cpu (6 frames)
● ok, cpu
● unsilence, config
● ok, cpu (4 frames)
● break, cpu (2 frames)
● start_fix, user
● break, cpu (7 frames)
● ok, cpu (2 frames)
● end_fix, user
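The walkthrough above feeds a per-metric state machine with input actions (ok, break, silence, unsilence, start_fix, end_fix) that come from a metric, from config, or from a user. A minimal Python sketch of that idea follows. The slides only name the MAINTENANCE state and the action names; the other state names (OK, BROKEN, SILENCED), the transition table, and the names MetricFsm and action_for_cpu are assumptions made for illustration, not the system described in the talk.

```python
from dataclasses import dataclass, field

# Hypothetical state names: only MAINTENANCE appears in the FSM slides.
OK, BROKEN, SILENCED, MAINTENANCE = "OK", "BROKEN", "SILENCED", "MAINTENANCE"

# Assumed transition table: (current state, input action) -> new state.
# Any pair not listed here leaves the state unchanged, so metric actions
# are ignored while a cache is silenced or under maintenance.
TRANSITIONS = {
    (OK, "break"): BROKEN,
    (BROKEN, "ok"): OK,
    (OK, "silence"): SILENCED,
    (BROKEN, "silence"): SILENCED,
    (SILENCED, "unsilence"): OK,
    (OK, "start_fix"): MAINTENANCE,
    (BROKEN, "start_fix"): MAINTENANCE,
    (MAINTENANCE, "end_fix"): OK,
}


def action_for_cpu(cpu_percent, threshold=75.0):
    """Turn a CPU sample into an input action using the 75% threshold."""
    return "break" if cpu_percent > threshold else "ok"


@dataclass
class MetricFsm:
    state: str = OK
    history: list = field(default_factory=list)

    def feed(self, action, source):
        """Apply one input action and record the transition if the state changed."""
        old = self.state
        new = TRANSITIONS.get((old, action), old)
        if new != old:
            self.history.append((old, new, action, source))
        self.state = new
        return new
```

Feeding the sequence ("break", "cpu"), ("start_fix", "user"), ("break", "cpu"), ("end_fix", "user") into a fresh MetricFsm records three transitions under these assumptions: the second "break" arrives during MAINTENANCE and changes nothing.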


Events processing

Triggers an event

STATE TRANSITION EVENT
● Old state
● New state
● Input action
● Metric name
● Action metadata
  ○ metric value
  ○ comments
  ○ tags
  ○ timestamp
  ○ ...

Rules

Event handlers
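The event fields and the rules-to-handlers routing above can be sketched in Python as follows. The field names mirror the slide; the class names StateTransitionEvent and Rule and the dispatch function are hypothetical, and the matching logic is an assumption about how rules might select handlers.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class StateTransitionEvent:
    old_state: str
    new_state: str
    input_action: str
    metric_name: str
    # Action metadata: metric value, comments, tags, timestamp, ...
    metadata: dict = field(default_factory=dict)


@dataclass
class Rule:
    # Predicate deciding whether this rule applies to an event (assumed shape).
    matches: Callable[[StateTransitionEvent], bool]
    # Handler invoked when the rule matches.
    handler: Callable[[StateTransitionEvent], None]


def dispatch(event, rules):
    """Run the handler of every matching rule; return how many handlers ran."""
    fired = 0
    for rule in rules:
        if rule.matches(event):
            rule.handler(event)
            fired += 1
    return fired
```

Keeping rules as plain predicate and handler pairs means a new notification channel is one more Rule in the list, with no change to the state machine that emits the events.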


Events priority

Figure: priority matrix
● Severity axis: Info, Notice, Warning, Critical
● Escalation axis: 0, 1, 2, 3
● Priority regions: Do Never, Do Last, Do Next, Do Now

Notifications
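One way to read the priority matrix is as a function from (severity, escalation) to a priority label. The transcript does not preserve which cells map to which label, so the scoring and cut-offs in this Python sketch are assumptions chosen only to illustrate the shape of such a mapping; Severity and priority are hypothetical names.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0
    NOTICE = 1
    WARNING = 2
    CRITICAL = 3


def priority(severity, escalation):
    """Map severity and escalation level (0..3) to a priority label.

    Assumption: priority grows with severity + escalation. The real cell
    boundaries of the matrix are not recoverable from the slides.
    """
    if not 0 <= escalation <= 3:
        raise ValueError("escalation must be between 0 and 3")
    score = int(severity) + escalation  # ranges from 0 to 6
    if score <= 1:
        return "Do Never"
    if score <= 3:
        return "Do Last"
    if score == 4:
        return "Do Next"
    return "Do Now"
```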


Aggregation

Cache state = aggregation of states of its metrics

Cluster state = aggregation of states of its caches

● OK: all OK
● DEGRADED: some BROKEN or DEGRADED
● BROKEN: most BROKEN

Examples:
● All caches are OK → cluster state is OK
● 2/12 caches are BROKEN → cluster state is DEGRADED
● 7/12 caches are BROKEN → cluster state is BROKEN
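The aggregation rule can be written as a small Python function. This is a sketch: the slides say "most BROKEN" without a number, so reading "most" as "more than half" is an assumption (exposed as the broken_majority parameter), and the function name aggregate is hypothetical. The same function would apply at both levels, metrics to cache and caches to cluster.

```python
from collections import Counter


def aggregate(child_states, broken_majority=0.5):
    """Aggregate child states (metrics of a cache, or caches of a cluster).

    OK       -> all children OK
    DEGRADED -> some children BROKEN or DEGRADED
    BROKEN   -> most children BROKEN (assumed: more than broken_majority)
    """
    states = list(child_states)
    if not states:
        raise ValueError("cannot aggregate an empty set of states")
    counts = Counter(states)
    if counts["BROKEN"] / len(states) > broken_majority:
        return "BROKEN"
    if counts["BROKEN"] or counts["DEGRADED"]:
        return "DEGRADED"
    return "OK"
```

With this threshold the three slide examples come out as stated: 12 OK caches give OK, 2 BROKEN out of 12 give DEGRADED, and 7 BROKEN out of 12 give BROKEN.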


Challenges

● Setup
● Predefined groupings
● UI
● Issues correlation
● Failure forecasting
● OSS

Feedback

jobs.netflix.com/jobs/1693/

jobs.netflix.com/jobs/2240/

Sergey Fedorov
OpenConnect, Netflix
sfedorov@netflix.com