Distributed monitoring

Leon Torres, October 15, 2014

Web Startup Challenges

• Low-friction development

• Hodgepodge of technologies

• Hodgepodge of infrastructures

• Legacy support

• Constant migrations and upgrades

• Bottom line:

High rate of change and no time to check!

A Gordian Knot

• How utilized is our Hadoop cluster?

• How utilized is our DC?

• Are all of our services running correctly?

• Is our latency OK at every layer in the stack?

• Someone changed something: were there any negative ripple effects?

• Are we hitting any scaling issues?

A Network Knot

• Our products live on the internet

• Our data centers are global

– Some of them are virtual

• Network effects are a fact of life

– Network partitions

– Latency makes information late

– Noise is natural and frequent

– Data just goes missing

– High availability compounds the problem

– Richard W. Hamming

Solution Design

• Hypothesize the existence of system state: a time-varying stream of state components

• Build it by measuring our systems in toto

• Stream all measurements to one place

• Gain insight by inspecting this stream both computationally and ad hoc

Separation of Concerns

• State collection

• State computation

• State visualization

Collecting State

• Define a state event ADT capturing:

– Host

– Service

– State

– Timestamp

– Any additional key/value fields

• Find something to collect it
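
The state event fields listed above could be modeled roughly as follows. This is a minimal sketch, not the deck's actual implementation: the class name `StateEvent`, the `attributes` field, and the `to_dict` helper are all hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class StateEvent:
    """One observation of one service on one host (hypothetical shape)."""
    host: str
    service: str
    state: str
    # Seconds since the Unix epoch; defaults to the time of creation.
    timestamp: float = field(default_factory=time.time)
    # Any additional key/value fields attached by the collector.
    attributes: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        """Flatten into a single mapping; core fields win on name clashes."""
        merged = dict(self.attributes)
        merged.update(
            host=self.host,
            service=self.service,
            state=self.state,
            time=self.timestamp,
        )
        return merged


event = StateEvent(host="web01", service="disk /", state="ok",
                   attributes={"dc": "us-west"})
```

Keeping the core fields fixed and pushing everything else into a free-form mapping lets heterogeneous collectors share one event type without agreeing on every field up front.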

Riemann

• Riemann accepts state events as a stream

• Riemann indexes the stream, provides stream processing facilities and some alerting tools

• Also provides downstream pipes:

– Unix domain sockets

– Web sockets

– Graphite stream comes free

– Create your own

http://riemann.io/concepts.html
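
To make the idea of indexing a stream concrete, here is a conceptual sketch of an index that keeps the latest event per (host, service) pair and expires entries whose time-to-live has passed. It is written in Python for illustration only and is not Riemann's implementation; the class name `Index`, the dict-shaped events, and the 60-second default TTL are assumptions.

```python
from typing import Any, Dict, List, Optional, Tuple

Event = Dict[str, Any]


class Index:
    """Latest known event per (host, service), with TTL-based expiry."""

    def __init__(self, default_ttl: float = 60.0) -> None:
        # default_ttl is an arbitrary illustrative value, in seconds.
        self.default_ttl = default_ttl
        self._latest: Dict[Tuple[str, str], Event] = {}

    def update(self, event: Event) -> None:
        """Store the event, replacing any older one for the same key."""
        self._latest[(event["host"], event["service"])] = event

    def lookup(self, host: str, service: str) -> Optional[Event]:
        return self._latest.get((host, service))

    def expire(self, now: float) -> List[Event]:
        """Drop stale entries and return them marked as expired."""
        expired = []
        for key, event in list(self._latest.items()):
            ttl = event.get("ttl", self.default_ttl)
            if event["time"] + ttl < now:
                del self._latest[key]
                expired.append({**event, "state": "expired", "time": now})
        return expired
```

Returning the expired events rather than silently dropping them means that a service that stops reporting becomes a signal in its own right.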

Internal State Relays

• Poll third party monitors for state

• Map to Riemann events

• Send to Riemann

• Fill in holes with custom monitors

– Hadoop jobs, load balancer state, etc.

• Foundation in place to know everything about our global DC state
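
A relay of this kind might look like the sketch below, which maps Nagios-style service records to event dictionaries and hands them to a sender. The record layout (field names as they appear in a Nagios status file), the `relay` function, and the `send` callback are assumptions for illustration; the actual transport to Riemann is deliberately left out.

```python
from typing import Any, Callable, Dict, Iterable

# Standard Nagios service state codes.
NAGIOS_STATES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}


def nagios_to_event(record: Dict[str, Any], now: float) -> Dict[str, Any]:
    """Map one polled Nagios service record to an event dictionary."""
    return {
        "host": record["host_name"],
        "service": record["service_description"],
        # Unrecognized codes are reported as "unknown" rather than dropped.
        "state": NAGIOS_STATES.get(record["current_state"], "unknown"),
        # Fall back to the poll time if the monitor gave no check time.
        "time": record.get("last_check", now),
        "description": record.get("plugin_output", ""),
        "tags": ["nagios"],
    }


def relay(records: Iterable[Dict[str, Any]],
          send: Callable[[Dict[str, Any]], None],
          now: float) -> int:
    """Convert every record and pass it to `send`; return how many were sent."""
    count = 0
    for record in records:
        send(nagios_to_event(record, now))
        count += 1
    return count
```

Because `send` is just a callable, the same relay can be pointed at a real client, a log file, or a list during testing.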

Network Monitors

• Static monitors around the world

– Constantly check HTTP state of services

• Poll third party monitors (Pingdom, etc.)

• Deduce network state from aggregate streams

• Detect outages from user perspective

• Can extend with PhantomJS to get a Gomez-like waterfall and do whatever we want!
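
The following sketch shows one way a static monitor could check a service over HTTP, and how observations from several locations might be combined into a single verdict. The function names, the event fields, and the classification rule (every location failing means an outage, some failing means a partial or network-local problem) are illustrative assumptions, not the rules used in the talk.

```python
import time
import urllib.error
import urllib.request
from typing import Any, Callable, Dict, List


def check_http(url: str, location: str, timeout: float = 5.0,
               opener: Callable[..., Any] = urllib.request.urlopen
               ) -> Dict[str, Any]:
    """Fetch `url` once and describe the result as an event dictionary."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            state = "ok" if 200 <= response.status < 400 else "critical"
    except (urllib.error.URLError, OSError):
        # HTTP errors, DNS failures, refused connections, and timeouts.
        state = "critical"
    return {
        "host": location,
        "service": "http " + url,
        "state": state,
        "metric": time.monotonic() - started,  # elapsed seconds
        "time": time.time(),
    }


def classify(events: List[Dict[str, Any]]) -> str:
    """Combine one service's events from many locations into a verdict."""
    if not events:
        return "unknown"
    failed = sum(1 for event in events if event["state"] != "ok")
    if failed == 0:
        return "ok"
    if failed == len(events):
        return "outage"   # nobody can reach the service
    return "partial"      # only some vantage points fail
```

A "partial" verdict is the interesting case: it suggests the service itself may be healthy while the network path from some regions is not.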

Demo Time

• Ad hoc demo

– Grep the stream

– Quickly analyze state of disk utilization

• Hadoop global state

– It just pipes Nagios data!

• Network monitoring demo

– Let’s combine Pingdom + network monitors

– And iterate! Awesome dashboard

http://cyclops.vpc.supplyframe.com:3000/grep

http://cyclops.vpc.supplyframe.com:3000/disk

http://new-graphite.vpc.supplyframe.com/dashboard/#riemann-hadoop

http://cyclops.vpc.supplyframe.com:3000/pingdom

http://cyclops.vpc.supplyframe.com:3001/

Distributed Gotchas

• Riemann can scale, but with some nasty surprises

– Events on a TCP connection are processed serially

– If event rate gets too high, stream gets saturated and backs up into OS network buffers, then into Netty’s unbounded buffers. This ultimately starves heap and crashes Riemann.

– Solution is to use large connection pools at the clients that push events
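
Since events on one TCP connection are processed serially, spreading events over several connections lets the server work on them in parallel. Below is a minimal sketch of a client-side round-robin pool. The class name `ConnectionPool` and the `connect` factory are hypothetical, and payloads are assumed to be already encoded as bytes in whatever wire format the server expects.

```python
import itertools
import threading
from typing import Any, Callable


class ConnectionPool:
    """Spread payloads over a fixed set of connections, round-robin."""

    def __init__(self, connect: Callable[[], Any], size: int) -> None:
        if size < 1:
            raise ValueError("pool size must be at least 1")
        # Each connection gets its own lock so that two threads never
        # interleave writes on the same socket.
        self._slots = [(connect(), threading.Lock()) for _ in range(size)]
        self._cycle = itertools.cycle(self._slots)
        self._pick_lock = threading.Lock()

    def send(self, payload: bytes) -> None:
        """Send one payload on the next connection in rotation."""
        with self._pick_lock:
            conn, conn_lock = next(self._cycle)
        with conn_lock:
            conn.sendall(payload)

    def close(self) -> None:
        for conn, _ in self._slots:
            conn.close()
```

In practice `connect` could be something like `lambda: socket.create_connection((host, port))`, since socket objects provide both `sendall` and `close`. The pool size is a tuning knob: it bounds how many event streams the server can process concurrently for this client.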

Distributed Gotchas

• Network outages and partitions are difficult

– Riemann must not go down

– Riemann must deal with split-brain

• Highly available SRE solution planned

– Virtual IP, heartbeat (similar to LB solution)

• Riemann servers in separate locations

– End up with two masters on partition => double the alerts but at least we get something

Are we cutting the knot?
