Distributed monitoring

Leon Torres October 15, 2014


Page 1: Distributed monitoring

Leon Torres October 15, 2014

Page 2: Distributed monitoring

Web Startup Challenges

• Low-friction development

• Hodgepodge of technologies

• Hodgepodge of infrastructures

• Legacy support

• Constant migrations and upgrades

• Bottom line:

High rate of change and no time to check!

Page 3: Distributed monitoring
Page 4: Distributed monitoring

A Gordian Knot

• How utilized is our Hadoop cluster?

• How utilized is our DC?

• Are all of our services running correctly?

• Is our latency OK at every layer in the stack?

• Someone changed something: were there any negative ripple effects?

• Are we hitting any scaling issues?

Page 5: Distributed monitoring

A Network Knot

• Our products live on the internet

• Our data centers are global

– Some of them are virtual

• Network effects are a fact of life

– Network partitions

– Latency makes information late

– Noise is natural and frequent

– Data just goes missing

– High availability compounds the problem

Page 6: Distributed monitoring
Page 7: Distributed monitoring
Page 8: Distributed monitoring

– Richard W. Hamming

Page 9: Distributed monitoring

Solution Design

• Hypothesize existence of system state: a time-varying stream of state components

• Build it by measuring our systems in toto

• Stream all measurements to one place

• Gain insight by inspecting this stream both computationally and ad hoc
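
The two styles of inspection named above can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the talk: the dictionary event shape, the `metric` field, and the function names `grep_stream` and `mean_metric_by_host` are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical event shape: host, service, state, time, metric.
Event = Dict[str, object]


def grep_stream(events: Iterable[Event],
                predicate: Callable[[Event], bool]) -> Iterator[Event]:
    """Ad hoc inspection: lazily yield only the events that match."""
    return (e for e in events if predicate(e))


def mean_metric_by_host(events: Iterable[Event]) -> Dict[str, float]:
    """Computational inspection: average the numeric metric per host."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for e in events:
        totals[e["host"]] += float(e["metric"])
        counts[e["host"]] += 1
    return {host: totals[host] / counts[host] for host in totals}


# Made-up sample measurements, standing in for the central stream.
sample = [
    {"host": "web1", "service": "disk /", "state": "ok", "time": 1, "metric": 0.40},
    {"host": "web1", "service": "disk /", "state": "warning", "time": 2, "metric": 0.80},
    {"host": "db1", "service": "disk /", "state": "ok", "time": 1, "metric": 0.50},
    {"host": "db1", "service": "load", "state": "ok", "time": 1, "metric": 1.20},
]

disk = list(grep_stream(sample, lambda e: e["service"].startswith("disk")))
disk_by_host = mean_metric_by_host(disk)
```

Because every measurement lands in one stream, the same filter works for any host or service without knowing which tool produced the data.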

Page 10: Distributed monitoring

Separation of Concerns

• State collection

• State computation

• State visualization

Page 11: Distributed monitoring

Collecting State

• Define a state event ADT capturing:

– Host

– Service

– State

– Timestamp

– Any additional key/value fields

• Find something to collect it
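
A state event ADT with the fields listed above might look like the following Python sketch. The class name `StateEvent`, the `attributes` field name, and the `to_dict` helper are assumptions for illustration; the slides only name the fields themselves.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class StateEvent:
    """One observation of one service on one host at one moment."""
    host: str
    service: str
    state: str
    timestamp: float = field(default_factory=time.time)
    # Any additional key/value fields (hypothetical field name).
    attributes: Dict[str, str] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, object]:
        """Flatten to a single mapping; core fields win over attributes."""
        flat: Dict[str, object] = dict(self.attributes)
        flat.update(
            host=self.host,
            service=self.service,
            state=self.state,
            timestamp=self.timestamp,
        )
        return flat


event = StateEvent(
    host="web1",
    service="http",
    state="ok",
    timestamp=1413331200.0,
    attributes={"dc": "us-east"},
)
```

Keeping the core fields fixed and pushing everything else into key/value pairs lets very different monitors share one event type.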

Page 12: Distributed monitoring

Riemann

• Riemann accepts state events as a stream

• Riemann indexes the stream, provides stream processing facilities and some alerting tools

• Also provides downstream pipes:

– Unix domain sockets

– Web sockets

– Graphite stream comes free

– Create your own
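
To make "indexes the stream" concrete, here is a simplified Python model of an index that keeps the latest event per (host, service) pair and expires stale entries. It is a sketch of the idea only, not Riemann's implementation; the `StateIndex` class and its method names are hypothetical, and the `time` and `ttl` keys are assumed to be present on every event.

```python
from typing import Dict, List, Optional, Tuple

Event = Dict[str, object]


class StateIndex:
    """Latest known event for each (host, service) pair."""

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, str], Event] = {}

    def update(self, event: Event) -> None:
        """Store the event unless a newer one is already indexed."""
        key = (event["host"], event["service"])
        current = self._latest.get(key)
        if current is None or event["time"] >= current["time"]:
            self._latest[key] = event

    def lookup(self, host: str, service: str) -> Optional[Event]:
        return self._latest.get((host, service))

    def expire(self, now: float) -> List[Event]:
        """Remove and return events whose time + ttl is in the past."""
        expired = [e for e in self._latest.values()
                   if e["time"] + e["ttl"] < now]
        for e in expired:
            del self._latest[(e["host"], e["service"])]
        return expired


index = StateIndex()
index.update({"host": "web1", "service": "http", "state": "ok", "time": 100, "ttl": 60})
index.update({"host": "web1", "service": "http", "state": "critical", "time": 130, "ttl": 60})
# A late-arriving older event does not overwrite the newer state.
index.update({"host": "web1", "service": "http", "state": "ok", "time": 110, "ttl": 60})
```

Expired entries are worth treating as a signal in their own right: a service that stops reporting is often a service that is down.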

Page 13: Distributed monitoring

Internal State Relays

• Poll third party monitors for state

• Map to Riemann events

• Send to Riemann

• Fill in holes with custom monitors

– Hadoop jobs, load balancer state, etc.

• Foundation in place to know everything about our global DC state
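
The poll, map, send loop above can be sketched as follows. This is an illustrative Python outline rather than the relay described in the talk: `poll` and `send` are hypothetical callables standing in for a real monitor reader and a real Riemann client, and the input keys are assumed to follow Nagios service status naming.

```python
from typing import Callable, Dict, Iterable, List

Check = Dict[str, object]
Event = Dict[str, object]

# Nagios service state codes and their conventional names.
NAGIOS_STATES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}


def to_event(check: Check) -> Event:
    """Map one third-party check result onto the common event shape."""
    return {
        "host": check["host_name"],
        "service": check["service_description"],
        "state": NAGIOS_STATES.get(check["current_state"], "unknown"),
        "time": check["last_check"],
        "description": check.get("plugin_output", ""),
    }


def relay_once(poll: Callable[[], Iterable[Check]],
               send: Callable[[Event], None]) -> int:
    """Poll the monitor once, forward every result, return the count."""
    count = 0
    for check in poll():
        send(to_event(check))
        count += 1
    return count


# Stand-ins for a real poller and a real event sink.
def fake_poll() -> List[Check]:
    return [
        {"host_name": "nn1", "service_description": "hdfs capacity",
         "current_state": 1, "last_check": 1413331200,
         "plugin_output": "82% used"},
        {"host_name": "lb1", "service_description": "vip health",
         "current_state": 0, "last_check": 1413331205},
    ]


outbox: List[Event] = []
sent = relay_once(fake_poll, outbox.append)
```

Passing the poller and the sender in as plain callables keeps each relay small, so a custom monitor only needs its own `poll` function.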

Page 14: Distributed monitoring

Network Monitors

• Static monitors around the world

– Constantly check HTTP state of services

• Poll third party monitors (Pingdom, etc.)

• Deduce network state from aggregate streams

• Detect outages from user perspective

• Can extend with PhantomJS to get a Gomez-like waterfall and do whatever we want!
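
One way to deduce network state from aggregate streams is to classify each monitor's HTTP check and then combine the per-location results. The Python below is a hedged sketch: the thresholds, the quorum rule, and the names `classify` and `deduce_service_state` are assumptions chosen for illustration, not the rules used in the talk.

```python
from collections import Counter
from typing import Dict, Optional


def classify(status: Optional[int], latency_s: Optional[float],
             slow_after_s: float = 2.0) -> str:
    """Turn one HTTP check result into a state string.

    status is None when no response arrived at all.
    """
    if status is None or latency_s is None:
        return "critical"
    if status >= 500:
        return "critical"
    if status >= 400 or latency_s > slow_after_s:
        return "warning"
    return "ok"


def deduce_service_state(observations: Dict[str, str],
                         quorum: float = 0.5) -> str:
    """Combine per-location states into one view of the service.

    A single failing location suggests a network problem near that
    monitor; failures from more than `quorum` of locations suggest
    the service itself is down for users.
    """
    if not observations:
        return "unknown"
    counts = Counter(observations.values())
    if counts["critical"] / len(observations) > quorum:
        return "critical"
    if counts["critical"] or counts["warning"]:
        return "warning"
    return "ok"


by_location = {
    "us-east": classify(200, 0.3),
    "eu-west": classify(None, None),   # timed out
    "ap-south": classify(200, 0.9),
}
overall = deduce_service_state(by_location)
```

Requiring agreement between locations is what separates "the network near one monitor is noisy" from "users cannot reach us".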

Page 15: Distributed monitoring

Demo Time

• Ad hoc demo

– Grep the stream

– Quickly analyze state of disk utilization

• Hadoop global state

– It just pipes Nagios data!

• Network monitoring demo

– Let’s combine Pingdom + network monitors

– And iterate toward an awesome dashboard!

Page 16: Distributed monitoring

Distributed Gotchas

• Riemann can scale, but with some nasty surprises

– Events on a TCP connection are processed serially

– If event rate gets too high, stream gets saturated and backs up into OS network buffers, then into Netty’s unbounded buffers. This ultimately starves heap and crashes Riemann.

– Solution is to use large connection pools at the clients that push events
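
The connection-pool workaround can be sketched on the client side as a round-robin pool, so that no single TCP connection carries the whole event rate. This Python outline is an assumption-laden illustration: `RoundRobinPool` and the `connect` factory are hypothetical, and a real client would return actual sockets or Riemann client objects from `connect`.

```python
import itertools
import threading
from typing import Callable, List


class RoundRobinPool:
    """Spread events across several connections in turn."""

    def __init__(self, connect: Callable[[], object], size: int) -> None:
        if size < 1:
            raise ValueError("pool size must be at least 1")
        # Open every connection up front.
        self._connections: List[object] = [connect() for _ in range(size)]
        self._cycle = itertools.cycle(self._connections)
        self._lock = threading.Lock()

    def send(self, event: dict) -> None:
        # Only picking the next connection is serialized; the send
        # itself runs outside the lock.
        with self._lock:
            connection = next(self._cycle)
        connection.send(event)


class FakeConnection:
    """Stand-in for a real TCP connection; records what was sent."""

    def __init__(self) -> None:
        self.sent: List[dict] = []

    def send(self, event: dict) -> None:
        self.sent.append(event)


opened: List[FakeConnection] = []


def connect() -> FakeConnection:
    connection = FakeConnection()
    opened.append(connection)
    return connection


pool = RoundRobinPool(connect, size=4)
for n in range(8):
    pool.send({"host": "web1", "service": "load", "metric": n})
```

One trade-off to keep in mind: events sent over different connections may be processed out of order, so the timestamp on each event matters more than arrival order.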

Page 17: Distributed monitoring

Distributed Gotchas

• Network outages and partitions are difficult

– Riemann must not go down

– Riemann must deal with split-brain

• Highly available SRE solution planned

– Virtual IP, heartbeat (similar to LB solution)

• Riemann servers in separate locations

– End up with two masters on partition => double the alerts but at least we get something

Page 18: Distributed monitoring

Are we cutting the knot?