The Dark Art of Production Alerting
Transcript of The Dark Art of Production Alerting
![Page 1: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/1.jpg)
The Dark Art of Building a Production Incident System
@AloisReitbauer
www.ruxit.com
![Page 2: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/2.jpg)
No broken cables
![Page 3: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/3.jpg)
No datacenter fires
![Page 4: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/4.jpg)
Other things can happen as well
Continuous deployments
Infrastructure changes
Other “everyday” stuff
![Page 5: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/5.jpg)
Scaling an incident system
![Page 6: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/6.jpg)
How it feels to do what we do
![Page 7: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/7.jpg)
Do you alert?
Typical error rate of 3 percent at 10,000 transactions/min
During the night we now have 5 errors in 100 requests.
![Page 8: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/8.jpg)
Do you alert?
Typical response time has been around 300 ms.
Now we see response times up to 600 ms.
![Page 9: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/9.jpg)
We are good at fixing problems, but not really good at detecting them.
![Page 10: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/10.jpg)
How can we get better?
![Page 11: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/11.jpg)
It’s all about statistics
![Page 12: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/12.jpg)
Statistics is about objectively lying to yourself in a meaningful way.
![Page 13: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/13.jpg)
How to design an incident
![Page 14: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/14.jpg)
It looks really simple
Which metric to pick?
How to calculate this value?
How to get this baseline?
How to define that this happened?
![Page 15: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/15.jpg)
Which metrics to pick?
![Page 16: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/16.jpg)
Three types of metrics
Capacity Metrics: Define how much of a resource is used.
Discrete Metrics: Simple countable things, like errors or users.
Continuous Metrics: Metrics represented by a range of values at any given time.
![Page 17: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/17.jpg)
Capacity Metrics
Good for capacity planning, not so good for production alerting
![Page 18: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/18.jpg)
Connection Pools
![Page 19: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/19.jpg)
Better use
Connection acquisition time
Tells you whether anyone needed a connection and did not get it.
![Page 20: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/20.jpg)
CPU Usage
![Page 21: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/21.jpg)
Better use
Combination of Load Average and CPU usage
Even better: correlate them with the response times of applications
![Page 22: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/22.jpg)
Discrete Metrics
Pretty easy to track and analyze.
![Page 23: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/23.jpg)
Continuous Metrics
Require some extra work as they are not that easy to track.
![Page 24: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/24.jpg)
Continuous Metrics – The hope
42
![Page 25: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/25.jpg)
Continuous Metrics – The reality
![Page 26: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/26.jpg)
What the average tells us
![Page 27: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/27.jpg)
What the median tells us
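The contrast between the two slides above can be sketched with a few lines of Python. The sample values are hypothetical (they are not from the talk) and are chosen so that a few very slow requests dominate the average while the median stays close to what most users see.

```python
from statistics import mean, median

# Hypothetical response times in ms: most requests are fast, two are very slow.
response_times_ms = [280, 290, 300, 300, 310, 320, 330, 2500, 4000]

avg = mean(response_times_ms)    # pulled upwards by the two outliers
med = median(response_times_ms)  # the value the "typical" request sees

print(f"average: {avg:.0f} ms")
print(f"median:  {med:.0f} ms")
```

In this made-up sample the average is roughly three times the median, although seven of the nine requests finished in about 300 ms.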
![Page 28: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/28.jpg)
How to get a baseline?
![Page 29: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/29.jpg)
A baseline is not a number
Baselines define the range of a value combined with a probability
![Page 30: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/30.jpg)
Normal distribution as baseline
Mean: 500 ms
Std. Dev.: 100 ms
68 %: 400 ms – 600 ms
95 %: 300 ms – 700 ms
99.7 %: 200 ms – 800 ms
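As a rough sketch of how such a baseline could be derived, the snippet below computes the ranges for one, two and three standard deviations around the mean, assuming the response times really are normally distributed. The mean and standard deviation are the values from the slide; the function names are illustrative only.

```python
from math import erf, sqrt

MEAN_MS = 500.0     # mean from the slide
STD_DEV_MS = 100.0  # standard deviation from the slide

def baseline_range(k):
    """Range of mean +/- k standard deviations, in ms."""
    return (MEAN_MS - k * STD_DEV_MS, MEAN_MS + k * STD_DEV_MS)

def coverage(k):
    """Share of a normal distribution within +/- k standard deviations."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    low, high = baseline_range(k)
    print(f"{coverage(k):.1%} of values expected between {low:.0f} ms and {high:.0f} ms")
```

The sketch only holds if the normal model fits the data, which is exactly the assumption questioned on the following slides.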
![Page 31: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/31.jpg)
This can go really wrong
“Why alerts suck and monitoring solutions need to become better”
![Page 32: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/32.jpg)
How this leads to false alerts
![Page 33: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/33.jpg)
Many false alerts
Aggressive Baseline
![Page 34: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/34.jpg)
No alerts at all
Moderate Baseline
![Page 35: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/35.jpg)
Find the right distribution model
However, this can be really hard to impossible
![Page 36: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/36.jpg)
Your distribution might look like this
![Page 37: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/37.jpg)
… or like this
![Page 38: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/38.jpg)
or completely different
you never know …
![Page 39: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/39.jpg)
How can we solve this problem?
![Page 40: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/40.jpg)
Normal distribution - again
Median: 50 percent of requests are faster than μ
97th percentile: about 97.7 percent of requests are faster than μ + 2σ
![Page 41: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/41.jpg)
The 50th and 90th percentile define normal behavior without needing to know anything about the distribution model
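A minimal sketch of a percentile-based baseline, assuming a simple nearest-rank definition of a percentile. The sample data is hypothetical and deliberately skewed; nothing in the code depends on the shape of the distribution.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it (p is an integer)."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, so no floating point rounding is involved.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical, heavily skewed response times in ms.
samples_ms = [120, 150, 180, 200, 220, 250, 300, 400, 900, 2500]

p50 = percentile(samples_ms, 50)
p90 = percentile(samples_ms, 90)
print(f"50th percentile: {p50} ms, 90th percentile: {p90} ms")
```

Other percentile definitions (for example with interpolation) give slightly different numbers on small samples; the point here is only that the baseline comes directly from the observed data.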
![Page 42: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/42.jpg)
Median shows the real problem
![Page 43: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/43.jpg)
How to define non-normal behavior?
![Page 44: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/44.jpg)
Fortunately, this is not the problem we need to solve
We are only talking about missed expectations
![Page 45: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/45.jpg)
Let’s look at two scenarios
Errors
Is a certain error rate likely to happen or not?
Response Times
Is a certain increase in response time significant
enough to trigger an incident?
![Page 46: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/46.jpg)
The error rate scenario
We have a typical error rate of 3 percent at 10,000 transactions/minute.
During the night we now have 5 errors in 100 requests. Should we alert – or not?
![Page 47: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/47.jpg)
What can we learn
![Page 48: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/48.jpg)
Statistics is everywhere
![Page 49: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/49.jpg)
Binomial Distribution
Tells us how likely it is to see n successes in a certain number of trials
![Page 50: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/50.jpg)
How many errors are ok?
[Chart: likelihood of at least n errors, for n = 1 to 19]
There is an 18 % probability of seeing 5 or more errors, which is within 2 standard deviations. We do not alert.
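The 18 % figure can be reproduced with the binomial distribution, treating each of the 100 requests as an independent trial with a 3 percent error probability. This is a sketch, not the implementation used in any product; the `ALERT_THRESHOLD` value is a hypothetical cut-off chosen only for illustration.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of seeing k or more 'successes' in n independent trials,
    each with success probability p (upper tail of the binomial distribution)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

TYPICAL_ERROR_RATE = 0.03  # 3 percent, from the scenario
REQUESTS = 100             # requests observed during the night
ERRORS = 5                 # errors observed
ALERT_THRESHOLD = 0.05     # hypothetical: alert only if the observation is this unlikely

likelihood = prob_at_least(ERRORS, REQUESTS, TYPICAL_ERROR_RATE)
should_alert = likelihood < ALERT_THRESHOLD

print(f"P(at least {ERRORS} errors in {REQUESTS} requests) = {likelihood:.0%}")
print("alert" if should_alert else "do not alert")
```

Seeing 5 errors in 100 requests is therefore not unusual at a 3 percent error rate, even though the observed rate of 5 percent looks higher at first glance.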
![Page 51: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/51.jpg)
Response Time Example
Our median response time is 300 ms and we measure:
200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms
![Page 52: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/52.jpg)
Percentile Drift Detection
![Page 53: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/53.jpg)
Did the median drift significantly?
Check all values above 300 ms:
200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms
7 values are higher than the median. Is this normal?
We can again use the Binomial Distribution
![Page 54: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/54.jpg)
Applying the Binomial Distribution
We have a 50 percent likelihood of seeing values above the median.
How likely is it that 7 or more out of 10 samples are higher?
The probability is 17 percent, so we should not alert.
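The drift check can be sketched as follows, using the ten measurements and the 300 ms median from the example. Each sample is treated as a coin flip that lands above the median with probability 0.5; `ALERT_THRESHOLD` is again a hypothetical cut-off, not a value from the talk.

```python
from math import comb

BASELINE_MEDIAN_MS = 300
samples_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]
ALERT_THRESHOLD = 0.05  # hypothetical: alert only if the observation is this unlikely

n = len(samples_ms)
above = sum(1 for s in samples_ms if s > BASELINE_MEDIAN_MS)

# Probability that 'above' or more of n samples exceed the median,
# given that each sample does so with probability 0.5.
likelihood = sum(comb(n, i) for i in range(above, n + 1)) / 2**n
should_alert = likelihood < ALERT_THRESHOLD

print(f"{above} of {n} samples above the median, probability {likelihood:.0%}")
print("alert" if should_alert else "do not alert")
```

The same check works for any other percentile by replacing 0.5 with the share of samples expected above it, for example 0.1 for the 90th percentile.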
![Page 55: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/55.jpg)
… and we are done!
Which metric to pick?
How to calculate this value?
How to get this baseline?
How to define that this happened?
![Page 56: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/56.jpg)
This was just the beginning
There are many more useful things about statistics, probabilities, testing, …
![Page 58: The Dark Art of Production Alerting](https://reader036.fdocuments.net/reader036/viewer/2022062307/55838d71d8b42a282c8b4eca/html5/thumbnails/58.jpg)
Image Credits
http://commons.wikimedia.org/wiki/File:Network_switches.jpg
http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg
http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg
http://commons.wikimedia.org/wiki/File:Estacaobras.jpg
http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg
http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG
http://commons.wikimedia.org/wiki/File:Dice_02138.JPG
http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg