The Dark Art of Building a Production Incident System

  • Date posted: 19-Jun-2015
  • Category: Technology

Transcript of The Dark Art of Building a Production Incident System

  • 1. The Dark Art of Building a Production Incident System @Alois Reitbauer Tech. Evangelist & Product Mgr., Compuware

  • 2. No broken cables
  • 3. No datacenter fires
  • 4. Other things can happen as well: continuous deployments, infrastructure changes, other everyday stuff
  • 5. Scaling an incident system
  • 6. How it feels to do what we do
  • 7. Do you alert? Typical error rate of 3 percent at 10,000 transactions/min. During the night we now have 5 errors in 100 requests.
  • 8. Do you alert? Typical response time has been around 300 ms. Now we see response times up to 600 ms.
  • 9. We are good at fixing problems, but not really good at detecting them.
  • 10. How can we get better?
  • 11. It is all about statistics
  • 12. Statistics is about objectively lying to yourself in a meaningful way.
  • 13. How to design an incident
  • 14. It looks really simple. How to calculate this value? Which metric to pick? How to get this baseline? How to define that this happened?
  • 15. Which metrics to pick?
  • 16. Three types of metrics. Capacity Metrics: define how much of a resource is used. Discrete Metrics: simple countable things, like errors or users. Continuous Metrics: metrics represented by a range of values at any given time.
  • 17. Capacity Metrics: good for capacity planning, not so good for production alerting.
  • 18. Connection Pools
  • 19. Better use: connection acquisition time. It tells you whether anyone needed a connection and did not get it.
  • 20. CPU Usage
  • 21. Better use: a combination of load average and CPU usage. Even better: correlate them with the response times of applications.
  • 22. Discrete Metrics: pretty easy to track and analyze.
  • 23. Continuous Metrics: require some extra work as they are not that easy to track.
  • 24. Continuous Metrics: the hope
  • 25. Continuous Metrics: the reality
  • 26. What the average tells us
  • 27. What the median tells us
  • 28. How to get a baseline?
  • 29. A baseline is not a number. Baselines define the range of a value combined with a probability.
  • 30. Normal distribution as baseline. Mean: 500 ms, Std. Dev.: 100 ms. 68 %: 400 ms to 600 ms. 95 %: 300 ms to 700 ms. 99 %: 200 ms to 800 ms.
  • 31. This can go really wrong. Why alerts suck and monitoring solutions need to become better.
  • 32. How this leads to false alerts
  • 33. Aggressive baseline: many false alerts
  • 34. Moderate baseline: no alerts at all
  • 35. Find the right distribution model. However, this can be really hard to impossible.
  • 36. Your distribution might look like this
  • 37. or like this
  • 38. or completely different, you never know
  • 39. How can we solve this problem?
  • 40. Normal distribution, again. 50 percent slower than the median. 97.6 percent slower than μ + 2σ. 97th percentile.
  • 41. The 50th and 90th percentile define normal behavior without needing to know anything about the distribution model.
  • 42. Median shows the real problem
  • 43. How to define non-normal behavior?
  • 44. Fortunately this is not the problem we need to solve. We are only talking about missed expectations.
  • 45. Let's look at two scenarios. Errors: is a certain error rate likely to happen or not? Response times: is a certain increase in response time significant enough to trigger an incident?
  • 46. The error rate scenario. We have a typical error rate of 3 percent at 10,000 transactions/minute. During the night we now have 5 errors in 100 requests. Should we alert or not?
  • 47. What can we learn
  • 48. Statistics is everywhere
  • 49. Binomial Distribution: tells us how likely it is to see n successes in a certain number of trials.
  • 50. How many errors are OK? Likeliness of at least n errors (chart, n = 1 to 19): 18 % probability to see 5 or more errors, which is within 2 times Std. Deviation. We do not alert.
  • 51. Response Time Example. Our median response time is 300 ms and we measure 200 ms, 500 ms, 400 ms, 150 ms, 350 ms, 350 ms, 200 ms, 400 ms, 600 ms, 600 ms.
  • 52. Percentile Drift Detection
  • 53. Did the median drift significantly? Check all values above 300 ms: 200 ms, 500 ms, 400 ms, 150 ms, 350 ms, 350 ms, 200 ms, 400 ms, 600 ms, 600 ms. 7 values are higher than the median. Is this normal? We can again use the Binomial Distribution.
  • 54. Applying the Binomial Distribution. We have a 50 percent likeliness to see values above the median. How likely is it that at least 7 out of 10 samples are higher? The probability is 17 percent, so we should not alert.
  • 55. And we are done! How to calculate this value? Which metric to pick? How to get this baseline? How to define that this happened?
  • 56. This was just the beginning. There are many more useful things about statistics, probabilities, testing, ...
  • 57. Alois Reitbauer, alois.reitbauer@compuware.com, @AloisReitbauer, apmblog.compuware.com
  • 58. Image Credits
    http://commons.wikimedia.org/wiki/File:Network_switches.jpg
    http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg
    http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg
    http://commons.wikimedia.org/wiki/File:Estacaobras.jpg
    http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg
    http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG
    http://commons.wikimedia.org/wiki/File:Dice_02138.JPG
    http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg
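
The two binomial checks from the error rate scenario and the percentile drift example can be reproduced in a few lines. This is only a sketch of the arithmetic behind the 18 percent and 17 percent figures quoted in the slides. The function names and the 5 percent alert threshold are assumptions made for illustration and do not come from the talk, which judges the error case by "within 2 times Std. Deviation".

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more hits in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def should_alert(k: int, n: int, p: float, threshold: float = 0.05) -> bool:
    """Alert only if the observation is unlikely under the baseline.

    The 0.05 threshold is a hypothetical choice, not taken from the slides.
    """
    return prob_at_least(k, n, p) < threshold


# Error rate scenario: baseline error rate 3 percent, 5 errors in 100 requests.
p_errors = prob_at_least(5, 100, 0.03)

# Percentile drift: baseline median 300 ms, ten new measurements from the slides.
samples_ms = [200, 500, 400, 150, 350, 350, 200, 400, 600, 600]
above_median = sum(1 for value in samples_ms if value > 300)
# By definition, each sample has a 50 percent chance of exceeding the median.
p_drift = prob_at_least(above_median, len(samples_ms), 0.5)

print(f"P(5 or more errors in 100 requests) = {p_errors:.3f}")
print(f"{above_median} of {len(samples_ms)} samples above median, "
      f"P(at least that many) = {p_drift:.3f}")
print("alert on errors:", should_alert(5, 100, 0.03))
print("alert on drift:", should_alert(above_median, len(samples_ms), 0.5))
```

Both probabilities come out well above the assumed threshold, which matches the "do not alert" conclusion in each scenario.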
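
The baseline slides contrast a normal-distribution model (mean 500 ms, standard deviation 100 ms) with a percentile baseline built from the 50th and 90th percentile. The sketch below shows both side by side. The `history_ms` values are made up for illustration, and the helper names are assumptions rather than anything from the talk.

```python
from statistics import mean, quantiles


def normal_ranges(mu: float, sigma: float) -> dict[int, tuple[float, float]]:
    """Ranges mu +/- k*sigma for k = 1, 2, 3 (roughly 68, 95 and 99 percent)."""
    return {k: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}


def percentile_baseline(samples: list[float]) -> tuple[float, float]:
    """Return (50th percentile, 90th percentile) with no distribution model."""
    deciles = quantiles(samples, n=10, method="inclusive")  # 10th ... 90th
    return deciles[4], deciles[8]


# Normal model from the slides: mean 500 ms, std. dev. 100 ms.
print(normal_ranges(500, 100))

# Hypothetical response times in ms with one slow outlier (invented data).
history_ms = [120, 150, 180, 200, 250, 300, 300, 350, 400, 450, 2500]
p50, p90 = percentile_baseline(history_ms)
print(f"mean = {mean(history_ms):.1f} ms, median = {p50} ms, p90 = {p90} ms")
```

With this invented data, the single outlier pulls the mean far above the median, while the two percentiles still describe what most requests experienced.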