Efficient Monitoring and Root Cause Analysis in Complex Systems · 2019-11-14 · Efficient...
Efficient Monitoring and Root Cause Analysis in
Complex Systems
Witek Bedyk
Agenda
● Benefits of robust monitoring
● Measurements vs. Alarms
● Importance of Alarms Correlation
● Effective Alerting
● Self-healing
Why is Monitoring useful?
● Improve system / application uptime
● Reduce administration burden
● Resource optimization
● Prevent bottlenecks
● Make use of collected data (e.g. billing)
Use Case
Customer escalation:
“We have cloud outage! Keystone is flapping up and down continuously and many requests get 503 service unavailable error.”
Healthcheck
Simple HTTP endpoint up or down checks on services.
http_status [0, 1]
http_response_time
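A minimal sketch of such a check, using only the Python standard library. It assumes that `http_status` is 0 when the endpoint answers successfully and 1 otherwise, which is consistent with the `http_status > 0` alarm expression shown later in the deck; the function name `check_endpoint` and the `opener` parameter are hypothetical and exist only to make the probe easy to test.

```python
import time
import urllib.error
import urllib.request


def check_endpoint(url, opener=urllib.request.urlopen, timeout=5.0):
    """Probe `url` once and return (http_status, http_response_time).

    Assumed convention: http_status is 0 for a 2xx/3xx answer, 1 otherwise.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            up = 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        up = False
    elapsed = time.monotonic() - start
    return (0 if up else 1), elapsed
```

Calling `check_endpoint` periodically against a service URL yields the two measurements named on the slide; the response time is reported in seconds.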
Metrics
● Metrics measure and report on quantifiable data from your system
● cpu, memory, network, filesystem, disk IO
● Services
○ MySQL, RabbitMQ, Apache, MemcacheD, etc.
● LibVirt, Open vSwitch
● Applications:
○ StatsD, Prometheus
● Custom checks
Dimensions
● Dimensions are a dictionary of key, value pairs used to describe metrics.
● hostname
● service
● component
● url
● device
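As an illustration, a measurement with its dimensions could be modelled as below. This is a sketch, not the data model of any particular monitoring service: the `Metric` class, its `matches` helper, and the host names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metric:
    name: str
    value: float
    # Dimensions: key/value pairs that describe where the value came from.
    dimensions: dict = field(default_factory=dict)

    def matches(self, **wanted):
        """True if every requested key/value pair is present in dimensions."""
        return all(self.dimensions.get(k) == v for k, v in wanted.items())


# Hypothetical samples: the same metric name reported from two hosts.
samples = [
    Metric("http_status", 1, {"hostname": "node-a", "service": "keystone"}),
    Metric("http_status", 0, {"hostname": "node-b", "service": "keystone"}),
    Metric("cpu.idle_perc", 85.0, {"hostname": "node-a", "service": "monitoring"}),
]

# Dimensions let one metric name be sliced by host, service, and so on.
on_node_a = [m.name for m in samples if m.matches(hostname="node-a")]
```

The same metric name can therefore describe many series, one per distinct combination of dimension values.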
Transaction-level vs. System-level metrics
● Transaction-level: end user perspective
○ Is Horizon working correctly?
● System-level: administrator perspective
○ Reveals failures of service components
Dependencies
MySQL
MemcacheD
Keystone
Apache
Gathered metrics
http_status
http_response_time
apache.net.hits
apache.performance.idle_worker_count
mysql.performance.open_files
mysql.net.connections
memcache.curr_connections
memcache.get_misses_rate
process.cpu_perc
process.open_file_descriptors
Dashboards
Alarms
Status of the system or resource meets criteria indicating an action is required.
Alarm definitions
● Alarm definitions are templates specifying how alarms should be created.
● grouping
○ http_status > 0, match_by: ["service", "component", "hostname", "url"]
● filtering
○ avg(cpu.idle_perc{service=monitoring}) < 20
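The two ideas can be sketched in plain Python. This is an illustration of grouping and filtering over dimensions, not the evaluation engine of any monitoring service; the helper names `group_by` and `filter_by` and the sample data are hypothetical, and treating a group as "in alarm" when any sample exceeds 0 is an assumption.

```python
from collections import defaultdict
from statistics import mean


def filter_by(measurements, **required):
    """Keep measurements whose dimensions contain every required key/value."""
    return [
        (dims, value)
        for dims, value in measurements
        if all(dims.get(k) == v for k, v in required.items())
    ]


def group_by(measurements, match_by):
    """Split measurements into one series per combination of match_by values."""
    groups = defaultdict(list)
    for dims, value in measurements:
        groups[tuple(dims.get(k) for k in match_by)].append(value)
    return dict(groups)


# Hypothetical sample data: (dimensions, value) pairs.
http_status = [
    ({"service": "keystone", "hostname": "node-a"}, 1),
    ({"service": "keystone", "hostname": "node-a"}, 1),
    ({"service": "keystone", "hostname": "node-b"}, 0),
]
cpu_idle = [
    ({"service": "monitoring", "hostname": "node-a"}, 10.0),
    ({"service": "monitoring", "hostname": "node-a"}, 20.0),
    ({"service": "keystone", "hostname": "node-a"}, 90.0),
]

# Grouping: one alarm per (service, hostname) combination.
grouped_alarms = {
    key: any(v > 0 for v in values)
    for key, values in group_by(http_status, ["service", "hostname"]).items()
}

# Filtering: only service=monitoring samples feed avg(cpu.idle_perc) < 20.
monitoring = filter_by(cpu_idle, service="monitoring")
cpu_alarm = mean(v for _, v in monitoring) < 20
```

Grouping turns one definition into a separate alarm per host, while filtering narrows a definition to the series that carry a given dimension value.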
Use case (alarms)
Keystone API is down on node A.
Keystone API is down on node A.
Keystone API is down on node A.
Keystone API is down on node A.
Keystone API is up on node A.
Keystone API is up on node A.
MemcacheD number of connections is high on node A.
Keystone API is up on node A.
Keystone API is up on node A.
MemcacheD hit rate is low on node A.
Alarms correlation
● “80% of the mean time to repair is wasted on trying to locate the issue” (Gartner)
● Remove noise from the environment
● Alerts should:
○ be meaningful
○ be actionable
○ indicate the point of failure
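One very simple form of noise removal is to collapse consecutive repeats of the same alarm message. The sketch below applies that to the alarm stream from the use case slide; it is only an illustration of the idea, not how Vitrage or any other correlation service works, and `collapse_repeats` is a hypothetical helper.

```python
from itertools import groupby


def collapse_repeats(events):
    """Merge runs of identical consecutive events into (event, count) pairs."""
    return [(event, sum(1 for _ in run)) for event, run in groupby(events)]


alarm_stream = [
    "Keystone API is down on node A.",
    "Keystone API is down on node A.",
    "Keystone API is down on node A.",
    "Keystone API is down on node A.",
    "Keystone API is up on node A.",
    "Keystone API is up on node A.",
    "MemcacheD number of connections is high on node A.",
    "Keystone API is up on node A.",
    "Keystone API is up on node A.",
    "MemcacheD hit rate is low on node A.",
]

collapsed = collapse_repeats(alarm_stream)
```

Collapsing shortens the stream but still does not say which alarm points at the failure; that requires knowledge of how the components depend on each other.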
Vitrage
● OpenStack Root Cause Analysis service
● organize alarms
○ define relationships between alarms
○ represent as an entity graph
● analyze
○ represent system health
● find root cause
○ graphical visualization
Dependencies
Keystone cluster
Keystone instances
MemcacheD
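A dependency graph like this can be used to narrow a set of active alarms down to likely root causes. The sketch below assumes the edges run from the Keystone cluster to its instances and from each instance to its local MemcacheD; the entity names, the `DEPENDS_ON` mapping, and the `root_causes` helper are hypothetical.

```python
# Assumed edges: each entity maps to the entities it directly depends on.
DEPENDS_ON = {
    "keystone-cluster": ["keystone-node-a", "keystone-node-b"],
    "keystone-node-a": ["memcached-node-a"],
    "keystone-node-b": ["memcached-node-b"],
    "memcached-node-a": [],
    "memcached-node-b": [],
}


def root_causes(alarmed, depends_on):
    """Return alarmed entities none of whose direct dependencies are alarmed.

    An alarmed entity with an alarmed dependency is treated as a symptom.
    Only direct dependencies are inspected in this sketch.
    """
    alarmed = set(alarmed)
    return sorted(
        entity
        for entity in alarmed
        if not any(dep in alarmed for dep in depends_on.get(entity, []))
    )


active = ["keystone-cluster", "keystone-node-a", "memcached-node-a"]
causes = root_causes(active, DEPENDS_ON)
```

With alarms active on the cluster, one Keystone instance, and its MemcacheD, only the MemcacheD entity has no alarmed dependency, so it is reported as the candidate root cause.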
Monitor Analyze Plan Execute (MAPE)
Managed Resource
Sensors → Monitor → Analyze → Plan → Execute → Effectors
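One pass through the loop can be sketched as a function that wires the four phases together. This is a minimal illustration: the threshold of 1000 connections, the symptom and action names, and all function names are hypothetical.

```python
def mape_iteration(sensor, analyze, plan, effector):
    """Run Monitor, Analyze, Plan, Execute once; return the action taken."""
    measurement = sensor()           # Monitor: read from the managed resource
    symptom = analyze(measurement)   # Analyze: decide whether something is wrong
    if symptom is None:
        return None
    action = plan(symptom)           # Plan: choose a corrective action
    if action is None:
        return None
    effector(action)                 # Execute: apply it to the managed resource
    return action


def analyze_connections(connections, limit=1000):
    # Hypothetical rule: too many open connections is a symptom.
    return "memcached_connections_high" if connections > limit else None


def plan_for(symptom):
    # Hypothetical mapping from symptom to corrective action.
    return {"memcached_connections_high": "restart_memcached"}.get(symptom)


executed = []
taken = mape_iteration(
    sensor=lambda: 1200,
    analyze=analyze_connections,
    plan=plan_for,
    effector=executed.append,
)
```

Keeping the phases as separate callables mirrors the diagram: sensors and effectors touch the managed resource, while analysis and planning stay independent of it.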
Vitrage Templates
● Vitrage Templates are used to express Condition → Action scenarios.
● if <condition> then raise deduced alarm
● if <condition> then set deduced state
● if <condition> then add causal relationship (used for RCA capability)
● if <condition> then execute Mistral workflow
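The condition → action idea can be illustrated in plain Python. This is not Vitrage template syntax and does not use the Vitrage API; the `Scenario` class, the alarm names, and the shape of `state` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    condition: Callable[[dict], bool]
    action: Callable[[dict], None]


def apply_scenarios(scenarios, state):
    """Run every scenario once; return how many conditions held."""
    fired = 0
    for scenario in scenarios:
        if scenario.condition(state):
            scenario.action(state)
            fired += 1
    return fired


def memcached_alarm_active(state):
    return "memcached_connections_high" in state["alarms"]


def raise_deduced_alarm(state):
    # "if <condition> then raise deduced alarm"
    state["alarms"].add("keystone_degraded")


def add_causal_relationship(state):
    # "if <condition> then add causal relationship"
    state["causes"].append(("memcached_connections_high", "keystone_degraded"))


state = {"alarms": {"memcached_connections_high"}, "causes": []}
scenarios = [
    Scenario(memcached_alarm_active, raise_deduced_alarm),
    Scenario(memcached_alarm_active, add_causal_relationship),
]
fired = apply_scenarios(scenarios, state)
```

The recorded causal pairs are what a root cause view would later traverse, from the deduced alarm back to the alarm that caused it.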
Self-healing
Keystone cluster
Keystone instances
MemcacheD
OpenStack Healthcheck APIs
● more detailed checks would be useful for most OpenStack services
● common middleware should get implemented in Oslo
● existing old effort:
○ https://storyboard.openstack.org/#!/story/2001439
○ https://review.opendev.org/617924
Summary
● Robust monitoring is essential
● Measurements vs. Alarms
● Importance of Alarms Correlation
● Self-healing
Thank You 谢谢
Questions and Answers