Monitoring with Nagios and Ganglia
Maciej Lasyk, Ganglia & Nagios
Maciej Lasyk, 11. Sesja Linuksowa, Wrocław, 2014-04-06
1/25
Ganglia & Nagios
Ganglia... what?
Ganglia: clusters / groups of neurons found outside the central nervous system
2/25
Just a little about monitoring
- the need for monitoring
- measuring availability
- measuring performance
- gathering additional metrics
3/25
Monitoring is critical for HA
How to measure availability?
A = Uptime / (Uptime + Downtime)
- MTTD (Mean Time to Diagnose): the average time it takes to diagnose the problem
- MTTR (Mean Time to Repair): the average time it takes to fix a problem
- MTTF (Mean Time to Failure): the average time there is correct behavior
- MTBF (Mean Time Between Failures): the average time between different failures of the service
A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR)
4/25
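A worked example of the two availability formulas, as a minimal Python sketch. The function names and the sample numbers are illustrative assumptions, not part of the talk.

```python
def availability_from_uptime(uptime, downtime):
    """A = Uptime / (Uptime + Downtime); both values in the same time unit."""
    return uptime / (uptime + downtime)


def availability_from_means(mttf, mttd, mttr):
    """A = MTTF / MTBF, where MTBF = MTTF + MTTD + MTTR."""
    mtbf = mttf + mttd + mttr
    return mttf / mtbf


# Hypothetical service: 990 h of correct behavior between failures,
# 2 h to diagnose and 8 h to repair each failure.
example = availability_from_means(mttf=990, mttd=2, mttr=8)
print(f"availability: {example:.2%}")
```

Shortening MTTD (better monitoring) or MTTR (faster repair) both raise A without touching MTTF, which is why monitoring matters for HA.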
What should we monitor?
- hardware housing
- devices
- storage
- network
- hosts
- software (very deep hole)
Think dependencies!
5/25
When an outage hits us: don't panic!
- Notifications
- Escalations: L1, L2, L3, L4... lol ;)
  desktop support / devs / ops / networking / storage / middleware / dc / security
- Clock is ticking: it should be simple
- What if a cell phone is offline or someone is out?
6/25
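One way to picture a simple escalation chain that still works when somebody is unreachable. This is a sketch only: the Contact class, the 15-minute step and the team list are hypothetical, not taken from the talk or from Nagios.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    level: int        # 1 = L1, 2 = L2, ...
    reachable: bool   # e.g. phone is on and the person is not on leave


def pick_contact(contacts, minutes_unacknowledged, minutes_per_level=15):
    """Return the contact to notify now, or None if nobody is reachable.

    The target level rises by one for every `minutes_per_level` minutes the
    problem stays unacknowledged. Unreachable contacts are skipped.
    """
    target_level = 1 + minutes_unacknowledged // minutes_per_level
    ordered = sorted(contacts, key=lambda c: c.level)
    for contact in ordered:
        if contact.level >= target_level and contact.reachable:
            return contact
    # Nobody at or above the target level: fall back to the highest reachable one.
    for contact in reversed(ordered):
        if contact.reachable:
            return contact
    return None
```

The rule is kept deliberately small: when the clock is ticking, nobody should have to reason about who gets paged next.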
Monitoring: notifications issues
- false positives
- major events
- failover notifications?
- tolerance & critical thresholds
7/25
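A small sketch of how thresholds plus a tolerance can cut false positives: notify only when several consecutive measurements are bad. Nagios expresses a similar idea with max_check_attempts and soft/hard states. The function names and the default tolerance of 3 are assumptions made for illustration.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warning, critical):
    """Map a measurement to a state; higher values are assumed to be worse."""
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


def should_notify(values, warning, critical, tolerance=3):
    """True only if the last `tolerance` measurements were all non-OK."""
    if len(values) < tolerance:
        return False
    recent = values[-tolerance:]
    return all(classify(v, warning, critical) != OK for v in recent)
```

A single spike (for example one slow check during a backup) stays silent, while a sustained problem still raises a notification after `tolerance` checks.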
Monitoring: reporting
- baseline
- correlation between incidents and change management
- trending info
- reporting
8/25
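A baseline can be as simple as the mean and spread of historical samples, with new values compared against it. The sketch below assumes a 3-sigma rule and made-up sample data. Real baselines usually also account for time of day and day of week.

```python
from statistics import mean, stdev


def baseline(samples):
    """Return (mean, standard deviation) of historical samples."""
    return mean(samples), stdev(samples)


def deviates(value, samples, sigmas=3.0):
    """True if `value` is more than `sigmas` standard deviations from the mean."""
    mu, sigma = baseline(samples)
    return abs(value - mu) > sigmas * sigma


# Hypothetical history of requests per second for one service.
history = [100, 102, 98, 101, 99]
print(baseline(history), deviates(120, history))
```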
Monitoring: good practices
- don't NIH!
- DVCS
- testing envs
- think usability!
- passive checks
- automate, don't hardcode
- security
Last but not least...
Quis custodiet ipsos custodes? (Who will guard the guards?)
9/25
Nagios recap
Host / Services / Contacts
- hosts, hostgroups
- services, service groups
- templates
- time periods
- host and service dependencies
- regular expressions
Checks and states
- frequencies & thresholds
- scheduling downtimes
- outages and flapping
Notifications
- periods
- groups
- which states to be notified about?
- escalations / rotations
- custom notification methods
Monitoring remotes
- NRPE daemons
- checks via SSH
Web interface
- tactical overview
- availability reports
- trends
- network maps
10/25
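Checks and states come down to the plugin convention: a plugin prints one status line (optionally followed by performance data after "|") and exits with 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN. A minimal sketch of such a check follows. The load thresholds and the name check_load are illustrative assumptions.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_load(load1, warning, critical):
    """Return (exit_code, status_line) for a 1-minute load average."""
    if load1 >= critical:
        code = CRITICAL
    elif load1 >= warning:
        code = WARNING
    else:
        code = OK
    # Text after "|" is performance data: label=value;warn;crit
    line = (f"LOAD {LABELS[code]} - load1={load1:.2f} "
            f"| load1={load1:.2f};{warning};{critical}")
    return code, line


code, line = check_load(5.0, warning=4.0, critical=8.0)
print(code, line)
```

A real plugin would read the load itself (for example with os.getloadavg()), print the status line and finish with sys.exit(code) so that Nagios can read the state from the exit status.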
Networking recap
- Unicast
- Multicast
- Broadcast
11/25
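To make the three delivery modes concrete, a short sketch that classifies an IPv4 destination address using Python's standard ipaddress module. The function name and the sample addresses are assumptions. 239.2.11.71 is the multicast group found in the gmond.conf shipped with Ganglia, so check your own configuration.

```python
import ipaddress


def delivery_mode(address, network=None):
    """Classify an IPv4 destination as 'unicast', 'multicast' or 'broadcast'.

    `network` (e.g. "192.168.1.0/24") is optional and only needed to
    recognise a subnet's directed broadcast address.
    """
    ip = ipaddress.IPv4Address(address)
    if ip.is_multicast:  # 224.0.0.0/4
        return "multicast"
    if ip == ipaddress.IPv4Address("255.255.255.255"):
        return "broadcast"
    if network and ip == ipaddress.IPv4Network(network).broadcast_address:
        return "broadcast"
    return "unicast"


for addr in ("10.0.0.5", "239.2.11.71", "255.255.255.255"):
    print(addr, delivery_mode(addr))
```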
Ganglia: what is it?
Problems of big scale:
- 20k hosts with a zillion metrics probed every 10 seconds
- It is fully redundant (until you spoil it)
- It is very scalable
- Regexp searches and ad hoc creation of views :)
12/25
Ganglia architecture
13/25
Ganglia topologies
- Default multicast topology
- Deaf / mute multicast topology
- Unicast topology
- Gmetad topology
- Gmetad HA topology (active-active)
- Gmetad hierarchical topology
14/25
Ganglia RRDcached
15/25
Ganglia sFlow
16/25
Ganglia web
- grid view
- cluster view
- physical view
- host view
- compare hosts
17/25
Ganglia web (events)
- Events have a JSON-based API
- Think integration with whatever app :)
17/25
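A sketch of how an external app (say, a deploy script) might register an event. It only builds the request URL. The endpoint path api/events.php and the parameters action, start_time, summary and host_regex are given from memory of the ganglia-web documentation and should be treated as assumptions to verify against your installed version. The base URL is hypothetical.

```python
from urllib.parse import urlencode


def event_url(base_url, summary, host_regex, start_time="now"):
    """Build the URL that would add an event to ganglia-web (assumed API)."""
    query = urlencode({
        "action": "add",
        "start_time": start_time,
        "summary": summary,
        "host_regex": host_regex,
    })
    return f"{base_url.rstrip('/')}/api/events.php?{query}"


print(event_url("http://ganglia.example.com/ganglia", "Deploy 1.2", "web"))
```

Fetching that URL (for example with urllib.request.urlopen) is left out on purpose, so the sketch has no side effects.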
Ganglia web (dashboards)
- Create view -> apply as dashboard
- Create dashboard from XML
- Generate graphs and add to views
17/25
Ganglia web (graphs)
17/25
Ganglia metrics
- base / extended metrics
- own modules
- C / C++
- mod_python
- spoofing
- gmetric
- gmetric4j / Java
- Which to choose? gmetric / Python / C/C++?
18/25
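The quickest route to a custom metric is usually the gmetric command-line tool. The sketch below wraps it from Python using gmetric's --name, --value, --type, --units and --spoof options. The wrapper functions and the sample metric are assumptions for illustration.

```python
import subprocess


def gmetric_command(name, value, metric_type="double", units="", spoof=None):
    """Build the argument list for one gmetric call."""
    cmd = ["gmetric", f"--name={name}", f"--value={value}", f"--type={metric_type}"]
    if units:
        cmd.append(f"--units={units}")
    if spoof:
        # Spoofing reports the metric on behalf of another host ("IP:hostname").
        cmd.append(f"--spoof={spoof}")
    return cmd


def send_metric(*args, **kwargs):
    """Run gmetric; needs the gmetric binary and a reachable gmond."""
    subprocess.run(gmetric_command(*args, **kwargs), check=True)


print(gmetric_command("queue_depth", 42, "uint32", "jobs"))
```

Forking gmetric for every sample is fine for a handful of metrics. For many metrics at short intervals, a gmond module (Python or C/C++) avoids the process overhead.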
Ganglia and logfiles?
ganglia-logtailer
- https://bitbucket.org/maplebed/ganglia-logtailer
- parses logfiles (in real time)
- pushes data to Ganglia (via gmetric)
- yup, based on specific log formats
- yet still open source, so poke around ;)
19/25
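To show the general idea (parse a batch of log lines, turn counts into rates, hand them to gmetric), here is a sketch that does not use ganglia-logtailer's own classes. The access-log format, the regular expression and the metric names are all assumptions.

```python
import re
from collections import Counter

# Assumed log format: '... "GET /path HTTP/1.1" 200 1234'
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_status_classes(lines):
    """Count requests per status class ('2xx', '4xx', ...) in a batch of lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


def to_rates(counts, interval_seconds):
    """Turn per-interval counts into requests per second, ready for gmetric."""
    return {f"http_{cls}_per_sec": n / interval_seconds for cls, n in counts.items()}
```

Each entry of the resulting dictionary could then be pushed with one gmetric call per reporting interval.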
So... Nagios + Ganglia!
3 ways of integration:
- ganglia-web/nagios (PHP & bash based)
  https://github.com/ganglia/ganglia-web
- ganglia-nagios-bridge (Python & cron based)
  https://github.com/ganglia/ganglia-nagios-bridge
- check_ganglia_metric (Python)
  https://github.com/ganglia/ganglia_contrib
20/25
Nagios + Ganglia: ganglia-web/nagios
https://github.com/ganglia/ganglia-web
Sending Nagios data to Ganglia:
- service_perfdata_command
Or replace Nagios checks with Ganglia!
- Check heartbeat.
- Check a single metric on a specific host.
- Check multiple metrics on a specific host.
- Check multiple metrics across a regex-defined range of hosts.
Nagios pulls info from Ganglia via HTTP
21/25
Nagios + Ganglia: ganglia-nagios-bridge
- https://github.com/ganglia/ganglia-nagios-bridge
- Python script run e.g. from crontab
- pulls data from Ganglia XML via sockets
- parses XML
- sends data to Nagios
- Nagios receives the results only as passive checks
22/25
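For context on passive checks: one common way to hand a result to Nagios is a PROCESS_SERVICE_CHECK_RESULT line written to the external command file. The sketch below formats such a line. It is not necessarily how ganglia-nagios-bridge submits its results, and the host, service and file path you pass in are up to your setup.

```python
import time


def passive_result(host, service, return_code, output, timestamp=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    ts = int(time.time() if timestamp is None else timestamp)
    return (f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


def submit(command_file, line):
    """Write the command to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


print(passive_result("web01", "cpu_idle", 0, "OK - cpu_idle=93.5"), end="")
```

The service must already be defined in Nagios with passive checks enabled, otherwise the submitted result is ignored.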
Nagios + Ganglia: check_ganglia_metric
- https://pypi.python.org/pypi/check_ganglia_metric/
- basically a Nagios plugin
- pulls data from Ganglia XML via sockets
- check_ganglia_metric.py \
    --gmetad_host=gmetad-server.example.com \
    --metric_host=host.example.com --metric_name=cpu_idle
23/25
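"Pulls data from Ganglia XML via sockets" can be sketched in a few lines: connect to the daemon's XML port, read until the server closes the connection, then look up the METRIC element under the right HOST. This is not the plugin's actual code. The function names are assumptions, and port 8651 is gmetad's usual default xml_port (gmond commonly listens on 8649).

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host, port=8651, timeout=10.0):
    """Read the full XML dump that gmetad (or gmond) sends on connect."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the server closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def metric_value(xml_bytes, metric_host, metric_name):
    """Return the VAL attribute of one metric as a string, or None if absent."""
    root = ET.fromstring(xml_bytes)
    for host in root.iter("HOST"):
        if host.get("NAME") == metric_host:
            for metric in host.iter("METRIC"):
                if metric.get("NAME") == metric_name:
                    return metric.get("VAL")
    return None
```

Usage would look like metric_value(fetch_xml("gmetad-server.example.com"), "host.example.com", "cpu_idle"), with the returned string then compared against warning and critical thresholds.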
Nagios + Ganglia
Which integration should I use?
Seriously: try them yourself and test
24/25
Freenode #ganglia
https://lists.sourceforge.net/lists/listinfo/ganglia-general
24.5/25
Sources?
- "Monitoring with Ganglia" book
- also nagios.org
- and the "Web Operations" book
- plus some experience ;)
Maciej Lasyk
11. Sesja Linuksowa
2014-04-06, Wrocław
http://maciek.lasyk.info/sysop
@docent-net
Ganglia & Nagios
Thank you :)
25/25