Nagios Conference 2011 - Mike Guthrie - Distributed Monitoring With Nagios

Distributed Monitoring with Nagios: Past, Present, Future

Mike Guthrie

[email protected]

Distributed Monitoring Introduction

Basic Definition: Splitting up your monitoring server over multiple machines

Why use distributed monitoring?

Multiple sites with firewall restrictions

Large installations that exceed the CPU and memory resources that a single machine can offer.

Understanding CPU Limitations

The primary task of the Nagios Core engine is to schedule checks

Example Monitoring Server

1000 hosts, 4 services per host, 5-minute interval

Total checks = 1000 host checks + 4000 service checks = 5000 checks

Check load = (5000 checks / 5 minutes) / 60 seconds ≈ 16.7 checks per second

In 1 second: About 16 scripts or binary processes are being launched, with about 16 sets of results coming in and being processed by Nagios and written to disk.

When the check schedule exceeds CPU limitations, you get check latency
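The check-load arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of Nagios; the function name `checks_per_second` and its parameters are assumptions made for this example. It counts one host check per host plus all service checks, which is how the example reaches 5000 checks.

```python
def checks_per_second(hosts, services_per_host, interval_minutes):
    """Average number of checks that must be launched each second.

    Assumes one host check per host plus every service check,
    all scheduled at the same interval.
    """
    total_checks = hosts + hosts * services_per_host
    return total_checks / (interval_minutes * 60)


if __name__ == "__main__":
    rate = checks_per_second(hosts=1000, services_per_host=4, interval_minutes=5)
    print(f"{rate:.1f} checks per second")
```

Plugging in a planned host count and interval gives a rough sense of whether a single server is likely to keep up before latency appears.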

Picking the Right Distributed Model

Pick the right model for your environment

Think logistics: PLAN before implementation

Every hour spent planning logistics will save tens or even hundreds of man-hours later on

A 30-minute task on 1 server = 5 hours on 10 servers

Consider how to effectively view information across multiple machines

As data quantity increases, discerning useful information from it becomes more important

Viewing 10,000 hosts and 50,000 services on a page is too much raw data to be effective information

The Classic Distributed Model

Distributed servers running active checks, forwarding results to a central server

[Diagram: a central server (passive only) receiving results from several distributed servers that run active checks and forward results after every check]

The Classic Distributed Model

Central Monitoring vs Central Viewing?

OCSP vs Event Handlers

OCSP runs after every check

Event handlers run only on state changes

Freshness checking ensures current data

Child servers can also do local monitoring without forwarding results

Distributed servers can also receive passive checks and forward them along, creating a multi-level tree structure
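As an illustration of what an OCSP command might hand to the central server, the sketch below formats one service result as the tab-separated line that `send_nsca` reads on standard input (host, service description, return code, plugin output). The function name `format_passive_result` and the sample host and service names are hypothetical; the state-to-code mapping follows the standard Nagios plugin return codes.

```python
# Standard Nagios plugin return codes.
STATE_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def format_passive_result(host, service, state, output):
    """Build one tab-separated line in the form send_nsca expects on stdin."""
    # Unrecognised state names are reported as UNKNOWN.
    code = STATE_CODES.get(state.upper(), STATE_CODES["UNKNOWN"])
    # Newlines would end the record early, so flatten the plugin output.
    clean_output = " ".join(output.splitlines())
    return f"{host}\t{service}\t{code}\t{clean_output}\n"


if __name__ == "__main__":
    line = format_passive_result("web01", "HTTP", "CRITICAL", "Connection refused")
    print(line, end="")
```

On a distributed server, a line like this would typically be piped to `send_nsca`, and the central server would accept it as a passive check result.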

The Classic Distributed Model

Strengths:

Well tested, well documented, proven solution

All built into the Nagios Core package

Extremely flexible for checks, performance graphing, notifications, etc.

Can be combined with other distributed models

Challenges:

Maintaining configs on multiple machines

Which server issued the check?

Where to process/view performance data?

The Classic Distributed Model

Workarounds:

Use SVN, rsync, or cron to automatically maintain host and service configs on both distributed and central servers

Use templating as much as possible

Read the Core docs on Object Inheritance

Keep template definitions separate

Use naming conventions to keep configs organized

Nagios XI distributed tools:

Inbound and Outbound Checks

Unconfigured Objects
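The templating and naming-convention workarounds above can be combined when configs are generated rather than hand-edited. The following is a minimal sketch, assuming a host template named `generic-host` already exists in a separate template file; the function names, the `<site>-hostNN` naming scheme, and the sample addresses are all hypothetical.

```python
def render_host(host_name, address, template="generic-host"):
    """Render a Nagios host definition that inherits from a template via `use`."""
    lines = [
        "define host {",
        f"    use         {template}",
        f"    host_name   {host_name}",
        f"    address     {address}",
        "}",
    ]
    return "\n".join(lines) + "\n"


def render_site(site, addresses, template="generic-host"):
    """Render one definition per address, named <site>-hostNN by convention."""
    return "\n".join(
        render_host(f"{site}-host{i:02d}", addr, template)
        for i, addr in enumerate(addresses, start=1)
    )


if __name__ == "__main__":
    print(render_site("nyc", ["192.0.2.10", "192.0.2.11"]))
```

Because every host only states what differs from the template, the same generated file can be copied to both the distributed and the central server.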

The Cluster Model: Nagios Load Balancing

Nagios checks are managed by a sub-process and distributed evenly across multiple servers

Works like a load balancer

Two Popular Examples:

DNX: Distributed Nagios eXecutor

Mod Gearman

Check results and configs are all managed at the central server

The Cluster Model: DNX

DNX: How it works

When a check is scheduled to execute, the job is passed to a worker node

The worker node executes the check and sends the results directly to the results queue

Checks are not associated with any particular worker node

Bypasses the nagios.cmd pipe to eliminate a potential bottleneck

If a worker goes down, all checks continue
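The job-distribution idea described above can be sketched as a generic work queue. This is not DNX's actual implementation or protocol: in-process threads stand in for worker nodes, the check "execution" is faked, and every name below is hypothetical. It only shows that checks are not tied to any particular worker and that all results land in one results queue.

```python
import queue
import threading


def run_checks(checks, worker_count=3):
    """Hand each check to whichever worker is free and collect all results."""
    jobs = queue.Queue()
    results = queue.Queue()

    def worker(name):
        while True:
            check = jobs.get()
            if check is None:  # sentinel: no more work for this worker
                break
            # A real worker would run a plugin here; this fakes an OK result.
            results.put((check, name, "OK"))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for check in checks:
        jobs.put(check)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()

    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


if __name__ == "__main__":
    for check, worker_name, status in run_checks(["ping", "http", "disk", "load"]):
        print(f"{check} ran on {worker_name}: {status}")
```

Which worker picks up which check varies from run to run, which is the point: with fewer workers the same checks still complete, just with less parallelism.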

The Cluster Model: DNX

DNX: Strengths:

Central configuration management

Checks redistributed if a worker is down

Worker nodes can be added at any time

Challenges:

Performance data is still handled at the central server

If the master goes down, all checks cease

The Cluster Model: Mod Gearman

Strengths:

Central configuration management

Checks can be split by hostgroups or servicegroups, which is useful when groups are located in different network segments

Challenges:

Performance data is still handled at the central server

If the master goes down, all checks cease

Effectively viewing 10k+ services on a single machine
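Splitting checks by hostgroup, as mentioned in the strengths above, amounts to a routing decision per check. The sketch below is a simplified illustration only: the function name, the `hostgroup_<name>` queue naming, and the `default` queue are assumptions made for this example and are not taken from Mod Gearman's configuration.

```python
def pick_queue(host_groups, dedicated_groups, default_queue="default"):
    """Return the queue for a check based on the host's hostgroups.

    The first hostgroup that has a dedicated queue wins; otherwise the
    check goes to the shared default queue.
    """
    for group in host_groups:
        if group in dedicated_groups:
            return f"hostgroup_{group}"
    return default_queue


if __name__ == "__main__":
    dedicated = {"dmz", "branch-office"}
    print(pick_queue(["linux", "dmz"], dedicated))
    print(pick_queue(["linux"], dedicated))
```

A worker placed inside a given network segment would then consume only the queue for that segment's hostgroup.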

The Central Dashboard Model

Checks are executed and managed on multiple distributed servers

Central viewer unifies all servers

Central viewer polls data from each server and displays tactical data in the UI

Examples:

Nagios Fusion

MNTOS

check_MK Multisite
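The core of a central viewer is aggregation: take the state counts polled from each server and roll them up into one tactical summary. The sketch below assumes the polling has already happened and that each server reports a simple dictionary of counts; the server names, keys, and the function name `tactical_overview` are hypothetical and do not reflect how any of the tools listed above store their data.

```python
from collections import Counter


def tactical_overview(per_server_counts):
    """Sum service-state counts polled from each monitoring server."""
    total = Counter()
    for counts in per_server_counts.values():
        total.update(counts)  # adds counts key by key
    return dict(total)


if __name__ == "__main__":
    polled = {
        "east": {"ok": 950, "warning": 30, "critical": 20},
        "west": {"ok": 480, "warning": 15, "critical": 5},
    }
    print(tactical_overview(polled))
```

Keeping the per-server dictionaries around alongside the totals allows a drill-down from the summary to the server that owns a problem.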

The Central Dashboard Model: Nagios Fusion

Displays tactical overview for each server

Monitoring and object configurations compartmentalized to each server

Good for geographically distributed servers where local management is required

Unified login for all XI servers (basic auth still required for Core machines)

The Central Dashboard Model: Nagios Fusion

Strengths:

Easy to add new servers

User-level control of server views

High level overview

Very little CPU usage

Commercial solution with support

Challenges:

Not a monitoring solution by itself

Free 60-day trial; requires a license

The Central Dashboard Model: MNTOS

The Central Dashboard Model: Multisite

Single Server Distributed Parts

Not all environments require check distribution

Offload ndoutils (DB backend) to a different machine

Offload performance data processing to a different machine

Mount disk I/O-intensive files on a RAM disk

A Nagios Core install can run between 10k and 20k checks, depending on what is being checked and how it is configured

Where To Go From Here?

Future of Distributed Monitoring?

Improved information viewing instead of just raw data

Aggregated reporting and statistics

Business process views and monitoring

What do you, as admins, need to see in this area of software development?

Conclusion

Pick the right setup for your environment

Any of these models can be mixed and combined

PLAN before implementation:

Plan for efficient maintenance

One environment with 250k services overseen by a single server took almost an entire year of planning and implementation to get right

Environments can scale even larger with the right logistics planning in place

Conference Resources

Daniel Wittenberg: Scaling Nagios At A Giant Insurance Company @2pm Thursday

35,000 hosts and 1.4 million services

Mike Weber: Reducing Server Load with Mod Gearman @10:30am Friday

Dave Williams: Author of DNX
