Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring


Dave Williams' presentation on Multi-Tenant Nagios Monitoring, given during the Nagios World Conference North America, held October 13th–16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference

Transcript of Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring


October 14th, 2014 – Dave Williams

Technical Architect

Multi-Tenant Nagios Monitoring


Agenda

Background

Multi-Tenant Monitoring

Why Multi-Tenant

Multi-Tenant Design

Service Catalogue

Futures & ‘Blue Sky thinking’

Questions


Background

UK based

Mainframe (IBM & Honeywell)

Unix (HP-UX, AIX, Solaris)

Linux (RedHat, SLES, Debian)

Network (CASE, 3Com, Cisco)

Working for Bull, a French computer manufacturer

Mainframes, Unix, HPC, Security, Managed Services, Advisory Services


Background

System Monitoring

OpenView

NetView

OpenMaster

Open Source Monitoring

NetSaint on AIX

Nagios


Why Multi-Tenant?

Outsourcing Support & Monitoring

Multiple Customers
– Different levels of security
– Different hardware / software platforms

One Support Team
– Only needs to know about real problems
– Can be driven by support tickets, not by Nagios

Required 365 x 24
– Infrastructure must survive all outages without loss of service


Multi-Tenant Design

Each customer may have 2–3,000 hosts, with 10–100 services per host

Real time monitoring

Customer profile

SLA Reporting

Batch Event completion

Different SLAs for each Business Process per customer

Different alerting & escalation methods per customer


Multi-Tenant Design

Hardware Platform – Central Support

Virtualised Platform (Intel based)
– XenServer Hypervisor: allows clustering with shared storage; inexpensive licensing

Shared Storage
– NAS: QNAP appliances with underlying RAID-5 & hot-spare protection
– Network connection using dual interfaces bonded across multiple switches
– Could have used FreeNAS

LAN Infrastructure
– Dual connections to all hardware
– SNMP-managed switches


Hardware Platform – Basic Schematic


Multi-Tenant Design

Hardware Platform – Resilience

Virtualised Platform (Intel based)
– XenServer Hypervisor: allows clustering with shared storage
– If the primary node fails, the cluster will ‘spin up’ the image on the second node
– Same data / logs (shared storage)

LAN Infrastructure
– Dual connections to all hardware
– Bonded interfaces for NAS access: no data loss or loss of access on failure
– SNMP-managed switches


Hardware Setup


Multi-Tenant Design

Hardware Platform – Recovery

Virtualised Platform (Intel based)
– XenServer Hypervisor: allows clustering with shared storage
– If the primary site fails, the image is spun up at the recovery site
– Internet access fails over using BGP

Shared Storage – replicated from the primary site
– NAS: QNAP appliances with underlying RAID-5 & hot-spare protection
– Using RTRR (Real-Time Remote Replication) between sites
– Network connection using dual interfaces bonded across multiple switches

LAN Infrastructure
– Dual connections to all hardware
– Bonded interfaces for NAS access: no data loss or loss of access on failure
– SNMP-managed switches


Hardware Platform – Resilience


Hardware Platform – Customer Site

Using generic netbooks

Minimum requirement
– 1GB memory, Atom processor, Ethernet port
– Running the CentOS 6.4 64-bit operating system

Can use a Raspberry Pi for small customers
– 512MB memory, ARM processor, Ethernet port
– Running the Raspbian operating system


Software Platform – Central Site

Nagios Core

Running the latest release, 4.0.8

Using MK Livestatus for interfacing (see the query sketch below)

Using Thruk for visualisation

Graylog2 / Elasticsearch

Store all logs & syslog in a ‘Big Data’ repository using MongoDB

Asterisk PBX

Allows all alerting to use standard dial-up with speech synthesis + IVR

SMS-Client

Still using TAPI to send SMS texts to contacts
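As a hedged illustration of the Livestatus interfacing mentioned above: the sketch below sends a single LQL query over the Livestatus unix socket, which is how tools such as Thruk talk to Nagios. The socket path and the example query are assumptions, not details from the presentation.

```python
import socket

def livestatus_query(socket_path, query):
    """Send one LQL query to the MK Livestatus unix socket and return the reply."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(socket_path)
    sock.sendall(query.encode("utf-8"))
    # Livestatus answers once the client half-closes the connection.
    sock.shutdown(socket.SHUT_WR)
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks).decode("utf-8")

# Example: list every host currently DOWN (state 1).
# /var/run/nagios/rw/live is an assumed path; match it to the broker_module line.
print(livestatus_query("/var/run/nagios/rw/live",
                       "GET hosts\nColumns: name state\nFilter: state = 1\n"))
```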


Software Platform – Central Site (contd)

NRPE

Running 2.15

NSCA & NSCA-ng

Using NSCA for external communication

Using NSCA-ng for issuing remote commands

Postfix / Procmail

Used to generate emails, but also to handle responses

Routes unsolicited alerting emails (HP Insight, Pingdom); see the sketch below

OTRS

Records alerts, tracks issues
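As a hedged illustration of the NSCA path just listed: the sketch below pipes one passive service check result into send_nsca, which expects tab-separated host/service/return-code/output lines on stdin. The server name, binary and config paths, and the service name are illustrative assumptions, not details from the presentation.

```python
import subprocess

def submit_passive_result(host, service, state, output,
                          server="nagios-central.example.com",
                          config="/etc/nagios/send_nsca.cfg"):
    """Pipe one passive service check result into send_nsca.

    Input format expected by send_nsca for service results:
        host<TAB>service<TAB>return_code<TAB>plugin_output<NEWLINE>
    """
    line = f"{host}\t{service}\t{state}\t{output}\n"
    subprocess.run(["/usr/sbin/send_nsca", "-H", server, "-c", config],
                   input=line.encode("utf-8"), check=True)

# Example: forward an HP Insight hardware alert as CRITICAL (return code 2).
submit_passive_result("Custloc1-swfeltsw01", "Hardware", 2,
                      "RAID battery backup failure")
```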


Software Platform – Remote Site

Nagios Core

Running the latest release, 4.0.8

NRPE

Running 2.14

NSCA

Using NSCA for external communication

OpenVPN

Communication via IPsec VPN


Customer Multi-Tenant


Multi-Tenant Schematic


Service Catalogue

ITIL Flavour

Really just services & their characteristics


Service Catalogue

Agreed list of servers / services

With importance levels

With alerting paths

With escalation paths

Recovery options

Feeds into Service Level Agreements and Operational Level Agreements

Basis of agreed reporting structures


Examples

Basic spreadsheet plus shell script

Usually easy to create; the shell script differs for each customer, based on an initial standard script (see the sketch below)

Chef or Puppet

Use Exported Resources

Nagios Cookbook – Nagios Conference 2012 Presentation
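A minimal sketch of the spreadsheet-plus-script approach, written here in Python rather than the per-customer shell script the slide describes. It assumes the catalogue is exported to CSV with hypothetical hostname, address, service, check_command and contacts columns, and that per-customer host/service templates already exist.

```python
import csv

def generate_config(csv_path, customer):
    """Turn a CSV export of the service catalogue into Nagios object definitions."""
    hosts, services = {}, []
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            hosts[row["hostname"]] = row["address"]
            services.append(row)

    out = []
    for name in sorted(hosts):
        out.append("define host {\n"
                   f"    use        {customer}-host\n"
                   f"    host_name  {name}\n"
                   f"    address    {hosts[name]}\n"
                   "}\n")
    for row in services:
        out.append("define service {\n"
                   f"    use                 {customer}-service\n"
                   f"    host_name           {row['hostname']}\n"
                   f"    service_description {row['service']}\n"
                   f"    check_command       {row['check_command']}\n"
                   f"    contacts            {row['contacts']}\n"
                   "}\n")
    return "\n".join(out)

# Example: one generated file per customer, included from nagios.cfg.
with open("/etc/nagios/conf.d/cust1.cfg", "w") as fh:
    fh.write(generate_config("cust1_catalogue.csv", "cust1"))
```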


Multi-Tenant Issues

Naming conventions

Every customer has a server01

Customers' naming conventions are obscure

Customers have multiple physical locations or levels of security

– This gives rise to Nagios names that differ from the actual names:
– Custloc1-swfeltsw01
– Custloc2-nwfeltsw01

Not so smart when a non-Nagios-originated alert is received
– ‘swfeltsw01 – RAID battery backup failure’ from HP Insight, for example
– The external alert processor has to perform table lookups before building the appropriate NSCA command (see the sketch below)
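A toy sketch of the table lookup just described: mapping the host name carried by an external alert back to the unique Nagios host name before the NSCA line is built. The table contents are hypothetical beyond the two examples on the slide.

```python
# Hypothetical lookup table: (customer, local host name) -> unique Nagios name.
# Needed because every customer has a "server01" and locations add prefixes.
HOST_MAP = {
    ("cust1", "swfeltsw01"): "Custloc1-swfeltsw01",
    ("cust1", "nwfeltsw01"): "Custloc2-nwfeltsw01",
}

def to_nagios_name(customer, local_name):
    """Resolve an externally reported host name to its Nagios host name."""
    try:
        return HOST_MAP[(customer, local_name.lower())]
    except KeyError:
        raise LookupError(f"no Nagios host mapped for {customer}/{local_name}")

# e.g. an HP Insight alert 'swfeltsw01 - RAID battery backup failure':
host = to_nagios_name("cust1", "swfeltsw01")
nsca_line = f"{host}\tHardware\t2\tRAID battery backup failure\n"
# nsca_line can now be piped to send_nsca as in the earlier sketch.
```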


Futures & Blue Sky thinking

The Nagios visualisation is resource-heavy

All customers want their own dashboard

All customers want a different screen layout

Why not move the visualisation into the cloud?

Use an Amazon EC2 image to access the central Livestatus via https (see the sketch below)

Allow end users to authenticate

Customer portal allows ‘spin up’ & ‘spin down’ of images
– Move billing to the customer
– Scale horizontally for visualisation
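A deliberately tiny sketch of what this blue-sky idea could look like: an authenticated HTTP front end to the central Livestatus socket that a cloud-hosted dashboard image could query. Every concrete detail here (socket path, port, credentials) is an assumption, and a real deployment would sit behind TLS to provide the https the slide mentions.

```python
import base64
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

LIVE_SOCKET = "/var/run/nagios/rw/live"                 # assumed socket path
TOKEN = base64.b64encode(b"customer1:secret").decode()  # demo credentials only

class LivestatusProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        # Per-customer authentication, reduced here to HTTP Basic auth.
        if self.headers.get("Authorization") != f"Basic {TOKEN}":
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="livestatus"')
            self.end_headers()
            return
        query = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(LIVE_SOCKET)
        sock.sendall(query)
        sock.shutdown(socket.SHUT_WR)   # Livestatus replies on half-close
        reply = b""
        while True:
            data = sock.recv(4096)
            if not data:
                break
            reply += data
        sock.close()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(reply)

# TLS termination (for the https) would normally be handled by a proxy in front.
HTTPServer(("0.0.0.0", 8080), LivestatusProxy).serve_forever()
```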


Load Sharing

Plugins like check_wmi_plus put a strain on the monitoring system: a large number of queries that take wall-clock time to complete and parse.

Better to have ‘worker nodes’, via Merlin, Mod Gearman or similar, perform these functions – Raspberry Pis, for example.

No great expense to add 2–3 Pis to customer-site configurations; easy fallback if they fail – no unique locally stored data.


BPI Example


Dashboard Example


Questions?
