Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

description

Dave Williams's presentation on Multi-Tenant Nagios Monitoring. The presentation was given during the Nagios World Conference North America, held October 13th to 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference

Transcript of Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Page 1: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

1© Bull, 2014

October 14th 2014 Dave Williams

Technical Architect

Multi-Tenant Nagios Monitoring

Page 2: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Agenda

Background

Multi-Tenant Monitoring

Why Multi-Tenant

Multi-Tenant Design

Service Catalogue

Futures & ‘Blue Sky thinking’

Questions

Page 3: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Background

UK based

Mainframe (IBM & Honeywell)

Unix (HP-UX, AIX, Solaris)

Linux (RedHat, SLES, Debian)

Network (CASE, 3COM, CISCO)

Working for Bull

French Computer Manufacturer

Mainframes, Unix, HPC, Security, Managed Services, Advisory Services

Page 4: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Background

System Monitoring

OpenView

Netview

Open Master

Open Source Monitoring

NetSaint on AIX

Nagios

Page 5: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Why Multi-Tenant ?

Outsourcing Support & Monitoring

Multiple Customers

–Different levels of security
–Different Hardware / Software Platforms

One Support Team

–Only need to know about real problems
–Can be driven by support ticket not Nagios

Required 365 x 24

–Infrastructure must survive all outages without loss of service

Page 6: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Multi-Tenant Design

Each customer may have 2-3000 hosts

10-100 services per host

Real time monitoring

Customer profile

SLA Reporting

Batch Event completion

Different SLAs for each Business Process per customer

Different alerting & escalation methods per customer
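
To get a feel for the load implied by the host and service counts above, the average check rate can be estimated with simple arithmetic. This is only a rough sketch: the 300 second check interval is an assumption, not a figure from the presentation, and real schedules are not perfectly even.

```python
def checks_per_second(hosts: int, services_per_host: int, interval_s: float) -> float:
    """Average service checks per second if every service is checked once per interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return hosts * services_per_host / interval_s


# Upper-end figures from the slide; the 300 s interval is an assumed value.
rate = checks_per_second(hosts=3000, services_per_host=100, interval_s=300.0)
print(f"{rate:.0f} checks per second")
```

At the upper end of the quoted range this works out to about 1000 checks per second for a single customer, which helps explain why the later slides look at spreading work across worker nodes.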

Page 7: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Multi-Tenant Design

Hardware Platform – Central Support

Virtualised Platform (Intel based)

–XenServer Hypervisor
  –Allows clustering with shared storage
  –Inexpensive Licensing

Shared Storage

–NAS
  –Using QNAP Appliances with underlying RAID-5 & Hot Spare protection
  –Network connection using dual interfaces bound across multiple switches
  –Could have used FreeNAS

LAN Infrastructure

–Dual connections to all hardware
–SNMP managed switches

Page 8: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Hardware Platform – Basic Schematic

Page 9: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Multi-Tenant Design

Hardware Platform – Resilience

Virtualised Platform (Intel based)

–XenServer Hypervisor
  –Allows clustering with shared storage
  –If Primary node fails cluster will ‘spin up’ image on 2nd node
  –Same data / logs (Shared storage)

LAN Infrastructure

–Dual connections to all hardware
  –Bonded interfaces for NAS access – no data loss / access loss with failure
–SNMP managed switches

Page 10: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Hardware Setup

Page 11: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Multi-Tenant Design

Hardware Platform – Recovery

Virtualised Platform (Intel based)

–XenServer Hypervisor
  –Allows clustering with shared storage
  –If Primary Site fails will spin up image
  –Internet Access fails over – using BGP

Shared Storage – replicated from Prime Site

–NAS
  –Using QNAP Appliances with underlying RAID-5 & Hot Spare protection
  –Using RTRR (Real Time Remote Replication) between sites
  –Network connection using dual interfaces bound across multiple switches

LAN Infrastructure

–Dual connections to all hardware
  –Bonded interfaces for NAS access – no data loss / access loss with failure
–SNMP managed switches

Page 12: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Hardware Platform - Resilience

Page 13: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Hardware Platform – Customer Site

Using generic netbooks

Minimum requirement

–1 GB memory, Atom processor, Ethernet port
–Running CentOS 6.4 64-bit operating system

Can use Raspberry Pi for small customers

–512 MB memory, ARM processor, Ethernet port
–Running Raspbian operating system

Page 14: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Software Platform – Central Site

Nagios – Core

Running latest 4.0.8

Using MK Livestatus for interfacing

Using Thruk for Visualisation

Graylog2 / Elasticsearch

Store all logs & Syslog in ‘Big Data’ repository using MongoDB

Asterisk PBX

Allow all alerting to use standard dial-up with speech synthesis + IVR

SMS-Client

Still using TAPI to SMS Text contacts

Page 15: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Software Platform – Central Site (contd)

NRPE

Running 2.1.5

NSCA & NSCA-ng

Using NSCA for external communication

Using NSCA-ng for issuing remote commands

Postfix / Procmail

Used to generate emails but also handle responses.

Routes unsolicited alerting emails (HP Insight, Pingdom)

OTRS

Record alerts, track issues

Page 16: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Software Platform – Remote Site

Nagios – Core

Running latest 4.0.8

NRPE

Running 2.14

NSCA

Using NSCA for external communication

OpenVPN

Communication via IPSec VPN

Page 17: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Customer Multi-Tenant

Page 18: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Multi Tenant Schematic

Page 19: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Service Catalogue

ITIL Flavour

Really just services & their characteristics

Page 20: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Service Catalogue

Agreed list of servers / services

With importance levels

With alerting paths

With escalation paths

Recovery options

Feeds into Service Level Agreements and Operational Level Agreements

Basis of agreed reporting structures
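
One way to picture a catalogue entry with the characteristics listed above is as a small record per service. The following is a minimal sketch only: the field names, the importance scale and the "escalate after N notifications" rule are all illustrative assumptions, not the structure used in the presentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogueEntry:
    """One agreed server/service pair and its characteristics (hypothetical layout)."""
    customer: str
    host: str
    service: str
    importance: int                                   # assumed scale: 1 = most important
    alert_path: List[str] = field(default_factory=list)
    escalation_path: List[str] = field(default_factory=list)
    recovery: str = ""                                # free-text recovery option


def contacts_for(entry: CatalogueEntry, notification_number: int,
                 escalate_after: int = 3) -> List[str]:
    """Pick contacts for the Nth notification.

    The alert path is used first; once more than `escalate_after`
    notifications have gone out, the escalation path takes over
    (if one is defined).
    """
    if notification_number > escalate_after and entry.escalation_path:
        return entry.escalation_path
    return entry.alert_path


entry = CatalogueEntry(
    customer="custloc1", host="server01", service="Disk /",
    importance=2,
    alert_path=["support-team"],
    escalation_path=["duty-manager"],
    recovery="Extend volume or purge logs",
)
print(contacts_for(entry, 1))
print(contacts_for(entry, 4))
```

Keeping the catalogue in one structured place like this is what lets it feed the SLA, OLA and reporting items on the slide from a single source.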

Page 21: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Examples

Basic Spreadsheet plus Shell script

Usually easy to create; the shell script is different for each customer, based on an initial standard script

Chef or Puppet

Use Exported Resources

Nagios Cookbook – Nagios Conference 2012 Presentation
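
The spreadsheet approach can be sketched as a script that turns exported CSV rows into Nagios host definitions. The talk describes a shell script; Python is used here purely for illustration. The column names (customer, hostname, address), the "generic-host" template name and the customer-prefixed host_name are assumptions made for this example.

```python
import csv
import io

# Standard Nagios host object directives; literal braces are doubled for str.format.
HOST_TEMPLATE = """define host {{
    use         {template}
    host_name   {host_name}
    alias       {alias}
    address     {address}
    hostgroups  {customer}
}}
"""


def render_hosts(csv_text: str, template: str = "generic-host") -> str:
    """Render one host definition per CSV row.

    The customer name is prefixed to the host name so that two
    customers can both own a 'server01' without clashing.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    blocks = []
    for row in reader:
        blocks.append(HOST_TEMPLATE.format(
            template=template,
            host_name=f"{row['customer']}-{row['hostname']}",
            alias=row["hostname"],
            address=row["address"],
            customer=row["customer"],
        ))
    return "\n".join(blocks)


# Hypothetical spreadsheet export.
SHEET = """customer,hostname,address
custloc1,server01,192.0.2.10
custloc2,server01,192.0.2.20
"""

print(render_hosts(SHEET))
```

A per-customer variant would typically change only the template name and the columns read, which matches the slide's point about starting from a standard script.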

Page 22: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Multi Tenant Issues

Naming conventions

Every customer has a server01

Customers' naming conventions are obscure

Customers have multiple physical locations or levels of security

–This gives rise to Nagios names that differ from the actual names:
–Custloc1-swfeltsw01
–Custloc2-nwfeltsw01

Not so smart when a non-Nagios originated alert is received

–‘swfeltsw01 – RAID battery backup failure’ from HP Insight for example
–The external alert processor has to perform table lookups before building the appropriate NSCA command for example
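
The table lookup described above might look something like the sketch below. It maps the bare host name from an external alert to the customer-prefixed Nagios name, then builds a passive check line in the tab-separated form that send_nsca reads on standard input (host, service, return code, output). The lookup table contents, the service description and the idea that the customer is already known (for instance from the sender of the alert email) are assumptions for illustration.

```python
from typing import Dict, Tuple

# Hypothetical lookup table: (customer, external host name) -> Nagios host_name.
NAME_TABLE: Dict[Tuple[str, str], str] = {
    ("custloc1", "swfeltsw01"): "Custloc1-swfeltsw01",
    ("custloc2", "nwfeltsw01"): "Custloc2-nwfeltsw01",
}


def build_nsca_line(customer: str, external_host: str, service: str,
                    return_code: int, output: str) -> str:
    """Translate an external alert into one send_nsca input line."""
    key = (customer.lower(), external_host.lower())
    nagios_host = NAME_TABLE.get(key)
    if nagios_host is None:
        raise KeyError(f"no Nagios name for {external_host!r} at {customer!r}")
    if return_code not in (0, 1, 2, 3):   # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError(f"invalid return code: {return_code}")
    # Tabs and newlines would break the field layout, so flatten them.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{nagios_host}\t{service}\t{return_code}\t{clean}\n"


line = build_nsca_line("custloc1", "swfeltsw01", "HP Insight",
                       2, "RAID battery backup failure")
print(repr(line))
```

Without the customer as part of the key, a name such as server01 would match several tenants, which is the problem the slide is pointing at.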

Page 23: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Futures & Blue Sky thinking

The Nagios Visualisation is resource heavy

All Customers want their own Dashboard

All Customers want a different screen layout

Why not move the visualisation into the cloud?

Use an Amazon EC2 image to access central Livestatus via https

Allow end user to authenticate

Customer portal allows ‘spin up’ & ‘spin down’ of images

–Move billing to the customer
–Scale horizontally for Visualisation
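
A per-customer dashboard of this kind would need to ask Livestatus only for that customer's objects. Below is a minimal sketch of building such a query in the Livestatus Query Language. It assumes each customer's hosts sit in a host group named after the customer, which is not stated in the presentation, and it leaves out the transport (the https access mentioned above) and authentication entirely.

```python
from typing import Sequence


def problem_services_query(
    hostgroup: str,
    columns: Sequence[str] = ("host_name", "description", "state", "plugin_output"),
) -> str:
    """Build a Livestatus query for one customer's non-OK services.

    `hostgroup` is assumed to be the per-customer host group name.
    """
    if not hostgroup or any(ch.isspace() for ch in hostgroup):
        raise ValueError("hostgroup must be a single non-empty word")
    lines = [
        "GET services",
        "Columns: " + " ".join(columns),
        f"Filter: host_groups >= {hostgroup}",   # '>=' means 'list contains'
        "Filter: state != 0",                    # 0 is OK, so keep only problems
        "OutputFormat: json",
    ]
    # A Livestatus query is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


print(problem_services_query("custloc1"))
```

Restricting the query on the server side, rather than filtering in the dashboard, keeps one tenant's data from ever reaching another tenant's image.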

Page 24: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Load Sharing

Plugins like check_wmi_plus put a strain on the monitoring system: they issue a large number of queries that take wall-clock time to complete and parse.

Better to have ‘worker nodes’, via Merlin, Mod-Gearman or similar, to perform these functions – Raspberry Pi for example.

No great expense to add 2 or 3 Pis to customer site configurations, and easy fall back if they fail – no unique locally stored data

Page 25: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

BPI Example

Page 26: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Dashboard Example

Page 27: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Questions ?

Page 28: Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring
