Fact-Based Monitoring

Fact-based Monitoring puppetconf 2014 Alexis Lê-Quôc @alq

description

Your configuration management is fact-based. Your orchestration is fact-based. Is your monitoring fact-based? What does that even mean? Monitoring is very similar to configuration, at least in its expression. Configuration cares about files, services, and hosts being present and in a certain state ("nginx should be running with the following configuration"). Monitoring cares about services being present, running, and in a certain state. Both describe your infrastructure as it should be ("nginx should be running and respond in less than 200ms"). Fact-based monitoring is about being able to control monitoring with the same facts that Puppet uses ("monitor nginx latency wherever Puppet says it should run"). This is in contrast with imperative monitoring ("monitor nginx on host a, b and c") that gets out of sync and leads to mailbox meltdowns from spurious alerts. Using open source and commercial examples, this talk will help you express your monitoring in a way that will feel very natural alongside your Puppet configuration.

Transcript of Fact-Based Monitoring

Page 1: Fact-Based Monitoring

Fact-based Monitoring, puppetconf 2014

Alexis Lê-Quôc @alq

Page 2: Fact-Based Monitoring

Alexis Lê-Quôc, @alq, CTO at Datadog

Page 3: Fact-Based Monitoring

Poll: Monitoring makes me…

happy proud

cry want to hide

Page 4: Fact-Based Monitoring

Puppet brings Automation to Systems Management

Page 5: Fact-Based Monitoring

Improve Monitoring

the way Puppet has improved

Systems Management

Page 6: Fact-Based Monitoring

“The good old days”

• Your “CMDB” was Excel

• SSH in and hack away

• Little time for anything else

Page 7: Fact-Based Monitoring

Then Puppet came…

• Expressive rules that capture expected result

• Using facts and classifiers, a.k.a. metadata to figure out where to apply changes

• That freed up a lot of our time*

* on a per-machine basis

Page 8: Fact-Based Monitoring

–Me (just now)

“Puppet brings immunity of configuration to change in infrastructure”

Page 9: Fact-Based Monitoring

I have seen this before…

Page 10: Fact-Based Monitoring

–C.J. Date (1977)

“[SQL brings] immunity of application to change in storage structure and access strategy”

http://www.cs.berkeley.edu/~brewer/cs262/SystemR.pdf

Page 11: Fact-Based Monitoring

SQL

• 1974 IBM introduces System R and its Structured Query Language

• Expressive rules that capture expected result

• Using facts and predicates, a.k.a. metadata to figure out what data to get

• That freed up a lot of development time

Page 12: Fact-Based Monitoring

SQL

• From a time-consuming, imperative mess (“how”)

• … to expressive data queries (“what”)

SQL query

SELECT (desired facts) FROM (existing facts) WHERE (matching criteria)

Page 13: Fact-Based Monitoring

Puppet

• From a time-consuming, imperative mess (“how”)

• … to expressive configuration queries (“what”)

puppet apply

CHANGE (desired facts) FROM (existing puppet facts) WHERE (matching puppet classes)

Page 14: Fact-Based Monitoring

Is there a pattern?

Page 15: Fact-Based Monitoring

–MCollective overview

“Break free from ever more complex naming conventions for hostnames as a means of identity. Use a very rich set of meta data provided by each machine to address them.”

Page 16: Fact-Based Monitoring

MCollective

• From a time-consuming, imperative mess (“how”)

• … to expressive orchestration queries (“what”)

mco rpc service restart service=nginx -F webpool=A

EXEC (desired actions) FROM (existing puppet facts) WHERE (matching puppet classes)

Page 17: Fact-Based Monitoring

Back to monitoring

• Monitoring is to behavior what Puppet is to configuration

• Monitoring is to behavior what MCollective is to orchestration

Page 18: Fact-Based Monitoring

Monitoring

• From a time-consuming, imperative mess (“how”)

• … to expressive monitoring queries (“what”)

Monitoring query

MONITOR (desired behavior) FROM (existing heartbeats/metrics) WHERE (matching puppet facts)

Page 19: Fact-Based Monitoring

Examples

• “All provisioned web servers in the production environment, datacenter ABC must respond to queries within 200ms”

• “All PostgreSQL servers must have a postgres: bgwriter process running”

• “At least one ActiveMQ server is up to support mcollective”

• Never mention a hostname
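
As a rough illustration, the three rules above can be written as queries over facts rather than hostnames. Everything in this sketch is hypothetical: the `Node` record, the fact names (`role`, `environment`, `datacenter`) and the sample inventory are invented for the example and do not correspond to any particular monitoring tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A host reduced to its facts and observed behavior (hypothetical schema)."""
    facts: dict
    processes: set = field(default_factory=set)
    latency_ms: Optional[float] = None
    up: bool = True


def where(nodes, **criteria):
    """WHERE clause: keep the nodes whose facts match every criterion."""
    return [n for n in nodes if all(n.facts.get(k) == v for k, v in criteria.items())]


def web_latency_ok(nodes):
    # "All web servers in production, datacenter ABC respond within 200ms"
    scope = where(nodes, role="web", environment="production", datacenter="ABC")
    return all(n.latency_ms is not None and n.latency_ms < 200 for n in scope)


def postgres_bgwriter_ok(nodes):
    # "All PostgreSQL servers must have a postgres: bgwriter process running"
    return all("postgres: bgwriter" in n.processes for n in where(nodes, role="postgres"))


def activemq_ok(nodes):
    # "At least one ActiveMQ server is up"
    return any(n.up for n in where(nodes, role="activemq"))


# Sample inventory (made up). No rule above refers to a hostname.
inventory = [
    Node({"role": "web", "environment": "production", "datacenter": "ABC"}, latency_ms=120.0),
    Node({"role": "web", "environment": "staging", "datacenter": "ABC"}, latency_ms=900.0),
    Node({"role": "postgres"}, processes={"postgres: bgwriter"}),
    Node({"role": "activemq"}, up=False),
    Node({"role": "activemq"}, up=True),
]

print(web_latency_ok(inventory), postgres_bgwriter_ok(inventory), activemq_ok(inventory))
```

Adding a node to the inventory changes which hosts each rule covers without touching the rules themselves, which is the point of the last bullet.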

Page 20: Fact-Based Monitoring

Hosts are not the center of the monitoring universe.

Facts are!

Hosts are just places where facts occur.

Page 21: Fact-Based Monitoring

The proof is in the pudding…

Page 22: Fact-Based Monitoring

Hosts at the center of the universe, a.k.a. the Wrong Way

Page 23: Fact-Based Monitoring

–Nagios Core 4 manual on monitoring clusters

“Its fairly straightforward, so hopefully you find things easy to understand…”

Page 24: Fact-Based Monitoring

Host-centric: Monitor a DNS cluster

check_command check_service_cluster!"DNS Cluster"!0!1!$SERVICESTATEID:host1:DNS Service$,$SERVICESTATEID:host2:DNS Service$,$SERVICESTATEID:host3:DNS Service$

Where do host1, host2, host3 come from?

Page 25: Fact-Based Monitoring

Host-centric: can’t use facts directly

• “Host groups solve this problem”. No, they don’t.

• Combinatorial explosion, e.g. trivially

• 4 data centers (us-1, us-2, eu, apac)

• 5 classes (web, db, cache, appserver, hadoop)

• 3 environments (test, staging, prod)

• => up to 119 materialized host groups
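
The 119 figure can be checked with a short count. The sketch below assumes that a host group pins a value for some non-empty subset of the three dimensions and leaves the others unconstrained; that reading is an assumption, chosen because it reproduces the number on the slide.

```python
from itertools import product

# Dimensions and values taken from the slide.
DATACENTERS = ["us-1", "us-2", "eu", "apac"]
CLASSES = ["web", "db", "cache", "appserver", "hadoop"]
ENVIRONMENTS = ["test", "staging", "prod"]

ANY = None  # wildcard: the host group does not constrain this dimension


def materialized_host_groups():
    """Enumerate every host group that pins at least one dimension."""
    options = [values + [ANY] for values in (DATACENTERS, CLASSES, ENVIRONMENTS)]
    return [combo for combo in product(*options) if any(v is not ANY for v in combo)]


groups = materialized_host_groups()
print(len(groups))  # (4 + 1) * (5 + 1) * (3 + 1) - 1 = 119
```

Adding a fourth dimension with only two values would multiply the first factor set by three, which is why hand-maintained host groups do not scale with the number of facts.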

Page 26: Fact-Based Monitoring

Nagios-bashing?

• No!

• Same fatal flaw with all host-centric monitoring tools

• Host-centric monitoring forces an extra, expensive step:

• replicate fact-based conditionals in host-centric templates

Page 27: Fact-Based Monitoring

–puppet-nagios author

“Please note that this module is not for the faint of heart. Even I (the author) have my head hurt each time I have to make modifications to it…”

Page 28: Fact-Based Monitoring

Facts at the center of the universe, a.k.a. the Right Way

"De Revolutionibus manuscript p9b" by Nicolas Copernicus - www.bj.uj.edu.pl. Licensed under Public domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:De_Revolutionibus_manuscript_p9b.jpg#mediaviewer/File:De_Revolutionibus_manuscript_p9b.jpga

Page 29: Fact-Based Monitoring

Earlier Examples

• “All provisioned web servers in the production environment, datacenter ABC must respond to queries within 200ms”

• “All PostgreSQL servers must have a postgres: bgwriter process running”

• “At least one ActiveMQ server is up to support mcollective”

Page 30: Fact-Based Monitoring

In Sensu (heartbeats)

• “All PostgreSQL servers must have a postgres: bgwriter process running”

class postgres::monitoring::sensu {
  sensu::subscription { 'postgres': }
}

• Monitoring using a fact-based query

• Is node of class “postgres” and subscribed to “postgres” or not?

• If so, it will execute the postgres check

Page 31: Fact-Based Monitoring

In Datadog (metrics)

• “All provisioned web servers in the production environment, datacenter ABC must respond to queries within 200ms”

$ puppet module install datadog-datadog_agent

class { 'datadog_agent':
  api_key       => …,
  tags          => [$environment],
  facts_to_tags => ['datacenter'],
}
include datadog_agent::integrations::nginx

Page 32: Fact-Based Monitoring

In Datadog (metrics)

• Monitoring using a fact-based query

• Puppet facts directly reused

max(nginx.request.latency{production,datacenter:ABC}) < 200
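
To make the semantics of such a query concrete, here is a toy evaluator for `max(metric{tags}) < threshold`. It is only a sketch of the idea and not Datadog's implementation or API: the `evaluate_max` function, the series layout and the sample values are all hypothetical.

```python
def matches(series_tags, scope):
    """A series is in scope when it carries every tag named in the scope."""
    return set(scope) <= set(series_tags)


def evaluate_max(series, metric, scope, threshold):
    """Return True when max(metric{scope}) < threshold, None when there is no data."""
    values = [
        value
        for s in series
        if s["metric"] == metric and matches(s["tags"], scope)
        for value in s["points"]
    ]
    if not values:
        return None  # nothing reports in this scope, so the condition is undecided
    return max(values) < threshold


# Made-up series; tags mirror the Puppet environment and the datacenter fact.
series = [
    {"metric": "nginx.request.latency", "tags": ["production", "datacenter:ABC"], "points": [120, 180]},
    {"metric": "nginx.request.latency", "tags": ["production", "datacenter:XYZ"], "points": [450]},
    {"metric": "nginx.request.latency", "tags": ["staging", "datacenter:ABC"], "points": [900]},
]
scope = ["production", "datacenter:ABC"]

print(evaluate_max(series, "nginx.request.latency", scope, 200))
```

A newly provisioned web server that reports with the same two tags joins the scope automatically; the query never has to be edited to name it.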

Page 33: Fact-Based Monitoring

What to take away

Page 34: Fact-Based Monitoring

Fact-based monitoring

1. Hosts are not at the center of the monitoring universe

2. Expressive monitoring uses queries

3. Monitoring queries should use Puppet facts

Page 35: Fact-Based Monitoring

Thank you!