Monitoring NGINX (plus): key metrics and how-to

Monitoring nginx
Alexis Lê-Quôc, Datadog

@alq

Agenda
• Dramatis personae
• Observations
• Monitoring 1 nginx (plus) with logs
• Monitoring 1 nginx (plus) with metrics
• Monitoring N nginx effectively

@alq CTO at Datadog

Datadog == monitoring
• Monitoring as a service
• Works really well with large, dynamic environments (e.g. clouds)
• Aggregates performance metrics
• Correlates nginx performance with the rest of your infrastructure

Observations
From the field

Some stats
• Across all monitored servers:
  • nginx ~10%
  • Apache ~5%
• CPU (and CPU/$) is the dominant resource

[Chart: % of instances per core count. X-axis: core count (1, 2, 4, 8, 12, 16, 24, 32). Y-axis: % of instances (0% to 40%).]

[Chart: % of instances per type (AWS only). X-axis: EC2 type (c3.l, c3.2xl, c1.xl, c3.8xl, m3.l, c3.xl, m3.m, cc2.8xl, t2.m, c3.4xl, rest). Y-axis: % of instances (0% to 30%).]

Monitoring nginx
1. Monitoring with logs
2. Monitoring with status
3. Monitoring with statsd

Monitoring with logs

• Canonical example of log indexers
• Your choice of:
  • logstash
  • splunk
  • logentries, sumologic, loggly, etc.

nginx → log forwarder → indexer → UI

Strengths                  Weaknesses
forensics & anomalies      low signal-to-noise ratio
content-driven analysis    “black box”
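As a rough illustration of what an indexer does with each line, here is a minimal Python sketch that parses access-log lines and counts responses per status class. It assumes nginx's default "combined" log format; the function name `count_status_classes` is made up for this example, and a real pipeline would normally rely on one of the tools listed above.

```python
import re
from collections import Counter

# Assumption: logs use nginx's default "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'^(?P<addr>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def count_status_classes(lines):
    """Count responses per status class (2xx, 4xx, 5xx, ...).

    Lines that do not match the assumed format are counted as "unparsed"
    rather than silently dropped.
    """
    counts = Counter()
    for line in lines:
        match = COMBINED.match(line.strip())
        if match is None:
            counts["unparsed"] += 1
            continue
        # First digit of the status code gives the class, e.g. "5" -> "5xx".
        counts[match.group("status")[0] + "xx"] += 1
    return counts
```

Counting unparsed lines separately is a deliberate choice: a sudden rise in that bucket is itself a signal that the log format changed.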

Monitoring with metrics

• open-source: ngx_http_stub_status_module
  • bare-bone metrics
  • human-readable text presentation
• plus: ngx_http_status_module
  • a lot more metrics for each function
  • JSON format
• Your choice of…
  • Datadog, Nagios, Zabbix, etc. for open-source
  • Datadog for nginx plus

nginx status → collector → aggregator → UI/alerts

Strengths                  Weaknesses
lightweight & real-time    no insight into content
“white box”
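To make the collector step concrete, here is a minimal Python sketch that turns the human-readable stub_status page into numbers. The sample text follows the layout shown in the ngx_http_stub_status_module documentation; the names `parse_stub_status` and `SAMPLE` are illustrative, and fetching the page over HTTP is left out on purpose.

```python
import re

# Layout of the stub_status page; whitespace is matched loosely.
STUB_STATUS = re.compile(
    r"Active connections:\s+(?P<active>\d+)\s+"
    r"server accepts handled requests\s+"
    r"(?P<accepts>\d+)\s+(?P<handled>\d+)\s+(?P<requests>\d+)\s+"
    r"Reading:\s+(?P<reading>\d+)\s+"
    r"Writing:\s+(?P<writing>\d+)\s+"
    r"Waiting:\s+(?P<waiting>\d+)"
)

# Example page body, values taken from the module documentation.
SAMPLE = (
    "Active connections: 291\n"
    "server accepts handled requests\n"
    " 16630948 16630948 31070465\n"
    "Reading: 6 Writing: 179 Waiting: 106\n"
)

def parse_stub_status(text):
    """Return the stub_status values as a dict of integers."""
    match = STUB_STATUS.search(text)
    if match is None:
        raise ValueError("unrecognized stub_status output")
    return {name: int(value) for name, value in match.groupdict().items()}
```

In this output, `active`, `reading`, `writing` and `waiting` are gauges, while `accepts`, `handled` and `requests` are counters that only become useful once compared across two samples.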

Simple metrics taxonomy
1. What it measures
  • Work or resource
  • Focus on work because work == value
  • Resource analysis is useful to understand performance
  • Use Brendan Gregg’s USE method (http://www.brendangregg.com/usemethod.html)
    • Utilization (% over time)
    • Saturation (queue length)
    • Errors (count over time)
2. Type
  • Gauge: a point-in-time sample
  • Counter: an accumulated sample; it must be differentiated over time (turned into a rate) to be meaningful
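The gauge/counter distinction matters in practice because a raw counter value says little on its own. A minimal Python sketch of deriving a per-second rate from two counter samples follows; the function name `rate` and the `(timestamp, value)` tuple shape are assumptions made for this example.

```python
def rate(prev, curr):
    """Derive a per-second rate from two counter samples.

    Each sample is a (timestamp_in_seconds, counter_value) tuple.
    Returns None when no meaningful rate can be computed.
    """
    (t0, v0), (t1, v1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0:
        return None  # samples out of order or taken at the same instant
    if v1 < v0:
        return None  # counter went backwards, e.g. the server restarted
    return (v1 - v0) / elapsed
```

For instance, a requests counter that reads 100 at t=0 and 250 at t=10 gives `rate((0, 100), (10, 250))`, i.e. 15 requests per second. A gauge such as current connections needs no such step.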

Open-source metrics

Class                  Type      Resource/Work   Notes
Current connections    Gauge     Resource        reading, writing, idle
Accepted connections   Counter   Resource
Handled connections    Counter   Resource        < accepted when a resource limit is reached
Requests               Counter   Work            true purpose of the server

• Latency must be measured using logs or statsd.
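Since handled connections only fall behind accepted connections when a resource limit is reached, the gap between the two counters is a simple saturation signal. A minimal Python sketch, assuming each sample is a dict with hypothetical `accepts` and `handled` keys holding the raw counter values:

```python
def dropped_connections(prev, curr):
    """Connections accepted but not handled between two samples.

    prev and curr are dicts with "accepts" and "handled" counter values
    (key names are assumptions for this sketch).
    Returns None if either counter went backwards (e.g. after a restart).
    """
    accepted = curr["accepts"] - prev["accepts"]
    handled = curr["handled"] - prev["handled"]
    if accepted < 0 or handled < 0:
        return None
    return accepted - handled
```

A result above zero over an interval suggests the server hit a limit and is worth alerting on; zero is the normal case.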

Key “plus” metrics

Class                      Type      Resource/Work   Notes
5xx errors                 Counter   Work            without log analysis
5xx/sum(Nxx)               Gauge     Work            error rate %
idle/dropped connections   Gauge     Resource        saturation
active/total connections   Gauge     Resource        upstream capacity
Requests                   Counter   Work            true purpose of the server

• Latency must be measured using logs or statsd.
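The 5xx/sum(Nxx) error rate can be sketched in Python as below. The input shape is an assumption: two snapshots of per-class response counters as plain dicts keyed "1xx" through "5xx". This is chosen for illustration and is not a statement about the exact JSON layout of the status module.

```python
STATUS_CLASSES = ("1xx", "2xx", "3xx", "4xx", "5xx")

def error_rate_percent(prev, curr):
    """Percentage of responses in an interval that were 5xx.

    prev and curr map each status class to its cumulative counter value.
    Returns None if any counter went backwards (e.g. after a restart).
    """
    deltas = {cls: curr[cls] - prev[cls] for cls in STATUS_CLASSES}
    if any(delta < 0 for delta in deltas.values()):
        return None
    total = sum(deltas.values())
    if total == 0:
        return 0.0  # no traffic in the interval, so no errors either
    return 100.0 * deltas["5xx"] / total
```

Working on deltas rather than raw counters keeps the rate tied to the most recent interval instead of the whole uptime of the server.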

Monitoring with statsd

nginx → statsd → UI/alerts

Strengths                              Weaknesses
lightweight, real-time, standard       not comprehensive
custom metrics, content-aware

https://github.com/zebrafishlabs/nginx-statsd
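For readers unfamiliar with statsd, the Python sketch below shows the plain-text wire format a statsd client sends over UDP. It does not show how the nginx module linked above is configured. The metric names, the helper names `format_metric` and `send_metric`, and the local daemon on port 8125 (the usual statsd default) are assumptions for illustration.

```python
import socket

def format_metric(name, value, kind, sample_rate=1.0):
    """Build one statsd line: <name>:<value>|<type>[|@<sample_rate>].

    kind is "c" (counter), "g" (gauge) or "ms" (timer).
    """
    if kind not in ("c", "g", "ms"):
        raise ValueError("unsupported metric type: %r" % (kind,))
    line = f"{name}:{value}|{kind}"
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line

def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP (fire-and-forget, no reply expected)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `format_metric("nginx.requests", 1, "c")` yields `nginx.requests:1|c`. Because delivery is UDP, emitting metrics adds very little overhead, which matches the "lightweight, real-time" strength listed above.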

Example

Monitoring nginx
1. Logs for content analysis (forensics, anomalies, marketing)
2. Status for (white box) performance monitoring
3. statsd for custom metrics

No single method gives you everything you need.

Monitoring a lot of nginx
1. Requires aggregation
2. It’s all about metadata (“pet-to-cattle” mindset)
3. Correlation

Aggregation
• By default for log-based monitoring
• Not by default for metric-based monitoring

Metadata
• Analyze by properties that are not the host identity
• Find anomalies that are not obvious
• Pet-to-cattle evolution: hosts don’t matter, services do
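One way to picture aggregation by metadata is to group per-host samples by a tag instead of by hostname. The Python sketch below assumes a hypothetical sample shape (`{"host": ..., "tags": {...}, "value": ...}`) and a hypothetical helper `aggregate_by_tag`; it is not the data model of any particular monitoring product.

```python
from collections import defaultdict

def aggregate_by_tag(samples, tag):
    """Sum metric values across hosts, grouped by the value of one tag.

    Samples missing the tag are grouped under "untagged" so they stay visible.
    """
    totals = defaultdict(float)
    for sample in samples:
        key = sample["tags"].get(tag, "untagged")
        totals[key] += sample["value"]
    return dict(totals)
```

Grouping requests per second by a `role` or `availability-zone` tag answers questions about the service as a whole, regardless of which individual hosts happen to be running at the time.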

Correlation
• nginx is only one piece of the infrastructure

#plug
http://www.datadog.com

Thank you!
Questions/Comments? @alq