Monitoring NGINX (plus): key metrics and how-to

Monitoring nginx Alexis Lê-Quôc, Datadog @alq

description

NGINX just works, and that's why we use it. That does not mean it should be left unmonitored. As a web server, it plays a central role in a modern infrastructure. As a gatekeeper, it sees every interaction with the application. If you monitor it properly, it can explain a lot about what is happening in the rest of your infrastructure. In this talk you will learn more about NGINX (plus) metrics, what they mean, and how to use them. You will also learn different methods (status, statsd, logs) to monitor NGINX, with their pros and cons, illustrated with real data coming from real servers.

Transcript of Monitoring NGINX (plus): key metrics and how-to

Page 1: Monitoring NGINX (plus): key metrics and how-to

Monitoring nginx

Alexis Lê-Quôc, Datadog

@alq

Page 2: Monitoring NGINX (plus): key metrics and how-to

Agenda

• Dramatis personae
• Observations
• Monitoring 1 nginx (plus) with logs
• Monitoring 1 nginx (plus) with metrics
• Monitoring N nginx effectively

Page 3: Monitoring NGINX (plus): key metrics and how-to

@alq CTO at Datadog

Page 4: Monitoring NGINX (plus): key metrics and how-to

Datadog == monitoring

• Monitoring as a service
• Works really well with large, dynamic environments (e.g. clouds)
• Aggregate performance metrics
• Correlate nginx performance with the rest of your infrastructure

Page 7: Monitoring NGINX (plus): key metrics and how-to

Observations

From the field

Page 8: Monitoring NGINX (plus): key metrics and how-to

Some stats

• Across all monitored servers
  • nginx ~10%
  • Apache ~5%
• CPU and CPU/$ is the dominant resource

Page 9: Monitoring NGINX (plus): key metrics and how-to

% of instances per core count

[Bar chart. X axis: core count (1, 2, 4, 8, 12, 16, 24, 32). Y axis: % of instances, 0% to 40%. Bar labels in extraction order: 10%, 1%, 3%, 10%, 30%, 7%, 39%, 10%.]

Page 10: Monitoring NGINX (plus): key metrics and how-to

% of instances per type (AWS only)

[Bar chart. X axis: EC2 type (c3.l, c3.2xl, c1.xl, c3.8xl, m3.l, c3.xl, m3.m, cc2.8xl, t2.m, c3.4xl, rest). Y axis: % of instances, 0% to 30%. Bar labels in extraction order: 8.6%, 3.1%, 4.4%, 4.5%, 4.7%, 5%, 5.3%, 7.6%, 13%, 14%, 30%.]

Page 11: Monitoring NGINX (plus): key metrics and how-to

Monitoring nginx

1. Monitoring with logs
2. Monitoring with status
3. Monitoring with statsd

Page 12: Monitoring NGINX (plus): key metrics and how-to

Monitoring with logs

• Canonical example of log indexers
• Your choice of:
  • logstash
  • splunk
  • logentries, sumologic, loggly, etc.

nginx log forwarder indexer UI

Page 13: Monitoring NGINX (plus): key metrics and how-to

Monitoring with logs

nginx log forwarder indexer UI

Strengths                  Weaknesses
forensics & anomalies      low signal-to-noise ratio
content-driven analysis    “black box”
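As a rough illustration of the content-driven analysis that logs allow, here is a minimal Python sketch that counts responses per status class. It assumes the access log uses nginx's default "combined" format; the sample lines and the `status_classes` helper are made up for the example and are not from the talk.

```python
import re
from collections import Counter

# Assumption: the default "combined" log format. A custom log_format
# directive would need a different pattern.
COMBINED = re.compile(
    r'(?P<addr>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def status_classes(lines):
    """Count responses per status class (2xx, 5xx, ...), skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = COMBINED.match(line)
        if match:
            counts[match.group("status")[0] + "xx"] += 1
    return counts

# Hypothetical sample lines, for illustration only.
sample = [
    '203.0.113.7 - - [10/Oct/2014:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "curl/7.35.0"',
    '203.0.113.9 - - [10/Oct/2014:13:55:37 +0000] "GET /api HTTP/1.1" 502 172 "-" "curl/7.35.0"',
]
print(dict(status_classes(sample)))
```

A log indexer does far more than this (search, retention, dashboards); the sketch only shows why every request has to be parsed before anything can be counted.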

Page 14: Monitoring NGINX (plus): key metrics and how-to

Monitoring with metrics

• open-source: ngx_http_stub_status_module
  • bare-bones metrics
  • human-readable text presentation

• plus: ngx_http_status_module
  • a lot more metrics for each function
  • JSON format

• Your choice of…
  • Datadog, Nagios, Zabbix, etc. for open-source
  • Datadog for nginx plus

nginx status collector aggregator UI/alerts

Page 15: Monitoring NGINX (plus): key metrics and how-to

Monitoring with metrics

nginx status collector aggregator UI/alerts

Strengths                  Weaknesses
lightweight & real-time    no insight into content
“white box”
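To make the "collector" step concrete, here is a minimal Python sketch that parses the human-readable page produced by ngx_http_stub_status_module. The sample text follows the layout shown in the nginx documentation; the `parse_stub_status` helper is hypothetical, and a real collector would first fetch the page over HTTP from whichever location has `stub_status` enabled.

```python
import re

# Sample page in the stub_status layout; the numbers are illustrative.
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""

def parse_stub_status(text):
    """Turn the stub_status text page into a dict of integers."""
    active = int(re.search(r"Active connections:\s+(\d+)", text).group(1))
    # The three counters sit on their own line, below the header line.
    accepts, handled, requests = map(
        int, re.search(r"\n\s*(\d+)\s+(\d+)\s+(\d+)", text).groups()
    )
    reading, writing, waiting = map(
        int,
        re.search(r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text).groups(),
    )
    return {
        "active": active,      # gauge
        "accepts": accepts,    # counter
        "handled": handled,    # counter
        "requests": requests,  # counter
        "reading": reading,    # gauge
        "writing": writing,    # gauge
        "waiting": waiting,    # gauge (idle keep-alive connections)
    }

print(parse_stub_status(SAMPLE))
```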

Page 16: Monitoring NGINX (plus): key metrics and how-to

Simple metrics taxonomy

1. What it measures
  • Work or resource
  • Focus on work because work == value
  • Resource analysis useful to understand performance
  • Use Brendan Gregg’s USE method
    • Utilization (% over time)
    • Saturation (queue length)
    • Errors (count over time)

2. Type
  • Gauge: point-in-time sample
  • Counter: accumulated sample, needs to be converted to a rate (derived) to be meaningful

http://www.brendangregg.com/usemethod.html
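The gauge/counter distinction matters in practice: a counter such as total requests only becomes useful once two samples are turned into a rate. A minimal sketch, assuming each sample is a `(timestamp_seconds, counter_value)` pair; the `rate` helper is illustrative, not part of any tool named in the talk.

```python
def rate(prev, curr):
    """Per-second rate between two counter samples, or None if it cannot be computed.

    prev, curr: (timestamp_seconds, counter_value) tuples.
    """
    (t0, v0), (t1, v1) = prev, curr
    if t1 <= t0 or v1 < v0:
        # No elapsed time, or the counter went backwards (e.g. nginx restarted).
        return None
    return (v1 - v0) / (t1 - t0)

print(rate((0, 1000), (10, 1250)))  # 250 more requests in 10 s -> 25.0
```

Returning None on a counter reset, rather than a negative rate, is one possible design choice; a collector could also treat the new value as the delta.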

Page 17: Monitoring NGINX (plus): key metrics and how-to

Open-source metrics

Class                 Type     Resource/Work  Notes
Current connections   Gauge    Resource       reading, writing, idle
Accepted connections  Counter  Resource
Handled connections   Counter  Resource       <= accepted if resource limit
Requests              Counter  Work           True purpose of the server

• Latency must be measured using logs or statsd.
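Since handled connections only fall below accepted connections when a resource limit is hit, the gap between the two counters is a simple saturation signal. A minimal sketch, assuming two successive samples stored as dicts with hypothetical `accepts` and `handled` keys:

```python
def dropped_connections(prev, curr):
    """Connections accepted but not handled between two samples of the counters."""
    accepted = curr["accepts"] - prev["accepts"]
    handled = curr["handled"] - prev["handled"]
    return accepted - handled

# Illustrative values: 500 connections accepted, 497 handled in the interval.
prev = {"accepts": 10_000, "handled": 10_000}
curr = {"accepts": 10_500, "handled": 10_497}
print(dropped_connections(prev, curr))  # 3
```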

Page 18: Monitoring NGINX (plus): key metrics and how-to

Key “plus” metrics

Class                     Type     Resource/Work  Notes
5xx Errors                Counter  Work           without log analysis
5xx/sum(Nxx)              Gauge    Work           error rate %
idle/dropped connections  Gauge    Resource       saturation
active/total connections  Gauge    Resource       upstream capacity
Requests                  Counter  Work           true purpose of the server

• Latency must be measured using logs or statsd.
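The 5xx/sum(Nxx) error rate can be sketched as follows. The sketch assumes the per-class response counters have already been extracted from the status JSON into plain dicts keyed "1xx" through "5xx"; that layout and the `error_rate` helper are assumptions for illustration, not the module's documented schema.

```python
CLASSES = ("1xx", "2xx", "3xx", "4xx", "5xx")

def error_rate(prev, curr):
    """Percentage of 5xx responses among all responses between two counter samples."""
    deltas = {c: curr[c] - prev[c] for c in CLASSES}
    total = sum(deltas.values())
    if total <= 0:
        return None  # no traffic in the interval, or counters were reset
    return 100.0 * deltas["5xx"] / total

# Illustrative samples: 1000 responses in the interval, 10 of them 5xx.
prev = {"1xx": 0, "2xx": 900, "3xx": 50, "4xx": 40, "5xx": 10}
curr = {"1xx": 0, "2xx": 1850, "3xx": 80, "4xx": 50, "5xx": 20}
print(error_rate(prev, curr))  # 1.0
```

Computing the ratio on deltas rather than on raw counters keeps it a gauge of recent behavior instead of an average since the server started.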

Page 19: Monitoring NGINX (plus): key metrics and how-to

Monitoring with statsd

nginx statsd UI/alerts

Strengths                          Weaknesses
lightweight, real-time, standard   not comprehensive
custom metrics, content-aware

https://github.com/zebrafishlabs/nginx-statsd
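For readers unfamiliar with statsd, the "standard" here is a plain-text line protocol, `<name>:<value>|<type>`, usually sent over UDP. The Python sketch below shows only that wire format; it is not the nginx-statsd module's configuration, and the metric names and the `statsd_line` and `send` helpers are made up for the example.

```python
import socket

def statsd_line(name, value, kind):
    """Format one metric in the statsd line protocol: <name>:<value>|<type>."""
    return f"{name}:{value}|{kind}"

def send(lines, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; 8125 is the conventional statsd port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto("\n".join(lines).encode("ascii"), (host, port))

# Hypothetical metric names, for illustration only.
lines = [
    statsd_line("nginx.requests", 1, "c"),         # counter increment
    statsd_line("nginx.request_time", 320, "ms"),  # timer, in milliseconds
]
print(lines)
# send(lines) would emit both metrics in one datagram to a local statsd daemon.
```

Because the timer type carries a duration per request, this is one way to get the latency that the status pages do not expose.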

Page 20: Monitoring NGINX (plus): key metrics and how-to

Example

Page 21: Monitoring NGINX (plus): key metrics and how-to

Monitoring nginx

1. Logs for content-analysis (forensics, anomalies, marketing)
2. Status for (white box) performance monitoring
3. statsd for custom metrics

No single method gives you everything you need.

Page 22: Monitoring NGINX (plus): key metrics and how-to

Monitoring a lot of nginx

1. Requires aggregation
2. It’s all about Metadata (“Pet-to-cattle” mindset)
3. Correlation

Page 23: Monitoring NGINX (plus): key metrics and how-to

Aggregation

• By default for log-based monitoring
• Not by default for metric-based monitoring

Page 24: Monitoring NGINX (plus): key metrics and how-to

Metadata

• Analyze by properties that are not the host identity
• Find anomalies that are not obvious
• Pet-to-cattle evolution: hosts don’t matter, services do
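Aggregating by metadata rather than by host can be sketched in a few lines of Python. The host names, tag keys ("role", "az") and metric name below are hypothetical; the point is only that the grouping key is a property of the service, not the host identity.

```python
from collections import defaultdict

# Hypothetical per-host samples, each carrying metadata tags.
samples = [
    {"host": "web-1", "tags": {"role": "frontend", "az": "us-east-1a"}, "requests_per_s": 120.0},
    {"host": "web-2", "tags": {"role": "frontend", "az": "us-east-1b"}, "requests_per_s": 80.0},
    {"host": "api-1", "tags": {"role": "api", "az": "us-east-1a"}, "requests_per_s": 45.0},
]

def aggregate(samples, tag, metric):
    """Sum a metric across hosts, grouped by the value of one tag."""
    totals = defaultdict(float)
    for sample in samples:
        totals[sample["tags"].get(tag, "untagged")] += sample[metric]
    return dict(totals)

print(aggregate(samples, "role", "requests_per_s"))
print(aggregate(samples, "az", "requests_per_s"))
```

The same samples answer two different questions (per role, per availability zone) without any host name appearing in the result.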

Page 25: Monitoring NGINX (plus): key metrics and how-to

Correlation

• nginx is only one piece of the infrastructure

Page 26: Monitoring NGINX (plus): key metrics and how-to

#plug

www.datadog.com

Page 27: Monitoring NGINX (plus): key metrics and how-to

Thank you!

Questions/Comments? @alq