Prometheus · 2020-06-24 · 2 Linux (Ubuntu) SDN (Cisco) Chef Terraform Runtastic Infrastructure...
Migrating from Nagios to Prometheus
NOV 07, 2019
Runtastic Infrastructure
● Base
  ○ Linux (Ubuntu)
  ○ SDN (Cisco)
  ○ Chef
  ○ Terraform
● Virtualization
  ○ Linux KVM
  ○ OpenNebula
  ○ 3600 CPU Cores
  ○ 20 TB Memory
  ○ 100 TB Storage
● Physical
  ○ Hybrid
  ○ Big Core DBs
● Technologies
  ○ Really a lot open source
Our Monitoring back in 2017...
● Nagios
  ○ Many Checks for all Servers
  ○ Checks for NewRelic
● Pingdom
  ○ External HTTP Checks
  ○ Specific Nagios Alerts
  ○ Alerting via SMS
● NewRelic
  ○ Error Rate
  ○ Response Time
Configuration hell….
Alert overflow...
Goals for our new Monitoring system
● Make On Call as comfortable as possible
● Automate as much as possible
● Make use of graphs
● Rework our alerting
● Make it scalable!
Starting with Prometheus...
Prometheus
Our Prometheus Setup
● 2x Bare Metal
● 8 Core CPU
● Ubuntu Linux
● 7.5 TB of Storage
● 7 months of Retention time
● Internal TSDB
Automation
Our Goals for Automation
● Roll out Exporters on new servers automatically
  ○ using Chef
● Use Service Discovery in Prometheus
  ○ using Consul
● Add HTTP Healthcheck for a new Microservice
  ○ using Terraform
● Add Silences with 30d duration
  ○ using Terraform
Consul
● Consul for our Terraform State
● Agent Rollout via Chef
● One Service definition per Exporter on each Server
Consul
What Labels do we need?
● What’s the Load of all workers of our Newsfeed service?
  ○ node_load1{service="newsfeed", role="workers"}
● What’s the Load of a specific Leaderboard server?
  ○ node_load1{hostname="prd-leaderboard-server-001"}
...and how we implemented them in Consul
{
  "service": {
    "name": "prd-sharing-server-001-mongodbexporter",
    "tags": [
      "prometheus",
      "role:trinidad",
      "service:sharing",
      "exporter:mongodb"
    ],
    "port": 9216
  }
}
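The slides say there is one service definition per exporter on each server, rolled out via Chef. The Chef recipe itself is not shown, so the following is only a minimal Python sketch of how such a definition could be generated from a few per-host attributes. The function name `exporter_service` and its parameters are hypothetical; the naming scheme and tag layout are copied from the example above.

```python
import json


def exporter_service(hostname: str, service: str, role: str,
                     exporter: str, port: int) -> dict:
    """Build one Consul service definition for one exporter on one server.

    Hypothetical helper: it only mirrors the tag layout from the slide.
    """
    return {
        "service": {
            "name": f"{hostname}-{exporter}exporter",
            "tags": [
                "prometheus",            # marks the service as a scrape target
                f"role:{role}",
                f"service:{service}",
                f"exporter:{exporter}",
            ],
            "port": port,
        }
    }


definition = exporter_service(
    hostname="prd-sharing-server-001",
    service="sharing",
    role="trinidad",
    exporter="mongodb",
    port=9216,
)
print(json.dumps(definition, indent=2))
```

Encoding labels as `key:value` tags keeps the Consul side generic; the mapping to Prometheus labels happens later in the relabeling rules.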
Scrape Configuration
- job_name: prd
  consul_sd_configs:
    - server: 'prd-consul:8500'
      token: 'ourconsultoken'
      datacenter: 'lnz'
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: .*,prometheus,.*
      action: keep
    - source_labels: [__meta_consul_node]
      target_label: hostname
    - source_labels: [__meta_consul_tags]
      regex: .*,service:([^,]+),.*
      replacement: '${1}'
      target_label: service
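The regexes in this job rely on a detail of Consul service discovery: Prometheus joins the tags with the separator (`,` by default) and also puts a separator at the start and end, so every tag is surrounded by commas regardless of its position. Relabel regexes are fully anchored. The sketch below imitates both behaviours in Python to show what the `keep` and `service` rules do; the helper names are hypothetical and this is not Prometheus code.

```python
import re


def consul_tags_label(tags):
    """Imitate __meta_consul_tags: tags joined by ',' and wrapped in ','."""
    return "," + ",".join(tags) + ","


def keep_target(tags_label):
    """The 'keep' rule: the target survives only if the regex matches fully."""
    return re.fullmatch(r".*,prometheus,.*", tags_label) is not None


def service_label(tags_label):
    """The 'service' rule: copy capture group 1 into the target label."""
    match = re.fullmatch(r".*,service:([^,]+),.*", tags_label)
    # Without a match the replace action leaves the label unset.
    return match.group(1) if match else None


label = consul_tags_label(
    ["prometheus", "role:trinidad", "service:sharing", "exporter:mongodb"]
)
print(label)
print(keep_target(label), service_label(label))
```

Because of the surrounding commas, a tag such as `notprometheus` does not accidentally satisfy the `keep` rule.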
External Health Checks
● 3x Blackbox Exporters
● Accessing SSL Endpoints
● Checks for
  ○ HTTP Response Code
  ○ SSL Certificate
  ○ Duration
Add Healthcheck via Terraform
resource "consul_service" "health_check" {
  name = "${var.srv_name}-healthcheck"
  node = "blackbox_aws"

  tags = [
    "healthcheck",
    "url:https://status.runtastic.com/${var.srv_name}",
    "service:${var.srv_name}",
  ]
}
Job Config for Blackbox Exporters
- job_name: blackbox_aws
  metrics_path: /probe
  params:
    module: [http_health_monitor]
  consul_sd_configs:
    - server: 'prd-consul:8500'
      token: 'ourconsultoken'
      datacenter: 'lnz'
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: .*,healthcheck,.*
      action: keep
    - source_labels: [__meta_consul_tags]
      regex: .*,url:([^,]+),.*
      replacement: '${1}'
      target_label: __param_target
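A label named `__param_<name>` becomes a URL parameter of the scrape request, so the last rule turns the `url:` tag into the `target` parameter of the probe. The following Python sketch shows the request this job would roughly produce. The exporter address `blackbox-aws:9115` is an assumption for illustration (9115 is the Blackbox Exporter default port), and `probe_url` is a hypothetical helper.

```python
import re
from urllib.parse import urlencode


def probe_url(exporter_address, tags_label, module="http_health_monitor"):
    """Build the /probe URL that Prometheus would scrape for one target."""
    match = re.fullmatch(r".*,url:([^,]+),.*", tags_label)
    if match is None:
        return None  # no url: tag, so __param_target would stay unset
    query = urlencode({"module": module, "target": match.group(1)})
    return f"http://{exporter_address}/probe?{query}"


# Tags as written by the Terraform resource for srv_name = "newsfeed",
# in the comma-wrapped form of __meta_consul_tags.
tags_label = ",healthcheck,url:https://status.runtastic.com/newsfeed,service:newsfeed,"
print(probe_url("blackbox-aws:9115", tags_label))
```

The exporter then performs the HTTP request against the target URL and returns metrics such as `probe_success`.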
Add Silence via Terraform
resource "null_resource" "prometheus_silence" {
  provisioner "local-exec" {
    command = <<EOF
      ${var.amtool_path} silence add 'service=~SERVICENAME' \
        --duration='30d' \
        --comment='Silence for the newly deployed service' \
        --alertmanager.url='http://prd-alertmanager:9093'
    EOF
  }
}
OpsGenie
Our Initial Alerting Plan
● Alerts with Low Priority
  ○ Slack Integration
● Alerts with High Priority (OnCall)
  ○ Slack Integration
  ○ OpsGenie
...why not forward all Alerts to OpsGenie?
Define OpsGenie Alert Routing
● Prometheus OnCall Integration
  ○ High Priority Alerts (e.g. Service DOWN)
  ○ Call the poor On Call Person
  ○ Post Alerts to Slack #topic-alerts
● Prometheus Ops Integration
  ○ Low Priority Alerts (e.g. Chef-Client failed runs)
  ○ Disable Notifications
  ○ Post Alerts to Slack #prometheus-alerts
Setup Alertmanager Config
- receiver: 'opsgenie_oncall'
  group_wait: 10s
  group_by: ['...']
  match:
    oncall: 'true'
- receiver: 'opsgenie'
  group_by: ['...']
  group_wait: 10s
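Sibling routes in Alertmanager are checked in order, and without `continue: true` the first matching route handles the alert. A route without matchers matches everything, which makes the second entry the catch-all for low-priority alerts. The Python sketch below models just this ordering; `pick_receiver` and `ROUTES` are hypothetical names and the model ignores nested routes and regex matchers.

```python
def pick_receiver(labels, routes):
    """Return the receiver of the first route whose matchers all apply."""
    for route in routes:
        matchers = route.get("match", {})
        if all(labels.get(name) == value for name, value in matchers.items()):
            return route["receiver"]
    return None  # in Alertmanager the parent route's receiver would be used


# The two routes from the slide, reduced to receiver and matchers.
ROUTES = [
    {"receiver": "opsgenie_oncall", "match": {"oncall": "true"}},
    {"receiver": "opsgenie"},
]

print(pick_receiver({"alertname": "HTTPProbeFailedMajor", "oncall": "true"}, ROUTES))
print(pick_receiver({"alertname": "MongoDB-ScannedObjects", "priority": "P3"}, ROUTES))
```

The order matters: with the catch-all first, no alert would ever reach `opsgenie_oncall`.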
...and its receivers
- name: "opsgenie_oncall"
  opsgenie_configs:
    - api_url: "https://api.eu.opsgenie.com/"
      api_key: "ourapitoken"
      priority: "{{ range .Alerts }}{{ .Labels.priority }}{{ end }}"
      message: "{{ range .Alerts }}{{ .Annotations.title }}{{ end }}"
      description: "{{ range .Alerts }}\n{{ .Annotations.summary }}\n\n{{ if ne .Annotations.dashboard \"\" -}}\nDashboard:\n{{ .Annotations.dashboard }}\n{{- end }}{{ end }}"
      tags: "{{ range .Alerts }}{{ .Annotations.instance }}{{ end }}"
Why we use group_by[‘...’]
● Alert Deduplication from OpsGenie
● Alerts are being grouped
● Overlook Alerts
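In Alertmanager, the special value `'...'` for `group_by` means "group by all labels", which effectively disables aggregation: every distinct alert becomes its own notification. The Python sketch below contrasts that with grouping by a single label. It is a simplified model with hypothetical names, not Alertmanager's implementation.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Group alerts (label dicts) the way a group_by setting would."""
    groups = defaultdict(list)
    for labels in alerts:
        if group_by == ["..."]:
            # '...' = use every label, so only identical alerts share a group.
            key = tuple(sorted(labels.items()))
        else:
            key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HTTPProbeFailedMajor", "service": "newsfeed"},
    {"alertname": "HTTPProbeFailedMajor", "service": "leaderboard"},
]

print(len(group_alerts(alerts, ["alertname"])))  # grouped into one notification
print(len(group_alerts(alerts, ["..."])))        # one notification per alert
```

With one notification per alert, each failing service shows up as its own OpsGenie alert instead of being hidden inside a group.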
Example Alerting Rule for On Call
- alert: HTTPProbeFailedMajor
  expr: max by(instance,service)(probe_success) < 1
  for: 1m
  labels:
    oncall: "true"
    priority: "P1"
  annotations:
    title: "{{ $labels.service }} DOWN"
    summary: "HTTP Probe for {{ $labels.service }} FAILED.\nHealth Check URL: {{ $labels.instance }}"
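`max by(instance,service)` collapses all `probe_success` series that share the same `instance` and `service`, so the expression is below 1 only when every one of those series reports 0. Assuming the three Blackbox Exporters probe the same URLs and `instance` carries the health check URL (the summary annotation suggests this, but the slides do not show it), one failing exporter alone would not page anyone. A Python sketch of that aggregation, with hypothetical job names and sample values:

```python
def max_by(samples, by):
    """Aggregate (labels, value) samples like PromQL's max by(...)."""
    result = {}
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in by)
        result[key] = max(value, result.get(key, value))
    return result


def failing(samples):
    """Series for which 'max by(instance,service)(probe_success) < 1' holds."""
    aggregated = max_by(samples, ("instance", "service"))
    return [key for key, value in aggregated.items() if value < 1]


NEWSFEED = "https://status.runtastic.com/newsfeed"
LEADERBOARD = "https://status.runtastic.com/leaderboard"

# Hypothetical probe results from three exporters (job names are made up).
samples = [
    ({"job": "blackbox_aws", "instance": NEWSFEED, "service": "newsfeed"}, 0),
    ({"job": "blackbox_b", "instance": NEWSFEED, "service": "newsfeed"}, 1),
    ({"job": "blackbox_c", "instance": NEWSFEED, "service": "newsfeed"}, 1),
    ({"job": "blackbox_aws", "instance": LEADERBOARD, "service": "leaderboard"}, 0),
    ({"job": "blackbox_b", "instance": LEADERBOARD, "service": "leaderboard"}, 0),
    ({"job": "blackbox_c", "instance": LEADERBOARD, "service": "leaderboard"}, 0),
]

print(failing(samples))
```

The sketch leaves out the `for: 1m` clause, which additionally requires the condition to hold for a full minute before the alert fires.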
Example Alerting Rule with Low Priority
- alert: MongoDB-ScannedObjects
  expr: max by(hostname, service)(rate(mongodb_mongod_metrics_query_executor_total[30m])) > 500000
  for: 1m
  labels:
    priority: "P3"
  annotations:
    title: "MongoDB - Scanned Objects detected on {{ $labels.service }}"
    summary: "High value of scanned objects on {{ $labels.hostname }} for service {{ $labels.service }}"
    dashboard: "https://prd-prometheus.runtastic.com/d/oCziI1Wmk/mongodb"
Alert Management via Slack
Setting up the Heartbeat
groups:
- name: opsgenie.rules
  rules:
  - alert: OpsGenieHeartBeat
    expr: vector(1)
    for: 5m
    labels:
      heartbeat: "true"
    annotations:
      summary: "Heartbeat for OpsGenie"
...and its Alertmanager Configuration
- receiver: 'opsgenie_heartbeat'
  repeat_interval: 5m
  group_wait: 10s
  match:
    heartbeat: 'true'

- name: "opsgenie_heartbeat"
  webhook_configs:
    - url: 'https://api.eu.opsgenie.com/v2/heartbeats/prd_prometheus/ping'
      send_resolved: false
      http_config:
        basic_auth:
          password: "opsgenieAPIkey"
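Together, the rule and the route form a dead man's switch: `vector(1)` always returns a sample, so the alert fires permanently, and Alertmanager re-sends it to the heartbeat endpoint every `repeat_interval`. If Prometheus or Alertmanager stops working, the pings stop and the receiving side raises the alarm. The Python sketch below models only the receiving side. The `Heartbeat` class and the 10 minute expiry are assumptions for illustration, not OpsGenie's implementation or the configured value.

```python
from datetime import datetime, timedelta


class Heartbeat:
    """Minimal dead man's switch: expires when pings stop arriving."""

    def __init__(self, interval, started):
        self.interval = interval
        self.last_ping = started

    def ping(self, now):
        """Record a ping, as the webhook call from Alertmanager would."""
        self.last_ping = now

    def expired(self, now):
        """True once no ping has arrived within the allowed interval."""
        return now - self.last_ping > self.interval


start = datetime(2019, 11, 7, 12, 0)
heartbeat = Heartbeat(interval=timedelta(minutes=10), started=start)

heartbeat.ping(start + timedelta(minutes=5))            # regular ping
print(heartbeat.expired(start + timedelta(minutes=12)))  # still within interval
print(heartbeat.expired(start + timedelta(minutes=16)))  # pings stopped too long ago
```

The expiry on the receiving side has to be longer than `repeat_interval`, otherwise a single delayed ping would already trigger an alert.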
CI/CD Pipeline
Goals for our Pipeline
● Put all Alerting and Recording Rules into a Git Repository
● Automatically test for syntax errors
● Deploy master branch on all Prometheus servers
● Merge to master -> Deploy on Prometheus
How it works
● Jenkins
  ○ running promtool against each .yml file
● Bitbucket sending HTTP calls when master branch changes
● Ruby based HTTP Handler on Prometheus Servers
  ○ Accepting HTTP calls from Bitbucket
  ○ Git pull
  ○ Prometheus reload
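The flow above can be sketched in a few lines. The original handler is Ruby based and not shown in the slides, so this is a Python stand-in under stated assumptions: the repository path, the reload URL and all function names are hypothetical. `promtool check rules` is the rule syntax check, and reloading via `POST /-/reload` only works when Prometheus runs with `--web.enable-lifecycle` (sending SIGHUP is the alternative).

```python
import subprocess
from http.server import BaseHTTPRequestHandler

REPO_DIR = "/etc/prometheus/rules"              # hypothetical checkout path
RELOAD_URL = "http://localhost:9090/-/reload"   # needs --web.enable-lifecycle


def lint_commands(rule_files):
    """CI step: one promtool call per rule file."""
    return [["promtool", "check", "rules", path] for path in rule_files]


def deploy_commands(repo_dir, reload_url):
    """Deploy step: update the checkout, then ask Prometheus to reload."""
    return [
        ["git", "-C", repo_dir, "pull", "--ff-only"],
        ["curl", "--fail", "-X", "POST", reload_url],
    ]


def deploy(run=subprocess.run):
    """Run the deploy commands; 'run' is injectable so it can be tested."""
    for command in deploy_commands(REPO_DIR, RELOAD_URL):
        run(command, check=True)


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts the HTTP call sent when the master branch changes."""

    def do_POST(self):
        try:
            deploy()
            self.send_response(204)
        except subprocess.CalledProcessError:
            self.send_response(500)  # pull or reload failed
        self.end_headers()
```

To serve it, pass the handler to `http.server.HTTPServer` with a port of your choice and call `serve_forever()`. A real handler should also authenticate the caller before pulling.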
Verify Builds for each Branch