Alertmanager - PromCon
![Page 1: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/1.jpg)
Alertmanager and high availability
Frederic Branczyk, Software Engineer at CoreOS
Prometheus/Alertmanager/Kubernetes, @brancz
![Page 2: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/2.jpg)
Where does CoreOS fit in?
● Automating Monitoring infrastructure
● Prometheus + Kubernetes
![Page 3: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/3.jpg)
What will I be talking about?
● From alert to notification
● High availability contract
● High availability implementation
● Implications for operating HA Alertmanager
![Page 4: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/4.jpg)
Alertmanager Features
● Receives and groups alerts
● Deduplicates alerts
● Sends notifications to providers
○ PagerDuty, email, Slack, etc.
● Silencing
![Page 5: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/5.jpg)
Prometheus & Alertmanager
![Page 6: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/6.jpg)
Alerting Rule Alerting Rule Alerting Rule Alerting Rule...
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/profile, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighLatency, service="X", zone="eu-west", path=/user/settings, method=GET
04:11 hey, HighErrorRate, service="X", zone="eu-west", path=/user/settings, method=POST
04:12 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=GET
04:13 hey, HighLatency, service="X", zone="eu-west", path=/index, method=POST
04:13 hey, CacheServerSlow, service="X", zone="eu-west", path=/user/profile, method=POST
. . .
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/comments, method=GET
04:15 hey, HighErrorRate, service="X", zone="eu-west", path=/user/profile, method=POST
![Page 7: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/7.jpg)
Grouped in one notification
● 3 x HighLatency
● 10 x HighErrorRate
● 2 x CacheServerSlow
● (+individual Alerts)
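The grouping step can be illustrated with a minimal Python sketch. It is not Alertmanager's code: the dictionary representation of an alert and the choice of `service` and `zone` as grouping labels are assumptions made for the example.

```python
from collections import Counter, defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def summarize(group):
    """Count the alerts of one group per alert name."""
    return Counter(alert["alertname"] for alert in group)


# A few of the alerts from the previous slide, written as label sets.
alerts = [
    {"alertname": "HighLatency", "service": "X", "zone": "eu-west",
     "path": "/user/profile", "method": "GET"},
    {"alertname": "HighLatency", "service": "X", "zone": "eu-west",
     "path": "/user/settings", "method": "GET"},
    {"alertname": "HighErrorRate", "service": "X", "zone": "eu-west",
     "path": "/user/settings", "method": "POST"},
    {"alertname": "CacheServerSlow", "service": "X", "zone": "eu-west",
     "path": "/user/profile", "method": "POST"},
]

# Assumed grouping labels: every alert above lands in one notification.
groups = group_alerts(alerts, group_by=["service", "zone"])
for key, members in groups.items():
    print(key, dict(summarize(members)))
```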
![Page 8: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/8.jpg)
Boiled down:
Alertmanager reliably sends notifications
![Page 9: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/9.jpg)
High Availability
![Page 10: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/10.jpg)
Infrastructure Scaling Story
[Diagram: two Prometheus servers, two Alertmanager instances connected by Gossip, and two sets of Microservice 1, Microservice 2, Microservice 3, ...]
![Page 11: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/11.jpg)
Why decoupled?
● Keep Prometheus alerting simple
● High availability of Prometheus
● No state sharing between Prometheus servers
![Page 12: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/12.jpg)
Example Alerting Rule
ALERT NoLeader
  IF etcd_has_leader == 0
  FOR 10m
  LABELS {
    severity = "warning"
  }
  ANNOTATIONS {
    summary = "etcd no leader",
    description = "etcd instance has no leader",
  }
![Page 13: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/13.jpg)
Alert Evaluation in Prometheus
Rule 1
Rule 2
Rule 3
...
● Evaluate Rule/Alert
● Fire alert against Alertmanager
Repeat in *rule evaluation interval*
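How a rule with a FOR clause moves from pending to firing over repeated evaluations can be sketched as follows. This is a simplified Python model, not Prometheus code; the class name `AlertState` and the use of plain numbers of seconds as timestamps are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """State of one alert across rule evaluations (hypothetical model)."""
    for_seconds: float                    # the FOR duration of the rule
    active_since: Optional[float] = None  # first evaluation where the condition held

    def evaluate(self, condition_true: bool, now: float) -> str:
        if not condition_true:
            # The condition no longer holds, so the alert is reset.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"   # held for the full FOR duration
        return "pending"      # holds, but not yet long enough


state = AlertState(for_seconds=600)  # FOR 10m
for now, has_no_leader in [(0, True), (300, True), (600, True), (660, False)]:
    print(now, state.evaluate(has_no_leader, now))
```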
![Page 14: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/14.jpg)
Simple configuration
● Resolve alerts in 5m
● Group by job label
● Group for 10 seconds
● Send via webhook receiver
global:
  resolve_timeout: 5m
route:
  group_by: ['job']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
![Page 15: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/15.jpg)
Notification Pipeline
● Silence: Do not continue
● Wait: Position in cluster multiplied by 5 seconds
● Dedup: Has notification already been sent?
● Send: Send notification via favorite provider
● Gossip: Tell other peers notification has been sent
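The five pipeline stages can be modeled in a few lines of Python. This is a rough sketch under simplifying assumptions, not Alertmanager's implementation: the `Instance` class, the string group key, and gossip as a direct write into each peer's log are all hypothetical, and the actual waiting is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One Alertmanager peer (hypothetical model)."""
    position: int                                      # position in the cluster
    silenced: set = field(default_factory=set)         # silenced group keys
    notification_log: set = field(default_factory=set)
    peers: list = field(default_factory=list)
    sent: list = field(default_factory=list)

    def wait_seconds(self) -> int:
        # Wait: position in cluster multiplied by 5 seconds.
        return self.position * 5

    def process(self, group_key: str) -> str:
        if group_key in self.silenced:
            return "silenced"        # Silence: do not continue
        # A real instance would wait self.wait_seconds() here.
        if group_key in self.notification_log:
            return "deduplicated"    # Dedup: already sent
        self.sent.append(group_key)  # Send via the provider
        self.notification_log.add(group_key)
        for peer in self.peers:      # Gossip: tell other peers
            peer.notification_log.add(group_key)
        return "sent"


am0, am1 = Instance(position=0), Instance(position=1)
am0.peers, am1.peers = [am1], [am0]
print(am0.process("job=etcd"), am1.process("job=etcd"))
```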
![Page 16: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/16.jpg)
What is gossiped?
● Yes
○ Sent notifications
○ Silences
● No
○ Received alerts
![Page 17: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/17.jpg)
How? CRDTs!
● Conflict-free replicated data type
● Associativity (a+(b+c)=(a+b)+c)
● Commutativity (a+b=b+a)
● Idempotence (a+a=a)
● Well suited for AP systems
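Set union is a simple merge function that satisfies all three properties. The Python sketch below checks them on small example sets; it illustrates the laws only and is not the data type Alertmanager uses.

```python
def merge(a: frozenset, b: frozenset) -> frozenset:
    """Merge two replicas of a grow-only set by taking their union."""
    return a | b


a, b, c = frozenset({1}), frozenset({2}), frozenset({3})

# Associativity: a+(b+c) = (a+b)+c
assert merge(a, merge(b, c)) == merge(merge(a, b), c)
# Commutativity: a+b = b+a
assert merge(a, b) == merge(b, a)
# Idempotence: a+a = a
assert merge(a, a) == a
```

Because of these properties, replicas can exchange updates in any order and any number of times and still converge on the same state.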
![Page 18: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/18.jpg)
Yes, but how? mesh by Weaveworks!
● Eventually consistent
● LWW-element-set
● Mergeable log of records
● Merges based on UID
○ On conflict latest timestamp wins
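The "merge by UID, latest timestamp wins" rule can be sketched in Python as below. The record layout with a `ts` field is an assumption for illustration. The sketch covers only the conflict rule, not the mesh library's API or a full LWW-element-set, and it keeps the local record on equal timestamps, which a real implementation would need to resolve deterministically.

```python
def merge(local: dict, delta: dict) -> dict:
    """Merge a gossiped delta into local state, keyed by UID."""
    merged = dict(local)
    for uid, record in delta.items():
        current = merged.get(uid)
        # Unknown UID, or the incoming record is newer: take it.
        if current is None or record["ts"] > current["ts"]:
            merged[uid] = record
    return merged


# Hypothetical silence records; "ts" is the last-update timestamp.
local = {
    1: {"query": "Query", "start": "Start", "end": "End", "ts": 1},
    2: {"query": "Query", "start": "Start", "end": "End", "ts": 1},
}
delta = {1: {"query": "Query", "start": "Start1", "end": "End", "ts": 2}}

merged = merge(local, delta)
print(merged[1]["start"])
```

Applying the same delta a second time leaves the state unchanged, which is the idempotence property in practice.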
![Page 19: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/19.jpg)
Why not etcd?
● Simple operation
○ Fewer moving pieces
○ Single binary
● Want: AP not CP
![Page 20: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/20.jpg)
Silences
![Page 21: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/21.jpg)
Create Silences
Create Silence → Alertmanager 0
Alertmanager 0: Silences Database
ID  Values
1   Query, Start, End
2   Query, Start, End
Alertmanager 1: Silences Database
ID  Values
1   Query, Start, End
2   Query, Start, End
Gossip Delta (ID: 2 ...) → Merge Gossip Data
![Page 22: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/22.jpg)
Update Silences
Update Silence (UID: 1, Start: Start1) → Alertmanager 0
Alertmanager 0: Silences Database
ID  Values
1   Query, Start, End
2   Query, Start, End
Alertmanager 1: Silences Database
ID  Values
1   Query, Start, End
2   Query, Start, End
Gossip Delta (ID: 1, Start: Start1) → Merge Gossip Data
Result on both instances: 1 Query, Start1, End
![Page 23: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/23.jpg)
Notification Log
![Page 24: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/24.jpg)
Non silenced alert example
Prometheus → Alertmanager 0 and Alertmanager 1
Alertmanager 0
● Wait 0s
● Dedup: Not sent → Send
● Gossip
Alertmanager 1
● Wait 5s
● Receive Gossip Data
● Deduplicate → Do not send
![Page 25: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/25.jpg)
Gossip Partition
Prometheus → Alertmanager 0 and Alertmanager 1
Alertmanager 0
● Wait 0s
● Dedup: Not sent → Send
● Gossip (blocked by Network Partition)
Alertmanager 1
● Wait 5s
● Dedup: Not sent → Send
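The difference between the healthy case and the partitioned case can be reproduced with a small Python simulation. It is a toy model, not Alertmanager code: the group key is hypothetical, and gossip is modeled as an instantaneous write into every instance's notification log.

```python
def simulate(partitioned: bool) -> list:
    """Return the positions of the instances that sent a notification."""
    group = "alert-group"            # hypothetical group key
    logs = [set(), set()]            # notification logs of Alertmanager 0 and 1
    senders = []
    for position in (0, 1):          # position 0 waits 0s, position 1 waits 5s
        if group in logs[position]:  # Dedup: a peer has already sent
            continue
        senders.append(position)     # Send
        logs[position].add(group)
        if not partitioned:          # Gossip reaches peers only without a partition
            for log in logs:
                log.add(group)
    return senders


print(simulate(partitioned=False))
print(simulate(partitioned=True))
```

In the partitioned run both instances send, so the notification arrives twice rather than not at all, which matches the AP trade-off described earlier in the talk.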
![Page 26: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/26.jpg)
Notification Log
Alert Firing → Alertmanager 0
Alertmanager 0: Notification Log
UID  Values
1    Resolve, Notify, TS, ...
2    Resolve, Notify, TS, ...
Alertmanager 1: Notification Log
UID  Values
1    Resolve, Notify, TS, ...
2    Resolve, Notify, TS, ...
Gossip Delta (UID: 2 ...) → Merge Gossip Data
![Page 27: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/27.jpg)
Group Key
● Group at runtime
○ By group_by labels
● XOR with Route
● Concat with Receiver
global:
  resolve_timeout: 5m
route:
  group_by: ['job']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
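One way to read the three bullets on this slide is sketched below in Python. Everything concrete here is an assumption: the SHA-256 based fingerprint, the string used to identify the route, and the `:` separator are illustrative choices, not Alertmanager's actual key format.

```python
import hashlib


def fingerprint(text: str) -> int:
    """64-bit fingerprint of a string (hypothetical hash choice)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def group_key(labels: dict, group_by: list, route: str, receiver: str) -> str:
    # Group at runtime by the values of the group_by labels.
    group_labels = ",".join(
        f"{name}={labels.get(name, '')}" for name in sorted(group_by)
    )
    # XOR the label fingerprint with the route fingerprint.
    mixed = fingerprint(group_labels) ^ fingerprint(route)
    # Concatenate with the receiver name.
    return f"{mixed:016x}:{receiver}"


# Labels other than the group_by labels do not affect the key.
key_a = group_key({"job": "etcd", "instance": "a"}, ["job"], "root", "webhook")
key_b = group_key({"job": "etcd", "instance": "b"}, ["job"], "root", "webhook")
print(key_a == key_b)
```

Because each instance derives the key from the alert labels and its configuration rather than from shared state, peers running the same configuration compute the same key for the same group and can deduplicate against the gossiped notification log.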
![Page 28: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/28.jpg)
DEMO!
![Page 29: Alertmanager - PromCon](https://reader031.fdocuments.net/reader031/viewer/2022020702/61faa6f01966d452a21e62f0/html5/thumbnails/29.jpg)
GitHub: @brancz
Twitter: @fredbrancz
QUESTIONS?
Thanks!
We’re hiring: coreos.com/careers
Let’s talk!
#prometheus on Freenode
More events: coreos.com/community
LONGER CHAT?
also in Berlin!