Alertmanager crash tolerance and missing notifications
26th May, 2023
The Prometheus Alertmanager is responsible for receiving alerts, deduplicating and grouping them, and sending notifications about firing and resolved alerts to a number of configured receivers, such as email or Slack. However, despite being a well-known and widely used open source project, it seems to me that Alertmanager is not well understood. One example of this is the promise Alertmanager makes around delivering notifications for alerts it has received, and the reports from users of missing resolved notifications.
Alertmanager is generally regarded as having at-least-once behaviour: it might deliver the same notification multiple times, and possibly only after some unbounded amount of time, but a notification will be delivered. However, what happens if Alertmanager crashes or restarts in between receiving an alert and sending the notification? Before we can answer this question, let's first look at how Alertmanager works.
How it works
Upon receiving a firing or resolved alert, Alertmanager routes the alert through the top-level route in the configuration file. You can think of it as the incoming route for all alerts. If the top-level route has child routes, the alert is passed through each child route until the first match (an alert can match multiple routes if continue is set to true). If no child routes match, the matching route is the top-level route.
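As an illustration, here is a minimal sketch of what such a routing tree might look like; the receiver names, matchers and label values below are invented for this example rather than taken from any real configuration:
route:
  receiver: default
  routes:
    - receiver: pager
      matchers:
        - severity = critical
      continue: true
    - receiver: team-a
      matchers:
        - team = a
An alert with labels {severity=critical, team=a} matches both child routes because continue is set to true on the first; an alert with neither label matches no child route and is therefore handled by the top-level route and its default receiver.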
For each matching route, alerts are grouped together into aggregation groups based on the labels in the route's group_by. For example, if a route has a group_by of [foo, bar], then all alerts with labels {foo=bar, bar=baz} form one aggregation group and all alerts with labels {foo=baz, bar=qux} form another. Once alerts have been put into aggregation groups, they wait to be "flushed". Flushing happens at a regular interval, configurable with group_interval. When an aggregation group is flushed, its alerts are copied and sent through the notification pipeline. Aggregation groups are not written to disk; they are kept only in memory on the Alertmanager.
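Sticking with the same sketch, grouping behaviour is configured per route. The timer values here are illustrative, not recommendations:
route:
  receiver: default
  group_by: [foo, bar]
  group_wait: 30s      # how long a new aggregation group waits before its first flush
  group_interval: 5m   # how often the aggregation group is flushed after that
  repeat_interval: 4h  # minimum time before a notification for the same group is repeated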
Once an alert has passed inhibition and silencing, it is sent to each integration in the matching route's receiver via the Fanout and Multi stages. The Fanout stage creates a separate sub-pipeline for each integration, and sends the alert through each sub-pipeline's Wait, Dedup, Retry and Set notifies stages. The Wait stage controls failover of notifications when running Alertmanager in HA, and is not important here. The Dedup stage checks whether a notification should be sent for this flush of the aggregation group. The Retry stage is where the notification is sent, and retried on failure. The final stage, Set notifies, updates something called the notification log.
The notification log tracks when the Alertmanager last sent a notification for each (aggregation_group, receiver, integration), the alerts that were sent in that notification, and which of those alerts were firing and which were resolved. It is written to disk so that (a) Alertmanager can tolerate crashes and restarts, and (b) Alertmanagers participating in HA can gossip their notification logs to each other to prevent duplicate notifications when running redundant Alertmanagers.
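Conceptually, an entry in the notification log records something like the following. This is a simplified illustration of the information it holds, not the actual on-disk encoding:
group_key: "{}:{foo=\"bar\"}"
receiver: { group_name: test, integration: webhook, idx: 0 }
timestamp: 2023-05-26T14:20:26Z              # when the last notification for this entry was sent
firing_alerts: ["<hash of alert {foo=bar}>"]  # alerts sent as firing in that notification
resolved_alerts: []                           # alerts sent as resolved in that notification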
So what happens?
Let's come back to the question of what happens if Alertmanager crashes or restarts in between receiving an alert and sending the notification. We know that Alertmanager keeps aggregation groups in memory and does not write them to disk. However, Alertmanager does write the notification log to disk. As we will see in a minute, this is not sufficient to tolerate crashes, because the notification log does not know whether there are firing or resolved alerts that have been received but not yet notified. All it knows is the last notification sent for each (aggregation_group, receiver, integration), the alerts that were sent in that notification, and which of those alerts were firing and which were resolved.
Normal operation
Let's see what happens under normal conditions without crashes or restarts.
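The experiments that follow use a deliberately small configuration along these lines. This is a reconstruction from the log output and the timers mentioned in the text; the group_interval value and the webhook URL are assumptions, and the real config.yml is not shown here:
route:
  receiver: test
  group_by: [foo]
  group_wait: 30s       # matches the 30 second wait seen before the first flush
  group_interval: 1m    # assumed; the exact value is not shown
  repeat_interval: 1m   # as mentioned later in the post
receivers:
  - name: test
    webhook_configs:
      - url: http://127.0.0.1:8080   # assumed local webhook listener that logs request bodies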
1. An alert is sent to Alertmanager using cURL:
curl -H "Content-Type: application/json" http://127.0.0.1:9093/api/v2/alerts -d '[{"labels": {"foo":"bar"}}]'
2. Alertmanager receives the alert. Then 30 seconds later (group_wait) the aggregation group is flushed and a notification is sent:
ts=2023-05-26T14:19:56.446Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-05-26T14:20:26.447Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{foo=\"bar\"}" msg=flushing alerts=[[3fff2c2][active]]
ts=2023-05-26T14:20:26.450Z caller=notify.go:756 level=debug component=dispatcher receiver=test integration=webhook[0] msg="Notify success" attempts=1
3. The notification is received:
2023/05/26 14:20:26 {"receiver":"test","status":"firing","alerts":[{"status":"firing","labels":{"foo":"bar"},"annotations":{},"startsAt":"2023-05-26T14:19:56.446942+01:00","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"3fff2c2d7595e046"}],"groupLabels":{"foo":"bar"},"commonLabels":{"foo":"bar"},"commonAnnotations":{},"externalURL":"http://Georges-Air.fritz.box:9093","version":"4","groupKey":"{}:{foo=\"bar\"}","truncatedAlerts":0}
Crash before sending notification for firing alert
Let's see what happens when Alertmanager crashes or restarts between receiving a firing alert and sending the notification.
1. Like before, an alert is sent to Alertmanager using cURL:
curl -H "Content-Type: application/json" http://127.0.0.1:9093/api/v2/alerts -d '[{"labels": {"foo":"bar"}}]'
2. Alertmanager receives the alert, but just a couple of seconds later the Alertmanager crashes:
ts=2023-05-26T14:25:15.985Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
zsh: killed ./alertmanager --config.file=config.yml --log.level=debug
3. The Alertmanager is restarted:
ts=2023-05-26T14:25:29.767Z caller=main.go:241 level=info msg="Starting Alertmanager" version="(version=0.25.0, branch=main, revision=5adc7369c838c31fcbaa7d413951a2dc01ae87ae)"
We wait, but no firing notification is sent. This happens because the aggregation group was only kept in memory, so following a crash or restart the aggregation group is lost, and it will not be re-created until the client sends the alert to the Alertmanager a second time (for example, after the next evaluation interval). In most cases this is not an issue, because clients (such as Prometheus and Grafana Managed Alerts) re-send all firing alerts to the Alertmanager at regular intervals to prevent them from being marked as resolved (which happens when the EndsAt time elapses).
4. The alert is re-sent to Alertmanager using cURL, is flushed 30 seconds later and a notification is sent:
ts=2023-05-26T14:26:55.377Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-05-26T14:27:25.377Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{foo=\"bar\"}" msg=flushing alerts=[[3fff2c2][active]]
ts=2023-05-26T14:27:25.382Z caller=notify.go:756 level=debug component=dispatcher receiver=test integration=webhook[0] msg="Notify success" attempts=1
Crash before sending notification for resolved alert
Let's instead see what happens when Alertmanager crashes or restarts between receiving a resolved alert and sending the notification.
1. Like before, an alert is sent to Alertmanager using cURL:
curl -H "Content-Type: application/json" http://127.0.0.1:9093/api/v2/alerts -d '[{"labels": {"foo":"bar"}}]'
2. Alertmanager receives the alert. Then 30 seconds later (group_wait) the aggregation group is flushed and a notification is sent:
ts=2023-05-26T14:32:24.921Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2023-05-26T14:32:54.922Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{foo=\"bar\"}" msg=flushing alerts=[[3fff2c2][active]]
ts=2023-05-26T14:32:54.923Z caller=notify.go:756 level=debug component=dispatcher receiver=test integration=webhook[0] msg="Notify success" attempts=1
3. The alert is resolved using cURL:
curl -H "Content-Type: application/json" http://127.0.0.1:9093/api/v2/alerts -d '[{"labels": {"foo":"bar"}, "endsAt": "2023-05-26T13:34:00+01:00"}]'
4. The resolved alert is received, but just a couple of seconds later the Alertmanager crashes again:
ts=2023-05-26T14:34:27.798Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][resolved]
zsh: killed ./alertmanager --config.file=config.yml --log.level=debug
5. The Alertmanager is restarted:
ts=2023-05-26T14:34:42.948Z caller=main.go:241 level=info msg="Starting Alertmanager" version="(version=0.25.0, branch=main, revision=5adc7369c838c31fcbaa7d413951a2dc01ae87ae)"
We wait, but no notifications are sent. Both firing and resolved notifications for this alert have stopped, despite a repeat_interval of 1m. Ten minutes later, still no notifications have been sent.
6. 10 minutes after restarting Alertmanager, we choose to send the resolved alert using cURL one more time:
curl -H "Content-Type: application/json" http://127.0.0.1:9093/api/v2/alerts -d '[{"labels": {"foo":"bar"}, "endsAt": "2023-05-26T13:34:00+01:00"}]'
7. The resolved alert is received once more. However, this time the aggregation group is flushed and a notification is sent:
ts=2023-05-26T14:34:42.948Z caller=main.go:241 level=info msg="Starting Alertmanager" version="(version=0.25.0, branch=main, revision=5adc7369c838c31fcbaa7d413951a2dc01ae87ae)"
ts=2023-05-26T14:34:52.952Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.0017855s
...
ts=2023-05-26T14:45:10.973Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][resolved]
ts=2023-05-26T14:45:10.974Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}:{foo=\"bar\"}" msg=flushing alerts=[[3fff2c2][resolved]]
ts=2023-05-26T14:45:10.975Z caller=notify.go:756 level=debug component=dispatcher receiver=test integration=webhook[0] msg="Notify success" attempts=1
What happened here?
Like in the previous example where Alertmanager crashed before sending a firing notification, the resolved alert is lost because aggregation groups are kept only in memory. The notification log, which is persisted to disk, shows that the last notification sent for this alert was a firing notification. However, Alertmanager does not use the notification log to recreate aggregation groups following a crash or restart.
This is an issue if clients (such as Grafana Managed Alerts) send resolved alerts to Alertmanager just once, when the alert changes from firing to resolved. If the Alertmanager crashes or restarts between receiving the alert and sending the notification, then a notification for this resolved alert will never be sent. To prevent this from happening, the client must send Alertmanager all resolved alerts at regular intervals, and for an indeterminate period of time, in order to guarantee at-least-once notifications. Indeterminate, because the client does not know whether a notification was ever sent.
Summary
You must send Alertmanager all firing and resolved alerts at regular intervals. It is not sufficient to send Alertmanager a firing or resolved alert just once, when the alert starts firing or is resolved, because Alertmanager makes no guarantee that a notification will be sent once an alert has been received. In other words, an individual Alertmanager process cannot guarantee at-least-once notifications in a fail-stop model.
This situation can be somewhat mitigated by running Alertmanager in HA. Here the chances of all replicas crashing or restarting before a notification has been sent are low, but not zero. For example, consider a Kubernetes rolling release in which all pods are restarted within the group_wait or group_interval timers. I will leave that example as an exercise to think about.