Group wait, Group interval and Repeat interval explained
1st June, 2023
There seems to be a lot of confusion around the Group wait, Group interval and Repeat interval timers in Prometheus Alertmanager. The Prometheus documentation explains them as follows:
# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]
However, although this explanation seems straightforward, how these timers actually work is poorly understood. Today we will look in-depth at how Alertmanager works in order to fully understand the Group wait, Group interval and Repeat interval timers.
How it works
The Alertmanager is responsible for receiving alerts, de-duplicating and grouping them, and sending notifications about firing and resolved alerts to one or more configured receivers, such as email or Slack.
Upon receiving an alert, Alertmanager routes the alert through the top-level route in the configuration file. You can think of it as the incoming route for all alerts. If the top-level route has child routes, the alert is passed through each child route until the first match (an alert can match multiple routes if continue is set to true). If no child routes match, the matching route is the top-level route.
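To make the routing rules above more concrete, here is a minimal sketch in Python. It is not Alertmanager's implementation: the Route class, its fields, the receiver names and the labels are all hypothetical, and matching is reduced to simple label equality.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    """Hypothetical, simplified model of a route (not Alertmanager's own type)."""
    receiver: str
    match: dict = field(default_factory=dict)    # label equality matchers only
    continue_: bool = False                      # stands in for `continue`
    routes: list = field(default_factory=list)   # child routes

    def matches(self, labels: dict) -> bool:
        # A route with no matchers (like the top-level route) matches everything.
        return all(labels.get(k) == v for k, v in self.match.items())


def matching_routes(route: Route, labels: dict) -> list:
    """Return the routes an alert with these labels ends up in."""
    if not route.matches(labels):
        return []
    found = []
    for child in route.routes:
        matched = matching_routes(child, labels)
        found.extend(matched)
        # Stop at the first matching child unless it sets continue.
        if matched and not child.continue_:
            break
    # If no child route matched, the current route is the matching route.
    return found or [route]


root = Route(receiver="default", routes=[
    Route(receiver="team-a", match={"team": "a"}, continue_=True),
    Route(receiver="audit", match={"severity": "critical"}),
])

print([r.receiver for r in matching_routes(root, {"team": "a", "severity": "critical"})])
# → ['team-a', 'audit']
print([r.receiver for r in matching_routes(root, {"team": "b"})])
# → ['default']
```

The first alert matches two routes because the first child sets continue; the second alert matches no child route, so it falls back to the top-level route.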
For each matching route, alerts are grouped together into aggregation groups based on the labels in the group_by for the route. For example, if a route has a group_by of [foo] then all alerts with the label foo=bar are grouped together into one aggregation group and all alerts with the label foo=baz are grouped together into another aggregation group. On the other hand, if all alerts have the label foo=bar then these alerts are grouped together in the same aggregation group instead of two separate aggregation groups as in the first example.
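The grouping described above can be sketched as a dictionary keyed on the values of the group_by labels. This is only an illustration: the function names and alert names are made up, and treating a missing label as an empty value is an assumption of the sketch.

```python
from collections import defaultdict


def group_key(labels: dict, group_by: list) -> tuple:
    """Build an aggregation-group key from the group_by labels only."""
    # Assumption: a label that is missing from the alert counts as empty.
    return tuple((name, labels.get(name, "")) for name in group_by)


def aggregate(alerts: list, group_by: list) -> dict:
    """Put each alert (a dict of labels) into its aggregation group."""
    groups = defaultdict(list)
    for labels in alerts:
        groups[group_key(labels, group_by)].append(labels)
    return dict(groups)


# Hypothetical alerts, grouped by [foo] as in the example above.
alerts = [
    {"alertname": "HighLatency", "foo": "bar"},
    {"alertname": "HighErrors", "foo": "bar"},
    {"alertname": "DiskFull", "foo": "baz"},
]
groups = aggregate(alerts, ["foo"])
for key, members in groups.items():
    print(key, [a["alertname"] for a in members])
```

Labels that are not listed in group_by (here alertname) play no part in the key, which is why two different alerts with foo=bar share one aggregation group.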
Once alerts have been put into aggregation groups, the alerts wait to be "flushed". When an aggregation group is flushed, its alerts are copied and sent through the notification pipeline. The notification pipeline is responsible for sending notifications about firing and resolved alerts in each aggregation group to all integrations in the matching route's receiver.
This flushing is where Group wait and Group interval come in.
Group wait
Group wait is the time between Alertmanager creating a new aggregation group and its first flush. To better understand this let's look at an example.
Here we have an Alertmanager configuration containing a route with a Group wait of 1m, a Group interval of 5m and a Repeat interval of 4h; grouped by the labels [foo] such that all alerts with the label foo=bar are grouped together into one aggregation group and all alerts with the label foo=baz are grouped together into another aggregation group.
route:
  receiver: email
  group_by: [foo]
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 4h
If an alert is sent to Alertmanager containing the label foo=bar, and there are no other alerts with the label foo=bar, Alertmanager will create a new aggregation group for foo=bar and add the alert to the aggregation group. It will wait 1 minute to allow for other alerts with the label foo=bar to arrive before flushing the aggregation group and sending the first notification.
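As a rough illustration of this example, the sketch below computes when the first flush happens and how many alerts it contains. Times are in seconds, the arrival times are invented, and the function is hypothetical. It only models the rule that Group wait starts when the aggregation group is created.

```python
GROUP_WAIT = 60  # seconds, matching group_wait: 1m


def first_flush(arrivals: list) -> tuple:
    """arrivals: times (in seconds) at which alerts for one new group arrive.

    Returns (time of the first flush, number of alerts in that flush).
    """
    created = min(arrivals)            # the first alert creates the group
    flush_at = created + GROUP_WAIT    # Group wait starts at creation
    included = sum(1 for t in arrivals if t <= flush_at)
    return flush_at, included


print(first_flush([0]))          # → (60, 1): nothing else arrives
print(first_flush([0, 20, 45]))  # → (60, 3): two more alerts arrive in time
print(first_flush([0, 90]))      # → (60, 1): the second alert misses the flush
```

In the last case the late alert is not lost; it is simply too late for the first flush.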
Group interval
Group interval is similar to Group wait. However, where Group wait is the time between Alertmanager creating a new aggregation group and its first flush, Group interval is the time between successive flushes for existing aggregation groups. To understand this better let's use the same example as we used to explain Group wait.
Alertmanager receives an alert with labels {foo=bar, bar=baz} and matches it against the route (remember this route is grouped by the labels [foo]). There are no other alerts with the label foo=bar so the Alertmanager creates a new aggregation group and starts a timer for Group wait. No other alerts arrive within Group wait so the Alertmanager flushes the group, sends the alerts through the notification pipeline and starts a timer for Group interval.
The Alertmanager then receives another alert with labels {foo=bar, bar=qux} and matches it against the same route. However, there is an existing aggregation group for foo=bar. Alertmanager adds the alert {foo=bar, bar=qux} to the existing aggregation group such that the aggregation group now contains both {foo=bar, bar=baz} and {foo=bar, bar=qux}. The aggregation group will be flushed once again after Group interval. However, note that Group interval is reset after each flush, not each time an alert is added to or removed from an aggregation group.
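To see why adding an alert does not move the next flush, here is a small idealised timeline for the same configuration. The function names are hypothetical, times are in seconds, and the sketch assumes the group stays alive and that a flush takes no time.

```python
GROUP_WAIT = 60        # group_wait: 1m
GROUP_INTERVAL = 300   # group_interval: 5m


def flush_times(created: int, until: int) -> list:
    """Flush times for a group created at `created`, up to `until`."""
    times = []
    t = created + GROUP_WAIT      # first flush after Group wait
    while t <= until:
        times.append(t)
        t += GROUP_INTERVAL       # the timer restarts at each flush
    return times


def next_flush_after(arrival: int, created: int) -> int:
    """First flush that can include an alert added at `arrival`."""
    t = created + GROUP_WAIT
    while t < arrival:
        t += GROUP_INTERVAL
    return t


print(flush_times(0, 900))       # → [60, 360, 660]
print(next_flush_after(120, 0))  # → 360
print(next_flush_after(350, 0))  # → 360
```

An alert added at 120 seconds and one added at 350 seconds are both picked up by the flush at 360 seconds, because the schedule depends only on the previous flush.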
Repeat interval
Repeat interval is different from Group wait and Group interval as it is unrelated to the flushing of aggregation groups. Instead, Repeat interval is checked once an aggregation group has been flushed.
When Alertmanager flushes an aggregation group after either Group wait (if it is a new aggregation group) or Group interval (if it is an existing aggregation group), the alerts are copied and sent into the notification pipeline. If the alerts in the aggregation group have changed since the last flush then a notification is sent. However, if the alerts have not changed since the last flush then Alertmanager checks the Repeat interval. If the time since the last notification is more than the Repeat interval, a notification is sent; otherwise the flush is ignored.
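The decision made at each flush can be summarised in a few lines. This is a sketch of the rule as described above, not Alertmanager's code; the function and parameter names are hypothetical and times are in seconds.

```python
def should_notify(changed: bool, since_last: float, repeat_interval: float) -> bool:
    """Decide, at flush time, whether a notification is sent."""
    if changed:
        # The alerts in the group changed since the last flush.
        return True
    # Unchanged group: only repeat once more than Repeat interval has passed.
    return since_last > repeat_interval


REPEAT_INTERVAL = 4 * 60 * 60  # repeat_interval: 4h

print(should_notify(True, 0, REPEAT_INTERVAL))        # → True
print(should_notify(False, 300, REPEAT_INTERVAL))     # → False
print(should_notify(False, 14700, REPEAT_INTERVAL))   # → True
```

Note that the check only runs when a flush happens, so the flush schedule decides when Repeat interval can take effect.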
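Because Repeat interval is only checked when an aggregation group is flushed, a Repeat interval shorter than the Group interval has no effect: the earliest a notification can be repeated is at the next Group interval. Similarly, a Repeat interval that is not a multiple of Group interval takes effect at the first Group interval after the Repeat interval has elapsed. For example, with a Group interval of 5m and a Repeat interval of 12m, an unchanged group is notified again after 15m.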
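The arithmetic behind this can be sketched by stepping through flushes until the Repeat interval has been exceeded. The function name is hypothetical, times are in seconds, and the sketch assumes an unchanged group and a strict "more than" comparison. Under that comparison a Repeat interval that is an exact multiple of Group interval lands one flush later; whether that happens in a real deployment depends on timestamp details this sketch does not model.

```python
def first_repeat(group_interval: int, repeat_interval: int) -> int:
    """Seconds from a notification to its first repeat, for an unchanged group."""
    elapsed = group_interval             # the next flush after the notification
    while elapsed <= repeat_interval:    # not yet more than Repeat interval
        elapsed += group_interval        # wait for the following flush
    return elapsed


print(first_repeat(300, 120))  # → 300: Repeat interval of 2m behaves like 5m
print(first_repeat(300, 720))  # → 900: Repeat interval of 12m repeats at 15m
```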
Summary
While Group wait, Group interval and Repeat interval might seem quite straightforward at first, understanding how these timers actually work is more complicated, and requires a deeper understanding of how Alertmanager works. I hope that you now have a much better understanding of Alertmanager's Group wait, Group interval and Repeat interval timers. If you still have unanswered questions about these timers I highly recommend reading the source code and experimenting with different configurations to really observe and understand how they work in practice.