How we added support for UTF-8 in Alertmanager

1 April 2024 — 2969 words — approx 15 minutes

Alertmanager 0.27.0 has just been released with support for UTF-8. This means Alertmanager can receive alerts with UTF-8 characters such as 🙂 and words such as こんにちは (Kon'nichiwa) in the names of labels and annotations, match alerts with UTF-8 characters in order to route and group related alerts together, silence them with silences, and do everything else that is expected of Alertmanager.

In this post I will talk about why supporting UTF-8 characters was necessary and how we did it. While most of the changes we made were quite simple given that the Go programming language supports UTF-8 strings, one specific change was much more complicated, requiring a whole new parser and an accompanying compatibility framework to ensure backwards compatibility was maintained as much as possible.

Read more...

How capacity hints work in Go

24 September 2023 — 1804 words — approx 10 minutes

Today we will look at what capacity hints are in Go and how they work for slices and maps. We will see that when creating slices, the Go runtime always allocates the requested capacity. However, when creating maps, the Go runtime uses lazy allocation for small hints and allocates more capacity than hinted for large hints. Hopefully by the end of this page you will have a much better understanding of how both slices and maps work internally and how their memory is allocated and resized by the Go runtime.
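The slice side of this is easy to observe directly; a small sketch (the values here follow from the language spec, while the map behaviour described above is a runtime detail you cannot observe via `cap`):

```go
package main

import "fmt"

func main() {
	// Slices: make allocates exactly the requested capacity up front.
	s := make([]int, 0, 10)
	fmt.Println(len(s), cap(s)) // 0 10

	// Appends within that capacity reuse the same backing array,
	// so no reallocation happens here.
	s = append(s, 1, 2, 3)
	fmt.Println(len(s), cap(s)) // 3 10

	// Maps: the hint is advisory and there is no cap() for maps.
	// For small hints allocation is deferred; for large hints the
	// runtime may reserve more buckets than the hint strictly needs.
	m := make(map[string]int, 1000)
	fmt.Println(len(m)) // 0
}
```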

Read more...

Tracking down data corruption in Alertmanager notifications

16 September 2023 — 769 words — approx 4 minutes

Back in March of this year, a Grafana user opened an issue on GitHub reporting data corruption in their Alertmanager notifications. Sometimes, although not always, text such as "Waltz, bad nymph, for quick jigs vex" would come out as "Waltz, BaD", "Nym", "Wph, For icc Jigs Vex" or some variant thereof.

The user in question happened to be using their own notification templates instead of the default templates. This is not uncommon; in fact, quite the opposite. However, in this case data corruption seemed to be occurring somewhere in the template, and the user suspected that the cause was a single line that looked similar to this:

{{ reReplaceAll "Waltz" "Fox" .Message | title }}

The template itself is quite simple. It replaces all occurrences of "Waltz" with "Fox" in the message and then transforms the result into title case. The issue could be in one of the functions, the piping of data from one function to another, or even in text/template.

Read more...

Best practices for avoiding race conditions in inhibition rules

24 June 2023 — 1099 words — approx 6 minutes

On the surface of it, inhibition rules in the Prometheus Alertmanager seem incredibly simple. You have a rule that, when it fires, inhibits the alerts of one or more other rules. What more could there be to it?

Well, it may surprise you to hear that there are a number of subtle cases where your inhibition rules might not work as you would expect, often due to a race condition between the inhibiting rules and the rules they inhibit. Today we will look at some best practices for avoiding race conditions in inhibition rules that, when followed, will ensure your inhibition rules always work reliably.
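For context, a minimal inhibition rule looks something like the sketch below. The matchers, label names, and values are illustrative only, not taken from any real configuration:

```yaml
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    # Only inhibit when the source and target alerts agree
    # on all of these labels.
    equal: [alertname, cluster]
```

The race conditions discussed in the post arise around the timing of the source alerts relative to the target alerts they are meant to inhibit.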

Read more...

Group wait, Group interval and Repeat interval explained

1 June 2023 — 978 words — approx 5 minutes

There seems to be a lot of confusion around the Group wait, Group interval and Repeat interval timers in Prometheus Alertmanager. The Prometheus documentation explains them as follows:

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]

However, although this explanation seems straightforward, how these timers actually work is poorly understood. Today we will look in depth at how Alertmanager works in order to fully understand the Group wait, Group interval and Repeat interval timers.
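To ground the three timers, here is a sketch of where they sit in an Alertmanager route. The receiver name, grouping labels, and durations are illustrative only:

```yaml
route:
  receiver: ops-slack            # hypothetical receiver name
  group_by: [alertname, cluster]
  group_wait: 30s     # wait before the first notification for a new group
  group_interval: 5m  # wait before notifying about changes to a notified group
  repeat_interval: 4h # wait before re-sending an unchanged notification
```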

Read more...

Alertmanager crash tolerance and missing notifications

26 May 2023 — 1293 words — approx 7 minutes

The Prometheus Alertmanager is responsible for receiving alerts, deduplicating and grouping them, and sending notifications about firing and resolved alerts to a number of configured receivers, such as email or Slack. However, despite being such a well-known and popular open source project with widespread use, it seems to me that Alertmanager is not well understood. One example of this is the promise Alertmanager makes around delivering notifications for alerts it has received, and reports from users of missing resolved notifications.

Alertmanager is generally regarded as having at-least-once behaviour: it might deliver the same notification multiple times, and after some unbounded amount of time, but a notification will be delivered. However, what happens if Alertmanager crashes or restarts in between receiving an alert and sending the notification? Before we can answer this question, let's first look at how Alertmanager works.

Read more...