How we added support for UTF-8 in Alertmanager
1st April, 2024
Alertmanager 0.27.0 has just been released with support for UTF-8. This means Alertmanager can receive alerts with UTF-8 characters such as 🙂 and words such as こんにちは (Kon'nichiwa) in the names of labels and annotations, can match on UTF-8 characters to route and group related alerts together, can silence them, and can do everything else you would expect from Alertmanager.
In this post I will talk about why supporting UTF-8 characters was necessary and how we did it. While most of the changes we made were quite simple given that the Go programming language supports UTF-8 strings, one specific change was much more complicated, requiring a whole new parser and an accompanying compatibility framework to ensure backwards compatibility was maintained as much as possible.
The Prometheus data model
The Prometheus data model has a number of restrictions on labels:
- Label names may contain ASCII letters, numbers, and underscores. They must match the regex `[a-zA-Z_][a-zA-Z0-9_]*`.
- Label names beginning with `__` (two underscores) are reserved for internal use.
- Label values may contain any Unicode characters.
These restrictions mean that labels such as `foo🙂=bar` and `こんにちは=世界` are invalid, because neither `foo🙂` nor `こんにちは` matches the regular expression `[a-zA-Z_][a-zA-Z0-9_]*`.
However, labels such as `foo=🙂bar` and `hello=世界` are valid: while label names must match the regular expression `[a-zA-Z_][a-zA-Z0-9_]*`, label values may contain any Unicode characters.
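To make the restriction concrete, here is a small Go sketch (not Prometheus code) that checks label names against that regular expression:

```go
package main

import (
	"fmt"
	"regexp"
)

// labelNameRE encodes the classic Prometheus restriction on label names.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

func main() {
	for _, name := range []string{"foo", "foo🙂", "こんにちは", "_private"} {
		fmt.Printf("%q valid: %v\n", name, labelNameRE.MatchString(name))
	}
	// "foo" valid: true
	// "foo🙂" valid: false
	// "こんにちは" valid: false
	// "_private" valid: true
}
```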
But this is how Prometheus (and Alertmanager) has worked since the very beginning. What's wrong with it and why does it need to change?
Why does this need to change?
Why does the Prometheus data model need to support labels with Unicode characters in the label name? The answer is to support the OpenTelemetry specification, which does not share Prometheus's restrictions. How exactly OpenTelemetry support will look in Prometheus is uncertain at the time of writing, as it is still being discussed; however, we know that as far as Alertmanager is concerned, UTF-8 characters will need to be permitted in the names of labels and annotations.
So why not just remove the first restriction and say a label name can also contain any Unicode character? Why does this need a blog post? Well, that's exactly what we did. However, doing so caused a big problem for label matchers. Let me explain...
Label matchers
Label matchers are a simple DSL (domain-specific language) based on PromQL that matches labels. For example, the label matcher `{foo="bar"}` matches the label `foo=bar`, and the label matcher `{foo=~"bar|baz"}` matches both `foo=bar` and `foo=baz` because it contains the regular expression `bar|baz`.
You use label matchers to route alerts in your Alertmanager configuration, mute alerts using inhibition rules and silences, and query alerts using the Alertmanager API and web interface. For example:
```yaml
receiver: email
group_by: [foo]
matchers:
  - foo=~bar|baz
```
But what's the problem with supporting UTF-8 label names in label matchers?
A quick introduction to parsers
For Alertmanager to understand that the label matcher `{foo="bar"}` should match the label `foo=bar`, it must first translate the DSL into another representation that it can understand, specifically a set of predicates that it can execute. For example:

```
{ label_name: "foo", operator: Equals, label_value: "bar" }
```

The process of turning a language understandable by a human into another representation understandable by a computer is called parsing.
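As a sketch of what such a predicate might look like in Go (the names here are illustrative, not the actual Alertmanager types):

```go
package main

import "fmt"

// MatchType enumerates the four matcher operators.
type MatchType int

const (
	Equal MatchType = iota
	NotEqual
	Regexp
	NotRegexp
)

// Matcher is the predicate produced by parsing a matcher such as {foo="bar"}.
type Matcher struct {
	Name  string
	Type  MatchType
	Value string
}

// Matches reports whether the matcher accepts the given label value.
// The regex operators are omitted to keep the sketch short.
func (m Matcher) Matches(value string) bool {
	switch m.Type {
	case Equal:
		return m.Value == value
	case NotEqual:
		return m.Value != value
	}
	return false
}

func main() {
	m := Matcher{Name: "foo", Type: Equal, Value: "bar"}
	fmt.Println(m.Matches("bar")) // true
	fmt.Println(m.Matches("baz")) // false
}
```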
There are a number of different kinds of parser, and each is better suited to parsing different kinds of languages. The label matchers DSL is a regular language, which means label matchers can be parsed both by regular expressions and by other kinds of parsers, such as top-down parsers.
The set of rules that determine the syntax for any language, and what kind of language it is, is known as its grammar. We can write grammars using a notation system called BNF (Backus–Naur form).
The parser previously used by Alertmanager relied on a number of regular expressions to extract the `name`, `operator` and `value` from label matchers. It can be hard for parsers that use regular expressions to return meaningful error messages for invalid inputs, as the input either matches the regular expression or it doesn't, but in general there is nothing wrong with using regular expressions to parse regular languages.

In fact, the problem with the regular expression parser was not that it used regular expressions, but that its grammar did not allow UTF-8 characters in the `name` of a label matcher. And while trying to change this, we found a whole other set of problems...
Grammars
The grammar for label matchers can be written in BNF as follows:
```
<expr>        ::= "{" <sequence> "}" | <sequence>
<sequence>    ::= <sequence> "," <matcher> | <matcher> | ""
<matcher>     ::= <label_name> <operator> <label_value>
<label_name>  ::= [a-zA-Z_:][a-zA-Z0-9_:]*
<operator>    ::= "=" | "=~" | "!=" | "!~"
<label_value> ::= .+
```
Each line contains the definition of a symbol. Symbols that reference other symbols are called non-terminals, and symbols that don't are called terminals. Here `<expr>` is a non-terminal because it references `<sequence>` and `<matcher>`, whereas `<label_name>` is a terminal because it does not reference any other symbol.
Let's break down the grammar one line at a time...
- An expression in the label matchers DSL has optional open and close braces, and a sequence.
- A sequence can be recursive, which allows for writing multiple label matchers like `foo=bar,bar=baz`, a single matcher like `foo=bar`, or an empty string.
- A matcher must have a label name, an operator, and a label value.
- A label name must match the regular expression `[a-zA-Z_:][a-zA-Z0-9_:]*`.
- An operator must be one of `=`, `=~`, `!=` or `!~`.
- A label value may contain any Unicode characters or an empty string.
To better understand this, let's take a simple label matcher such as `foo=bar` and parse it using the grammar ourselves:
- The input starts at the `<expr>` non-terminal. It does not have curly braces so it cannot be a `"{" <sequence> "}"`, but it might be a `<sequence>`.
- The input does not contain a comma so it cannot be a `<sequence> "," <matcher>`, and it is not an empty string so it cannot be `""`, but it might be a `<matcher>`.
- The start of the input is a `<label_name>` because `foo` matches the regular expression `[a-zA-Z_:][a-zA-Z0-9_:]*`.
- The next part of the input is an `<operator>` because `=` is in the set of valid operators.
- The final part of the input is a `<label_value>` because it contains only valid Unicode characters.
So how do we know that `foo=bar` is a valid label matcher? Because every character matched a terminal symbol. If any part of it had not, it would not be a valid label matcher.
Parsing ambiguities
What if the label value contains a comma? If we take the label matcher `foo=bar,` as an example, is the comma part of the label value (because it is a valid Unicode character) or is it a separator between two label matchers? This is known as a parsing ambiguity. (Annoyingly, the regular expression parser will either parse `,` as part of the label value or as a separator between two label matchers, depending on whether you call the `ParseMatcher` or `ParseMatchers` function.)
To fix this we can disambiguate the grammar by saying a label value may either be a double quoted string containing any Unicode character, or an unquoted string containing any Unicode character except commas:
```
<label_value> ::= <quoted> | <unquoted>
<quoted>      ::= (\"(\\.|[^\"])*\")
<unquoted>    ::= [^,]+
```
This gives us a new grammar:
```
<expr>        ::= "{" <sequence> "}" | <sequence>
<sequence>    ::= <sequence> "," <matcher> | <matcher> | ""
<matcher>     ::= <label_name> <operator> <label_value>
<label_name>  ::= [a-zA-Z_:][a-zA-Z0-9_:]*
<operator>    ::= "=" | "=~" | "!=" | "!~"
<label_value> ::= <quoted> | <unquoted>
<quoted>      ::= (\"(\\.|[^\"])*\")
<unquoted>    ::= [^,]+
```
But even this grammar still accepts weird and in some cases ambiguous label matchers. For example:
```
foo=bar"     // (foo) (=) ("bar\"")
foo==bar     // (foo) (=) ("=bar")
foo!==bar    // (foo) (!=) ("=bar")
{foo=bar}},} // (foo) (=) ("bar}},")
{foo=,bar=}} // (foo) (=) (""), (bar) (=) ("}")
```
How are any of these valid label matchers? If you follow the grammar you will see that `bar"`, `=bar`, `bar}},` and `,bar=}` are all valid `<unquoted>` terminals. And remember that regular expressions are greedy, so they will match as much of the input as possible. Let's take a quick look at the differences between lazy and greedy matching.
Lazy matching
Lazy matching attempts to match a rule against the smallest possible substring in the input. In the case of parsing the label matcher `foo!==bar`, we scan the input from left to right until we find the first character that matches an `<operator>`; everything before it is our label name. That means `foo!==bar` would be parsed as `(foo) (!=) (=bar)`.
However, while the rules of lazy matching are relatively simple, it still needs a certain amount of greediness to correctly parse label matchers such as `foo=~bar`, which would otherwise be parsed as `(foo) (=) (~bar)` instead of `(foo) (=~) (bar)`.
Greedy matching
Greedy matching attempts to match a rule against the longest possible substring in the input. In the case of parsing a `<label_name>`, we scan the input until we find the last substring that matches the definition of an operator. This means the previous example would be parsed as `(foo!=) (=) (bar)`.
The problem with greedy matching is that it also requires backtracking, which is where the parser steps back in the input to consider other options when the greediest match doesn't work. Let's look at an example of this:

```
foo!=bar==
```
With greedy matching we ignore the first `!=`, as we are looking for the last substring that matches the definition of an operator. That would be the last `=` at the end of the input. The problem is that if we parse the label name as `foo!=bar=`, we have an operator `=` but no label value.

Instead we need to backtrack and find the second-to-last substring that matches the definition of an operator. With backtracking one step, this would be parsed as `(foo!=bar) (=) (=)`, not `(foo) (!=) (bar==)` as you might have expected.
Backtracking makes the parser more complex, more difficult to write, and slower (because it needs to parse the rest of the input again each time it backtracks). It can also mean we parse label matchers in unexpected ways, as you just witnessed. We can do better.
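Here is a toy Go sketch of greedy matching with backtracking (again, not the Alertmanager implementation) that reproduces the `(foo!=bar) (=) (=)` parse:

```go
package main

import (
	"fmt"
	"strings"
)

// splitGreedy looks for the last operator in the input that still leaves a
// non-empty label value, backtracking towards the start of the input when
// it does not.
func splitGreedy(input string) (name, op, value string, ok bool) {
	for i := len(input) - 1; i >= 0; i-- {
		for _, candidate := range []string{"=~", "!=", "!~", "="} {
			if strings.HasPrefix(input[i:], candidate) {
				name, value = input[:i], input[i+len(candidate):]
				if value != "" {
					return name, candidate, value, true
				}
				// No label value after this operator: backtrack one step.
				break
			}
		}
	}
	return "", "", "", false
}

func main() {
	fmt.Println(splitGreedy("foo!=bar==")) // foo!=bar = = true
}
```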
Change the grammar, stupid!
Instead of trying to disambiguate these ambiguities at parse time, what if we just removed them from the grammar entirely? If we change the grammar to disallow certain characters outside double quotes, we wouldn't have these ambiguities anymore.
Let's add the following restriction to `<unquoted>`:

```
<unquoted> ::= [^,{}!=~,\\\"'`]+
```
This gives us a new grammar:
```
<expr>        ::= "{" <sequence> "}" | <sequence>
<sequence>    ::= <sequence> "," <matcher> | <matcher> | ""
<matcher>     ::= <label_name> <operator> <label_value>
<label_name>  ::= [a-zA-Z_:][a-zA-Z0-9_:]*
<operator>    ::= "=" | "=~" | "!=" | "!~"
<label_value> ::= <quoted> | <unquoted>
<quoted>      ::= (\"(\\.|[^\"])*\")
<unquoted>    ::= [^,{}!=~,\\\"'`]+
```
Let's parse `foo!==bar` again using this new restricted grammar:
- The input starts at the `<expr>` non-terminal. It does not have curly braces so it cannot be a `"{" <sequence> "}"`, but it might be a `<sequence>`.
- The input does not contain a comma so it cannot be a `<sequence> "," <matcher>`, and it is not an empty string so it cannot be `""`, but it might be a `<matcher>`.
- The start of the input is a `<label_name>` because `foo` matches the regular expression `[a-zA-Z_:][a-zA-Z0-9_:]*`.
- The next part of the input is an `<operator>` because `!=` is in the set of valid operators.
- However, the final part of the input cannot be a `<label_value>` because it starts with an unquoted `=`.
This means `foo!==bar` is not a valid label matcher. If we want to make it valid, it must be rewritten as `foo!="=bar"` such that `"=bar"` matches the terminal `<quoted>`.
Let's also take another look at the examples from earlier:
```
foo=bar"     // invalid
foo==bar     // invalid
foo!==bar    // invalid
{foo=bar}},} // invalid
{foo=,bar=}} // invalid
```
It seems we have fixed all the ambiguities from the original grammar, but how can we be sure? We will know our grammar is ambiguous if we can end up with two (or more) different parse trees from the same input.
If we try to parse `foo!==bar`, it is no longer a valid input.
If we try to parse `foo!="=bar"`, it can only be parsed as `(foo) (!=) (=bar)`. It cannot be parsed as `(foo!=) (=) (bar)` because the `=bar` is double-quoted, and it cannot be parsed as `(foo!) (=) (=bar)` either because the `!` character is not permitted outside double quotes.
With the grammar fixed, let's now look at how we can modify it again to support UTF-8 characters in `<label_name>` instead of being restricted by the regular expression `[a-zA-Z_:][a-zA-Z0-9_:]*`.
UTF-8 characters in the label name
What if we just copied the definition for `<label_value>` and used it for `<label_name>`?

```
<label_name>  ::= <quoted> | <unquoted>
<label_value> ::= <quoted> | <unquoted>
<quoted>      ::= (\"(\\.|[^\"])*\")
<unquoted>    ::= [^,{}!=~,\\\"'`]+
```
Then we could write label matchers with UTF-8 characters in both the label name and the label value. But why not just do that to begin with? Why did we need to do all that work earlier removing ambiguous cases from the grammar?
Well, it means that we can now use the same non-terminals for both `<label_name>` and `<label_value>` without any ambiguities. The rules for what is a valid label name and what is a valid label value are exactly the same: you don't need to remember any special cases or use a different syntax depending on whether you're on the left or right hand side of the operator. What we've ended up with is a language that is consistent, easy to read, easy to write, easy to understand, and also easy to parse.
Here are a couple of examples of label matchers that satisfy this grammar:

```
🙂=🙂
"🙂"="🙂"
foo🙂=bar
"foo🙂"="bar"
"foo!="="!=bar"
"foo\""="has escaped quotes"
こんにちは=世界
"こんにちは"="世界"
```
Why not enforce double-quoting and remove unquoted strings?
It's a good question: why didn't we enforce double-quoting everywhere and remove `<unquoted>` from the grammar?

```
<label_name>  ::= <quoted>
<operator>    ::= "=" | "=~" | "!=" | "!~"
<label_value> ::= <quoted>
<quoted>      ::= (\"(\\.|[^\"])*\")
```
I originally wanted to do this, but having discussed it with Beorn and Josh (fellow Prometheus contributors), we decided it would break a lot of Alertmanager configurations unnecessarily. Supporting both quoted and unquoted strings, with unquoted strings having additional restrictions, felt like the best compromise where we could minimize breaking changes to the grammar while still removing all existing parsing ambiguities and at the same time support UTF-8 characters in label names.
The parser
While it would have been possible to change the existing regular expression parser to parse this new grammar, I decided to write a new parser in Go based on Rob Pike's talk on Lexical Scanning in Go. There were a couple of reasons for this:
- I wanted to have a separate lexer that could be re-used in future to also parse labels and annotations, not just label matchers.
- I wanted to provide more meaningful error messages for invalid label matchers, and this is much more difficult to do with regular expressions.
- I believed we could parse label matchers faster and with fewer memory allocations if we wrote our own parser.
The parser itself is a simple LL parser, a kind of top-down parser. It uses one character of look-ahead to determine the end of the current symbol and the start of the next (technically, it is a specific kind of LL parser known as an LL(1) parser).

The lexer iterates over the input from left to right, emits a token for each terminal that matches the grammar, and returns an error as soon as it encounters a sequence that does not match a terminal. It is implemented as a finite-state automaton that moves between states (technically, function pointers) until it encounters an error or the end of the input (just as in Rob's talk).
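The state-function technique from Rob's talk can be sketched as follows. This toy lexer handles only a single unquoted matcher and is purely illustrative, not the Alertmanager lexer:

```go
package main

import (
	"fmt"
	"strings"
)

// token is a terminal emitted by the lexer.
type token struct {
	kind  string // "name", "op" or "value"
	value string
}

type lexer struct {
	input  string
	pos    int
	tokens []token
}

// stateFn is the core of the technique: each state is a function that
// consumes part of the input and returns the next state, or nil when done.
type stateFn func(*lexer) stateFn

func lexName(l *lexer) stateFn {
	start := l.pos
	for l.pos < len(l.input) && !strings.ContainsRune("=!~", rune(l.input[l.pos])) {
		l.pos++
	}
	l.tokens = append(l.tokens, token{"name", l.input[start:l.pos]})
	return lexOp
}

func lexOp(l *lexer) stateFn {
	start := l.pos
	for l.pos < len(l.input) && strings.ContainsRune("=!~", rune(l.input[l.pos])) {
		l.pos++
	}
	l.tokens = append(l.tokens, token{"op", l.input[start:l.pos]})
	return lexValue
}

func lexValue(l *lexer) stateFn {
	// Everything remaining is the value; quoting is omitted in this toy.
	l.tokens = append(l.tokens, token{"value", l.input[l.pos:]})
	return nil // end of input: no next state
}

func main() {
	l := &lexer{input: "foo=~bar"}
	for state := stateFn(lexName); state != nil; {
		state = state(l)
	}
	fmt.Println(l.tokens) // [{name foo} {op =~} {value bar}]
}
```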
This approach made it much easier to return context-rich error messages when compared to using regular expressions. Here are some examples of the improved error messages:
```
{foo
0:4: end of input: expected an operator such as '=', '!=', '=~' or '!~'

{foo=bar
0:8: end of input: expected close paren

foo=bar}
0:8: }: expected opening paren

{foo=bar,,}
9:10: unexpected ,: expected a matcher or close paren after comma
```
Benchmarks
The following benchmarks measure the performance difference between the two parsers when parsing four different label matchers, named `Simple`, `Complex`, `RegexSimple` and `RegexComplex`. Benchmarks with the prefix `BenchmarkMatchers` exercise the new parser, and benchmarks with the prefix `BenchmarkPrometheus` exercise the regular expression parser.
```
BenchmarkMatchersSimple, BenchmarkPrometheusSimple
{foo="bar"}

BenchmarkMatchersComplex, BenchmarkPrometheusComplex
{foo="bar",bar="foo 🙂","baz"!=qux,qux!="baz 🙂"}

BenchmarkMatchersRegexSimple, BenchmarkPrometheusRegexSimple
{foo=~"[a-zA-Z_:][a-zA-Z0-9_:]*"}

BenchmarkMatchersRegexComplex, BenchmarkPrometheusRegexComplex
{foo=~"[a-zA-Z_:][a-zA-Z0-9_:]*",bar=~"[a-zA-Z_:]","baz"!~"[a-zA-Z_:][a-zA-Z0-9_:]*",qux!~"[a-zA-Z_:]"}
```
And the results of those benchmarks:
```
go test -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: github.com/grobinson-grafana/matchers-benchmarks
BenchmarkMatchersRegexSimple-8        488295      2425 ns/op     3248 B/op     49 allocs/op
BenchmarkMatchersRegexComplex-8       138081      9074 ns/op    11448 B/op    169 allocs/op
BenchmarkPrometheusRegexSimple-8      329244      3496 ns/op     3531 B/op     58 allocs/op
BenchmarkPrometheusRegexComplex-8      95188     12554 ns/op    12619 B/op    204 allocs/op
BenchmarkMatchersSimple-8            2888340     414.9 ns/op       56 B/op      2 allocs/op
BenchmarkMatchersComplex-8            741590      1628 ns/op      248 B/op      7 allocs/op
BenchmarkPrometheusSimple-8          1919209     613.9 ns/op      233 B/op      8 allocs/op
BenchmarkPrometheusComplex-8          425430      2803 ns/op     1015 B/op     31 allocs/op
PASS
ok      github.com/grobinson-grafana/matchers-benchmarks    11.766s
```
We can see that the new parser is roughly 30 to 40% faster across the board, and in the non-regex cases makes 70 to 80% fewer memory allocations, when compared to the regular expression parser.
Backwards compatibility and compliance
My main concern with restricting the grammar and adding support for UTF-8 label names was making sure a label matcher couldn't parse one way with the old grammar and a different way with the new grammar; for example, if the label matcher `foo!==bar` parsed as `(foo!=) (=) (bar)` before and `(foo) (!=) (=bar)` after (or the other way around).
To prevent this from happening the new grammar had to satisfy a number of invariants:
- Any label matcher that is ambiguous cannot be parsed by the new grammar.
- Any label matcher that can be parsed by both the old grammar and the new grammar must parse the same.
- The only exception may be escape sequences inside double quotes. For example, `foo="\xf0\x9f\x99\x82"` may be unescaped such that `\xf0\x9f\x99\x82` becomes a 🙂 emoji instead of the literal `"\\xf0\\x9f\\x99\\x82"`.
Any cases where the old grammar and new grammar parsed the same input differently would be referred to as "disagreement". To identify any cases of disagreement I added a compatibility layer to the Alertmanager source code that parsed every label matcher twice, once in the regular expression parser and again in the new parser, and then compared their outputs. If both parsers successfully parsed an input but produced different outputs then we had found a case of disagreement.
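The comparison can be sketched like this. The two parse functions below are self-contained toy stand-ins for the old and new parsers (hard-coded so the sketch runs on its own), not the real Alertmanager code, and the fallback policy shown is just one possible choice:

```go
package main

import (
	"fmt"
	"reflect"
)

type matcher struct{ name, op, value string }

// oldParse stands in for the regular expression parser. This toy
// accepts foo!==bar the way the old grammar did.
func oldParse(input string) ([]matcher, error) {
	if input == "foo!==bar" {
		return []matcher{{"foo", "!=", "=bar"}}, nil
	}
	return []matcher{{"foo", "=", "bar"}}, nil
}

// newParse stands in for the new parser. This toy rejects foo!==bar,
// as the restricted grammar requires.
func newParse(input string) ([]matcher, error) {
	if input == "foo!==bar" {
		return nil, fmt.Errorf("invalid input: unexpected =")
	}
	return []matcher{{"foo", "=", "bar"}}, nil
}

// parseWithCompat parses the input with both parsers, reports disagreement
// when both succeed but produce different results, and falls back to the
// old parser's result when the new parser fails.
func parseWithCompat(input string) ([]matcher, error) {
	oldResult, oldErr := oldParse(input)
	newResult, newErr := newParse(input)
	if oldErr == nil && newErr == nil && !reflect.DeepEqual(oldResult, newResult) {
		fmt.Printf("disagreement on %q: old=%v new=%v\n", input, oldResult, newResult)
	}
	if newErr != nil {
		return oldResult, oldErr
	}
	return newResult, newErr
}

func main() {
	m, err := parseWithCompat("foo=bar")
	fmt.Println(m, err) // [{foo = bar}] <nil>
}
```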
As of January 2025, Alertmanager has had UTF-8 support for almost 1 year. We plan to remove the compatibility layer from Alertmanager in the future.
Summary
I hope you enjoyed reading this post as much as I did writing it. In summary, I talked about why supporting UTF-8 characters was necessary and how we did it. While most of the changes we made were quite simple given that the Go programming language supports UTF-8 strings, supporting UTF-8 for label names in label matchers was much more complicated. It is available and enabled by default (with the compatibility layer) in Alertmanager versions 0.27 and newer.