VictoriaMetrics: Effective alerts, from theory to practice 🛠️
Overview
Once our application is deployed, it is essential to have indicators that help identify potential issues and track performance changes. Among these sources of information, metrics and logs play an essential role by providing valuable insights into the application's operation. Additionally, it is often useful to implement detailed tracing to accurately track all actions performed within the application.
In this series of blog posts, we will explore the various areas of application monitoring. The goal is to thoroughly analyze the state of our applications, in order to improve their availability and performance, while ensuring an optimal user experience.
In a previous blog post, we've seen how to collect and visualize metrics. These metrics allow us to analyze our applications' behavior and performance. It's also crucial to configure alerts to be notified of misbehaviours on our platform.
🎯 Our targets
- 📚 Understand standard approaches for defining effective alerts: "Core Web Vitals" and "Golden Signals"
- 🔍 Discover the PromQL and MetricsQL languages for writing alert rules
- ⚙️ Configure alerts declaratively with the VictoriaMetrics Operator
- 📱 Route these alerts to different Slack channels
📋 Prerequisites
Here we assume you already have:
- A working VictoriaMetrics instance deployed on a Kubernetes cluster
- Access to a Slack workspace for notifications
Setting up relevant alerts is essential to any observability strategy. However, defining appropriate thresholds and avoiding alert fatigue requires a thoughtful and methodical approach.
We'll see in this article that it's very easy to set thresholds beyond which we would be notified. However, making these alerts relevant isn't always straightforward.
🔔 What Makes a Good Alert?

A properly configured alert allows us to identify and resolve problems within our system proactively, before the situation becomes worse. Effective alerts should:
- Signal problems requiring immediate intervention
- Be triggered at the right time: early enough to prevent user impact, but not so frequently as to cause alert fatigue
- Indicate the root cause or the area requiring investigation. To achieve this, it's recommended to prioritize metrics that directly reflect service quality and user experience (SLIs).
Therefore, it's important to focus on a controlled number of metrics to monitor. There are approaches that allow us to implement effective monitoring of our systems. Here we'll focus on two widely used alert models: Core Web Vitals and Golden Signals.
π The "Core Web Vitals"
Core Web Vitals are metrics developed by Google to evaluate the user experience on web applications. They highlight metrics related to end-user satisfaction and help ensure our application offers good performance for real users. These metrics focus on three main aspects:

Largest Contentful Paint (LCP), Page Load Time: LCP measures the time needed for the largest visible content element on a web page (for example, an image, video, or large text block) to be fully rendered in the web browser. A good LCP is below 2.5 seconds.
Interaction to Next Paint (INP), Responsiveness: INP evaluates a web page's responsiveness by measuring the latency of all user interactions, such as clicks, taps, and keyboard inputs. It reflects the time needed for a page to visually respond to an interaction, that is, the delay before the browser displays the next render after a user action. A good INP should be less than 200 milliseconds.
Cumulative Layout Shift (CLS), Visual Stability: CLS evaluates visual stability by quantifying unexpected layout shifts on a page, when elements move during loading or interaction. A good CLS score is less than or equal to 0.1.
A website's performance is considered satisfactory if it reaches the thresholds described above at the 75th percentile, thus favoring a good user experience and, consequently, better retention and search engine optimization (SEO).
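For illustration, assuming a RUM pipeline exposes LCP as a Prometheus histogram (the metric name rum_lcp_seconds_bucket below is purely hypothetical), the 75th percentile could be compared to the 2.5-second threshold with a query like this sketch:

```promql
# p75 LCP across all users over the last 10 minutes, compared to the 2.5s threshold
histogram_quantile(0.75, sum(rate(rum_lcp_seconds_bucket[10m])) by (le)) > 2.5
```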
Adding specific alerts for these metrics requires careful consideration. Unlike classic metrics, such as availability or error rates, which directly reflect system stability, Web Vitals depend on many external factors, such as users' network conditions or their devices, making thresholds more complex to monitor effectively.
To avoid unnecessary alert overload, these alerts should only target significant degradations. For example, a sudden increase in CLS (visual stability) or a continuous deterioration of LCP (load time) over several days might indicate important problems requiring intervention.
Finally, these alerts require appropriate tools, such as RUM (Real User Monitoring) for real data or Synthetic Monitoring for simulated tests, which require a specific solution not covered in this article.
✨ The "Golden Signals"

The Golden Signals are a set of four key metrics, widely used in the field of system and application monitoring, particularly with tools like Prometheus. These signals allow effective monitoring of application health and performance. They are particularly appropriate in the context of a distributed architecture:
Latency ⏳: The time it takes to serve a request, including both successful and failed requests. Latency is crucial because an increase in response time can indicate performance problems.
Traffic 📶: The demand placed on the system. It can be measured in terms of requests per second, data throughput, or other metrics that express system load.
Errors ❌: The failure rate of requests or transactions. This can include application errors, infrastructure errors, or any situation where a request didn't complete correctly (for example, HTTP 5xx responses or rejected requests).
Saturation 📈: A measure of system resource usage, such as CPU, memory, or network bandwidth. Saturation indicates how close the system is to its limits. A saturated system can lead to slowdowns or failures.
These Golden Signals are essential because they allow us to focus monitoring on critical aspects that can quickly affect user experience or overall system performance. With Prometheus, these signals are often monitored via specific metrics to trigger alerts when certain thresholds are exceeded.
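For example, a latency alert on the 99th percentile could be expressed with a sketch like the one below; it assumes the application exposes a standard duration histogram (here named http_request_duration_seconds_bucket, which may differ in your setup):

```promql
# Fires when the p99 request duration exceeds 500ms over the last 5 minutes
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
```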
I've mentioned here two methodologies that I find are a good starting point for optimizing our alerting system. That said, others exist, each with their own specificities; we can mention USE (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration), among others.
Similarly, beyond the Core Web Vitals presented above, other web metrics like FCP (First Contentful Paint) or TTFB (Time To First Byte) can prove useful depending on your specific needs.
The main thing is to keep in mind that a good alerting strategy relies on a targeted set of relevant metrics 🎯
You got it: Defining alerts requires thought! Now let's get practical and see how to define thresholds from our metrics.
🔍 Understanding the PromQL and MetricsQL Query Languages
Metrics collected with Prometheus can be queried using a specific language called PromQL (Prometheus Query Language). This language allows extracting monitoring data, performing calculations, aggregating results, applying filters, and also configuring alerts.
(ℹ️ Refer to the previous article to understand what we mean by metric.)
PromQL is a powerful language; here are some simple examples applied to metrics exposed by an Nginx web server:
Total number of processed requests (nginx_http_requests_total) - returns the total count since server start:

```promql
nginx_http_requests_total
```

Request rate over a 5-minute window - calculates requests per second:

```promql
rate(nginx_http_requests_total[5m])
```

Error rate - calculates 5xx errors per second over the last 5 minutes:

```promql
rate(nginx_http_requests_total{status=~"5.."}[5m])
```

Request rate by pod - calculates requests/sec for each pod in namespace "myns":

```promql
sum(rate(nginx_http_requests_total{namespace="myns"}[5m])) by (pod)
```
💡 In the examples above, we made use of two Golden Signals: traffic 📶 and errors ❌.
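Building on the same nginx metric, the error signal is often expressed as a ratio of failed requests to total traffic, which makes a convenient SLI. A possible sketch:

```promql
# Share of 5xx responses over the last 5 minutes (value between 0 and 1)
sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(nginx_http_requests_total[5m]))
```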
MetricsQL is the language used with VictoriaMetrics. It aims to be compatible with PromQL, with slight differences that make it easier to write complex queries. It also brings new functions; here are some examples:

histogram(q): This function calculates a histogram for each group of points having the same timestamp, which is useful for visualizing a large number of time series via a heatmap. To create a histogram of HTTP requests:

```promql
histogram(rate(vm_http_requests_total[5m]))
```

quantiles("phiLabel", phi1, ..., phiN, q): Used to extract multiple quantiles (or percentiles) from a given metric. To calculate the 50th, 90th, and 99th percentiles of HTTP request rate:

```promql
quantiles("percentile", 0.5, 0.9, 0.99, rate(vm_http_requests_total[5m]))
```
To test your queries, you can use the demo provided by VictoriaMetrics: https://play.victoriametrics.com

🛠️ Configuring Alerts with the VictoriaMetrics Operator
VictoriaMetrics offers two essential components for alert management:
- VMAlert: responsible for evaluating alert rules
- AlertManager: manages routing and distribution of notifications
VMAlert: The Rule Evaluation Engine
VMAlert is the component that continuously evaluates defined alert rules. It supports two types of rules:
Recording Rules 📊: Recording rules allow pre-calculating complex PromQL expressions and storing them as new metrics to optimize performance.
Alerting Rules 🚨: Alerting rules define conditions that trigger alerts when certain thresholds are exceeded.
In this blog post, we'll focus on alerting rules which are essential for proactive problem detection.
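For reference, a recording rule can be declared through the same VMRule custom resource; the minimal sketch below pre-aggregates the nginx request rate per namespace (the rule name, metric name, and namespace are illustrative):

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: recording-rules-example
  namespace: observability
spec:
  groups:
    - name: http-aggregations
      rules:
        # Pre-computed series that dashboards and alerts can reuse cheaply
        - record: namespace:nginx_http_requests:rate5m
          expr: sum(rate(nginx_http_requests_total[5m])) by (namespace)
```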
The rest of this article comes from a set of configurations you can find in the Cloud Native Ref repository. It uses many operators, including the one for VictoriaMetrics. This project aims to quickly start a complete platform that applies best practices in terms of automation, monitoring, security, etc. Comments and contributions are welcome 🙏
Declaring an Alerting Rule with VMRule
We've seen previously that VictoriaMetrics provides a Kubernetes operator that allows managing different components declaratively. Among the available custom resources, VMRule allows defining alerts and recording rules.
If you've already used the Prometheus operator, you'll find a very similar syntax, as the VictoriaMetrics operator is compatible with Prometheus custom resources. (This makes it easy to migrate 😉)
Let's take a concrete example with a VMRule that monitors the health state of Flux resources:
flux/observability/vmrule.yaml
```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  labels:
    prometheus-instance: main
  name: flux-system
  namespace: flux-system
spec:
  groups:
    - name: flux-system
      rules:
        - alert: FluxReconciliationFailure
          annotations:
            message: Flux resource has been unhealthy for more than 5m
            description: "{{ $labels.kind }} {{ $labels.exported_namespace }}/{{ $labels.name }} reconciliation has been failing for more than ten minutes."
            runbook_url: "https://fluxcd.io/flux/cheatsheets/troubleshooting/"
            dashboard: "https://grafana.priv.${domain_name}/dashboards"
          expr: max(gotk_reconcile_condition{status="False",type="Ready"}) by (exported_namespace, name, kind) + on(exported_namespace, name, kind) (max(gotk_reconcile_condition{status="Deleted"}) by (exported_namespace, name, kind)) * 2 == 1
          for: 10m
          labels:
            severity: warning
```
It's recommended to follow some best practices to provide maximum context for quickly identifying the root cause.
Naming and Organization 📛
- Use descriptive names for rules, like FluxReconciliationFailure
- Group rules by component (ex: flux-system, flux-controllers)
- Document reconciliation conditions in annotations
Thresholds and Durations ⏱️
- Adjust the alert evaluation duration (for: 10m) to avoid false positives
- Adapt thresholds according to the type of monitored resources
- Consider different durations depending on the environment (prod/staging)
Labels and Routing 🏷️
- Add labels for routing according to context. My example isn't very advanced as it's a demo configuration, but we could very well add, for instance, a team label to route alerts to the right team, or have different routing policies depending on the environment:

```yaml
labels:
  severity: [critical|warning|info]
  team: [sre|dev|ops]
  environment: [prod|staging|dev]
```
The Importance of Annotations 📝
Annotations allow adding various information about the alert context:
- A clear description of the reconciliation problem
- The link to the runbook for Flux troubleshooting
- The link to the dedicated Grafana dashboard

PromQL Query 🔍
This alert will trigger if Flux fails to reconcile a resource. In detail:

```yaml
expr: |
  max(gotk_reconcile_condition{status="False",type="Ready"}) by (exported_namespace, name, kind)
  + on(exported_namespace, name, kind)
  (max(gotk_reconcile_condition{status="Deleted"}) by (exported_namespace, name, kind)) * 2 == 1
```
- The gotk_reconcile_condition metric exposes the health state of Flux resources
- The filter status="False",type="Ready" identifies resources that aren't in the "Ready" state
- The second part of the expression (status="Deleted") detects resources that have been deleted
- The operation + on(...) (...) * 2 == 1 combines these conditions: the result equals 1 only when a resource isn't "Ready" (first part = 1) and has not been deleted (second part = 0). For a deleted resource, the second part adds 2, so the sum can never equal 1 and no alert fires for resources that no longer exist
- The max and by (...) clauses group alerts by namespace, name, and resource type
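Following the thresholds-and-durations advice above, such a rule could also be paired with a stricter variant that escalates long-lasting failures. The sketch below is not part of the referenced repository and only illustrates the idea (it would be added to the same rules list):

```yaml
- alert: FluxReconciliationFailureCritical
  expr: |
    max(gotk_reconcile_condition{status="False",type="Ready"}) by (exported_namespace, name, kind)
    + on(exported_namespace, name, kind)
    (max(gotk_reconcile_condition{status="Deleted"}) by (exported_namespace, name, kind)) * 2 == 1
  for: 30m
  labels:
    severity: critical
```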
💬 Integration with Slack
We can send these alerts through different channels or tools: Grafana OnCall, Opsgenie, PagerDuty, or simply email, among others.
In our example, we're sending notifications to a Slack channel. We'll first create a Slack application and retrieve the generated token before configuring VictoriaMetrics.
Slack Application Configuration
Application Creation 🔧
- This is done on https://api.slack.com/apps
- Click on "Create New App"
- Choose "From scratch"
- Name the application (ex: "AlertManager")
- Select the target workspace
Permission Configuration 🔐
In "OAuth & Permissions", add the following scopes:
- chat:write (Required)
- chat:write.public (For posting in public channels)
- channels:read (For listing channels)
- groups:read (For private groups)
Installation and Token 🗝️
- Install the application in the workspace
- Copy the "Bot User OAuth Token" (starts with xoxb-)
- Store the token securely. In our example, the secret is retrieved from AWS Secrets Manager using the External Secrets operator, as sketched below.
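With the External Secrets operator, the corresponding resource could look like the sketch below; the secret store name and the Secrets Manager key are assumptions to adapt to your environment:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: victoria-metrics-k8s-stack-alertmanager-slack-app
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager            # assumed store name
  target:
    name: victoria-metrics-k8s-stack-alertmanager-slack-app
  data:
    - secretKey: token
      remoteRef:
        key: observability/alertmanager-slack-app   # assumed key in AWS Secrets Manager
```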
AlertManager Configuration for Slack
The rest of the configuration is done using Helm values to configure AlertManager:
observability/base/victoria-metrics-k8s-stack/vm-common-helm-values-configmap.yaml
- Configure AlertManager to use the Slack token
```yaml
alertmanager:
  enabled: true
  spec:
    externalURL: "https://vmalertmanager-${cluster_name}.priv.${domain_name}"
    secrets:
      - "victoria-metrics-k8s-stack-alertmanager-slack-app"
  config:
    global:
      slack_api_url: "https://slack.com/api/chat.postMessage"
      http_config:
        authorization:
          credentials_file: /etc/vm/secrets/victoria-metrics-k8s-stack-alertmanager-slack-app/token
```
The External Secrets Operator retrieves the Slack token from AWS Secrets Manager and stores it in a Kubernetes secret named victoria-metrics-k8s-stack-alertmanager-slack-app. This secret is then referenced in the Helm values to configure AlertManager's authentication (config.global.http_config.authorization.credentials_file).
- Routing Explanation
```yaml
route:
  group_by:
    - cluster
    - alertname
    - severity
    - namespace
  group_interval: 5m
  group_wait: 30s
  repeat_interval: 3h
  receiver: "slack-monitoring"
  routes:
    - matchers:
        - alertname =~ "InfoInhibitor|Watchdog|KubeCPUOvercommit"
      receiver: "blackhole"
receivers:
  - name: "blackhole"
  - name: "slack-monitoring"
```
Alert Grouping: Alert grouping is important to reduce noise and improve notification readability. Without grouping, each alert would be sent individually, which could quickly become unmanageable. The chosen grouping criteria allow logical organization:
- group_by: defines the labels to group alerts by
- group_wait: 30s delay before the initial notification to allow grouping
- group_interval: 5m interval between notifications for the same group
- repeat_interval: alerts are only repeated every 3h to avoid spam
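These defaults can also be overridden per sub-route when a team prefers finer-grained notifications. A hedged sketch (the team label and the slack-sre receiver are assumptions, not part of the repository configuration):

```yaml
routes:
  - matchers:
      - team = "sre"
    # Tighter grouping and faster first notification for this team only
    group_by: ["alertname", "pod"]
    group_wait: 10s
    receiver: "slack-sre"
```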
Receivers: Receivers are AlertManager components that define how and where to send alert notifications. They can be configured for different communication channels like Slack, Email, PagerDuty, etc. In our configuration:
- slack-monitoring: the main receiver, which sends alerts to a specific Slack channel with custom formatting
- blackhole: a special receiver that "absorbs" alerts without transmitting them anywhere, useful for filtering non-relevant or purely technical alerts
Alert routing can be customized based on your team structure and needs. Here's a practical example:
Let's say your organization has an on-call team that needs to be notified immediately about urgent issues. You can route alerts to them when:
- The alert comes from production or security environments
- The issue requires immediate attention from the on-call team
```yaml
- matchers:
    - environment =~ "prod|security"
    - team = "oncall"
  receiver: "pagerduty"
```
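The matching receiver then needs to be declared alongside the existing ones. Here is a sketch for PagerDuty, assuming an Events API v2 integration whose key would, in practice, come from a mounted secret rather than being inlined:

```yaml
receivers:
  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: "<your-pagerduty-integration-key>"   # placeholder, not a real key
        severity: '{{ .CommonLabels.severity }}'
        send_resolved: true
```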
- Custom Templates 📄
This configuration block defines a Slack receiver for AlertManager that uses Monzo templates. Monzo templates are a set of notification templates that allow formatting Slack alerts in an elegant and informative way.
```yaml
alertmanager:
  config:
    receivers:
      - name: "slack-monitoring"
        slack_configs:
          - channel: "#alerts"
            send_resolved: true
            title: '{{ template "slack.monzo.title" . }}'
            icon_emoji: '{{ template "slack.monzo.icon_emoji" . }}'
            color: '{{ template "slack.monzo.color" . }}'
            text: '{{ template "slack.monzo.text" . }}'
            actions:
              - type: button
                text: "Runbook :green_book:"
                url: "{{ (index .Alerts 0).Annotations.runbook_url }}"
              - type: button
                text: "Query :mag:"
                url: "{{ (index .Alerts 0).GeneratorURL }}"
              - type: button
                text: "Dashboard :grafana:"
                url: "{{ (index .Alerts 0).Annotations.dashboard }}"
              - type: button
                text: "Silence :no_bell:"
                url: '{{ template "__alert_silence_link" . }}'
              - type: button
                text: '{{ template "slack.monzo.link_button_text" . }}'
                url: "{{ .CommonAnnotations.link_url }}"
```
The notification format shown below demonstrates how alerts can be enriched with interactive elements. Users can quickly access relevant information through action buttons that link to the Grafana dashboard 📊, view the associated runbook 📖, or silence noisy alerts 🔕 when needed.

👀 Visualizing and Interacting with Alerts
VictoriaMetrics and its ecosystem provide multiple interfaces for managing and viewing alerts. Here are the main options available:
Alertmanager: The Standard Solution
Alertmanager is the standard component that allows:
- Viewing current alert state
- Configuring notification routing
- Managing silences (temporarily pausing alerts)
- Consulting alert history

VMUI: The Native VictoriaMetrics Interface
VMUI offers a simplified interface for:
- Viewing active alerts
- Visualizing alert rules
- Displaying associated metrics

Grafana Alerting: A Complete Solution
Although we use Alertmanager for alert definition and routing, Grafana Alerting offers a complete alternative solution that allows:
- Centralizing alert management
- Viewing alerts in the context of dashboards
- Configuring alert rules directly from the interface
- Managing silences and notifications

The choice of interface depends on your specific needs:
- Alertmanager is ideal for operational alert management
- VMUI is perfect for a quick and simple view
- Grafana Alerting is recommended if you want a solution integrated with your dashboards
🎯 Conclusion
Defining relevant alerts is a key element of any observability strategy. The VictoriaMetrics operator, with its Kubernetes custom resources like VMRule, greatly simplifies setting up an effective alerting system. Declarative configuration allows quickly defining complex alert rules while maintaining excellent code readability and maintainability.
However, the technical configuration of alerts, even with powerful tools like VictoriaMetrics, isn't sufficient on its own. An effective alerting strategy must integrate into a broader organizational framework:
- Clear definition of on-call procedures
- Identification of teams responsible for monitoring
- Implementation of runbooks and incident response procedures
- Adaptation of notification channels according to criticality and context
Discover how to integrate these alerts with other components of your observability stack in upcoming articles in this series, particularly correlation with logs and distributed tracing.