VictoriaMetrics : Effective alerts, from theory to practice πŸ› οΈ

Overview

How are our applications performing? πŸ‘οΈ

Once our application is deployed, it is essential to have indicators that help identify potential issues and track performance changes. Among these sources of information, metrics and logs play an essential role by providing valuable insights into the application's operation. Additionally, it is often useful to implement detailed tracing to accurately track all actions performed within the application.

In this series of blog posts, we will explore the various areas of application monitoring. The goal is to thoroughly analyze the state of our applications, in order to improve their availability and performance, while ensuring an optimal user experience.

In a previous blog post, we've seen how to collect and visualize metrics. These metrics allow us to analyze our applications' behavior and performance. It's also crucial to configure alerts to be notified of misbehaviours on our platform.

🎯 Our targets

  • πŸ“Š Understand standard approaches for defining effective alerts: "Core Web Vitals" and "Golden Signals"
  • πŸ” Discover PromQL and MetricsQL languages for writing alert rules
  • βš™οΈ Configure alerts declaratively with VictoriaMetrics Operator
  • πŸ“± Route these alerts to different Slack channels

πŸ“‹ Prerequisites

Here we assume you already have:

  • A working VictoriaMetrics instance deployed on a Kubernetes cluster
  • Access to a Slack workspace for notifications

Setting up relevant alerts is essential to any observability strategy. However, defining appropriate thresholds and avoiding alert fatigue requires a thoughtful and methodical approach.

We'll see in this article that it's very easy to set thresholds beyond which we would be notified. However, making these alerts relevant isn't always straightforward.

πŸ” What Makes a Good Alert?

The Guard Dog vs. The Yappy Chihuahua

A properly configured alert allows us to identify and resolve problems within our system proactively, before the situation becomes worse. Effective alerts should:

  • Signal problems requiring immediate intervention
  • Be triggered at the right time: early enough to prevent user impact, but not so frequently as to cause alert fatigue
  • Indicate the root cause or area requiring investigation. To achieve this, it's recommended to perform an analysis that prioritizes relevant metrics that directly reflect service quality and user experience (SLIs)

Therefore, it's important to focus on a controlled number of metrics to monitor. There are approaches that allow us to implement effective monitoring of our systems. Here we'll focus on two widely used alert models: Core Web Vitals and Golden Signals.

🌐 The "Core Web Vitals"

Core Web Vitals are metrics developed by Google to evaluate the user experience on web applications. They highlight metrics related to end-user satisfaction and help ensure our application offers good performance for real users. These metrics focus on three main aspects:

Core Web Vitals
  • Largest Contentful Paint (LCP), Page Load Time: LCP measures the time needed for the largest visible content element on a web page (for example, an image, video, or large text block) to be fully rendered in the web browser. A good LCP is below 2.5 seconds.

  • Interaction to Next Paint (INP), Responsiveness: INP evaluates a web page's responsiveness by measuring the latency of all user interactions, such as clicks, taps, and keyboard inputs, etc. It reflects the time needed for a page to visually respond to an interaction, that is, the delay before the browser displays the next render after a user action. A good INP should be less than 200 milliseconds

  • Cumulative Layout Shift (CLS), Visual Stability: CLS evaluates visual stability by quantifying unexpected layout shifts on a page, when elements move during loading or interaction. A good CLS score is less than or equal to 0.1.

A website's performance is considered satisfactory if it reaches the thresholds described above at the 75th percentile, thus favoring a good user experience and, consequently, better retention and search engine optimization (SEO).
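
To turn this into something we can alert on, assuming a RUM pipeline exports Core Web Vitals as Prometheus histograms (the metric name web_vitals_lcp_seconds_bucket below is purely illustrative), the 75th percentile of LCP could be queried like this:

    histogram_quantile(0.75, sum(rate(web_vitals_lcp_seconds_bucket[5m])) by (le))

An alerting rule could then compare this value to the 2.5-second threshold mentioned above.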

Be Careful with Core Web Vitals Alerts

Adding specific alerts for these metrics requires careful consideration. Unlike classic metrics, such as availability or error rates, which directly reflect system stability, Web Vitals depend on many external factors, such as users' network conditions or their devices, making thresholds more complex to monitor effectively.

To avoid unnecessary alert overload, these alerts should only target significant degradations. For example, a sudden increase in CLS (visual stability) or a continuous deterioration of LCP (load time) over several days might indicate important problems requiring intervention.

Finally, these alerts require appropriate tools, such as RUM (Real User Monitoring) for real data or Synthetic Monitoring for simulated tests, which require a specific solution not covered in this article.

✨ The "Golden Signals"

Golden Signals

The Golden Signals are a set of four key metrics, widely used in the field of system and application monitoring, particularly with tools like Prometheus. These signals allow effective monitoring of application health and performance. They are particularly appropriate in the context of a distributed architecture:

  • Latency ⏳: It includes both successful request time and failed request time. Latency is crucial because an increase in response time can indicate performance problems.

  • Traffic πŸ“Ά: It can be measured in terms of requests per second, data throughput, or other metrics that express system load.

  • Errors ❌: This is the failure rate of requests or transactions. This can include application errors, infrastructure errors, or any situation where a request didn't complete correctly (for example, HTTP 5xx responses or rejected requests).

  • Saturation πŸ“ˆ: This is a measure of system resource usage, such as CPU, memory, or network bandwidth. Saturation indicates how close the system is to its limits. A saturated system can lead to slowdowns or failures.

These Golden Signals are essential because they allow us to focus monitoring on critical aspects that can quickly affect user experience or overall system performance. With Prometheus, these signals are often monitored via specific metrics to trigger alerts when certain thresholds are exceeded.
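
As an illustration, here are hedged sketches of how two of these signals could be expressed in PromQL, assuming a standard request-duration histogram (http_request_duration_seconds_bucket) plus the usual cAdvisor and kube-state-metrics metrics; the metric names and thresholds are illustrative, not prescriptive:

    # Latency: 95th percentile of request duration over the last 5 minutes
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

    # Saturation: CPU usage per pod compared to its CPU limit
    sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
      / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod)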

Other Methods and Metrics

I've mentioned two methodologies here that I find to be a good starting point for optimizing our alerting system. That said, others exist, each with their specificities; we can mention USE or RED, among others.

Similarly, beyond the Core Web Vitals presented above, other web metrics like FCP (First Contentful Paint) or TTFB (Time To First Byte) can prove useful depending on your specific needs.

The main thing is to keep in mind that a good alerting strategy relies on a targeted set of relevant metrics 🎯

You got it: Defining alerts requires thought! Now let's get practical and see how to define thresholds from our metrics.

πŸ” Understanding PromQL and MetricsQL Query Languages

Metrics collected with Prometheus can be queried using a specific language called PromQL (Prometheus Query Language). This language allows extracting monitoring data, performing calculations, aggregating results, applying filters, and also configuring alerts.

(ℹ️ Refer to the previous article to understand what we mean by metric.)

PromQL is a powerful language. Here are some simple examples applied to metrics exposed by an Nginx web server:

  • Total number of processed requests (nginx_http_requests_total) - returns the total count since server start:

    nginx_http_requests_total
    
  • Request rate over a 5-minute window - calculates requests per second:

    rate(nginx_http_requests_total[5m])
    
  • Error rate - calculates 5xx errors per second over the last 5 minutes:

    rate(nginx_http_requests_total{status=~"5.."}[5m])
    
  • Request rate by pod - calculates requests/sec for each pod in namespace "myns":

    sum(rate(nginx_http_requests_total{namespace="myns"}[5m])) by (pod)
    

πŸ’‘ In the examples above, we made use of two Golden Signals: traffic πŸ“Ά and errors ❌.
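
These two signals can also be combined into an error ratio, which often makes a better alerting condition than a raw error rate because it accounts for traffic volume. A sketch, where the 5% threshold is purely illustrative:

    sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
      / sum(rate(nginx_http_requests_total[5m])) > 0.05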

MetricsQL is the language used with VictoriaMetrics. It aims to be compatible with PromQL with slight differences that make it easier to write complex queries.
It also brings new functions; here are some examples:

  • histogram(q): This function calculates a histogram for each group of points having the same timestamp, which is useful for visualizing a large number of time series via a heatmap.
    To create a histogram of HTTP requests:

    histogram(rate(vm_http_requests_total[5m]))
    
  • quantiles("phiLabel", phi1, ..., phiN, q): Used to extract multiple quantiles (or percentiles) from a given metric.
    To calculate the 50th, 90th, and 99th percentiles of HTTP request rate:

    quantiles("percentile", 0.5, 0.9, 0.99, rate(vm_http_requests_total[5m]))
    

To test your queries, you can use the demo provided by VictoriaMetrics: https://play.victoriametrics.com

πŸ› οΈ Configuring Alerts with the VictoriaMetrics Operator

VictoriaMetrics offers two essential components for alert management:

  • VMAlert: responsible for evaluating alert rules
  • AlertManager: manages routing and distribution of notifications

VMAlert: The Rule Evaluation Engine

VMAlert is the component that continuously evaluates defined alert rules. It supports two types of rules:

  • Recording Rules πŸ“Š Recording rules allow pre-calculating complex PromQL expressions and storing them as new metrics to optimize performance.

  • Alerting Rules 🚨 Alerting rules define conditions that trigger alerts when certain thresholds are exceeded.

In this blog post, we'll focus on alerting rules which are essential for proactive problem detection.
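
Even though we won't cover them further, here is a hedged sketch of what a recording rule could look like inside a VMRule (the recorded metric name and expression are illustrative):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: recording-rules-example
  namespace: observability # assumption: adapt to your setup
spec:
  groups:
    - name: http-aggregations
      rules:
        # Pre-compute the per-namespace request rate so dashboards and alerts
        # can reuse it without re-evaluating the full expression each time
        - record: namespace:nginx_http_requests:rate5m
          expr: sum(rate(nginx_http_requests_total[5m])) by (namespace)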

Concrete Examples

The rest of this article comes from a set of configurations you can find in the Cloud Native Ref repository.
It uses many operators, including the one for VictoriaMetrics.

This project aims to quickly start a complete platform that applies best practices in terms of automation, monitoring, security, etc.
Comments and contributions are welcome πŸ™

Declaring an Alerting Rule with VMRule

We've seen previously that VictoriaMetrics provides a Kubernetes operator that allows managing different components declaratively. Among the available custom resources, VMRule allows defining alerts and recording rules.

If you've already used the Prometheus operator, you'll find the syntax very similar, as the VictoriaMetrics operator is compatible with Prometheus custom resources (which makes migration easy πŸ˜‰).

Let's take a concrete example with a VMRule that monitors the health state of Flux resources:

flux/observability/vmrule.yaml

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  labels:
    prometheus-instance: main
  name: flux-system
  namespace: flux-system
spec:
  groups:
    - name: flux-system
      rules:
        - alert: FluxReconciliationFailure
          annotations:
            message: Flux resource has been unhealthy for more than 10m
            description: "{{ $labels.kind }} {{ $labels.exported_namespace }}/{{ $labels.name }} reconciliation has been failing for more than ten minutes."
            runbook_url: "https://fluxcd.io/flux/cheatsheets/troubleshooting/"
            dashboard: "https://grafana.priv.${domain_name}/dashboards"
          expr: max(gotk_reconcile_condition{status="False",type="Ready"}) by (exported_namespace, name, kind) + on(exported_namespace, name, kind) (max(gotk_reconcile_condition{status="Deleted"}) by (exported_namespace, name, kind)) * 2 == 1
          for: 10m
          labels:
            severity: warning

It's recommended to follow some best practices to provide maximum context for quickly identifying the root cause.

  1. Naming and Organization πŸ“

    • Use descriptive names for rules, like FluxReconciliationFailure
    • Group rules by component (ex: flux-system, flux-controllers)
    • Document reconciliation conditions in annotations
  2. Thresholds and Durations ⏱️

    • Adjust alert evaluation duration for: 10m to avoid false positives
    • Adapt thresholds according to the type of monitored resources
    • Consider different durations depending on the environment (prod/staging)
  3. Labels and Routing 🏷️

    • Add labels for routing according to context. My example isn't very advanced as it's a demo configuration. But we could very add, for instance, a team label to route to the right team, or have different routing policies depending on the environment.
      labels:
        severity: [critical|warning|info]
        team: [sre|dev|ops]
        environment: [prod|staging|dev]
      
  4. The Importance of Annotations πŸ“š

Annotations allow adding various pieces of information about the alert context:

  • A clear description of the reconciliation problem
  • The link to the runbook for Flux troubleshooting
  • The link to the dedicated Grafana dashboard
  5. PromQL Query πŸ”
    expr: |
      max(gotk_reconcile_condition{status="False",type="Ready"}) by (exported_namespace, name, kind)
      + on(exported_namespace, name, kind)
      (max(gotk_reconcile_condition{status="Deleted"}) by (exported_namespace, name, kind)) * 2 == 1
    
    This alert will trigger if Flux fails to reconcile a resource. In detail:
    • The gotk_reconcile_condition metric exposes the health state of Flux resources
    • The filter status="False",type="Ready" identifies resources that aren't in the "Ready" state
    • The second part of the expression (status="Deleted") detects resources that have been deleted
    • The operation + on(...) (...) * 2 == 1 combines these conditions to trigger an alert when:
      • A resource isn't "Ready" (first part = 1) AND has not been deleted (second part = 0)
      • OR a resource has been deleted (second part = 2) regardless of Ready state
    • The max and by allow grouping alerts by namespace, name, and resource type

πŸ’¬ Integration with Slack

We can send these alerts through different channels or tools: Grafana OnCall, Opsgenie, PagerDuty, or simply email, among others.

In our example, we're sending notifications to a Slack channel. We'll first create a Slack application and retrieve the generated token before configuring VictoriaMetrics.

Slack Application Configuration

  1. Application Creation πŸ”§

    • This is done on https://api.slack.com/apps
    • Click on "Create New App"
    • Choose "From scratch"
    • Name the application (ex: "AlertManager")
    • Select the target workspace
  2. Permission Configuration πŸ”‘ In "OAuth & Permissions", add the following scopes:

    • chat:write (Required)
    • chat:write.public (For posting in public channels)
    • channels:read (For listing channels)
    • groups:read (For private groups)
  3. Installation and Token 🎟️
    • Install the application in the workspace
    • Copy the "Bot User OAuth Token" (starts with xoxb-)
    • Store the token securely. In our example, the secret is retrieved from AWS Secrets Manager using the External Secrets operator.

AlertManager Configuration for Slack

The rest of the configuration is done using Helm values to configure AlertManager.

observability/base/victoria-metrics-k8s-stack/vm-common-helm-values-configmap.yaml

  1. Configure AlertManager to use the Slack token
    alertmanager:
      enabled: true
      spec:
        externalURL: "https://vmalertmanager-${cluster_name}.priv.${domain_name}"
        secrets:
          - "victoria-metrics-k8s-stack-alertmanager-slack-app"
      config:
        global:
          slack_api_url: "https://slack.com/api/chat.postMessage"
          http_config:
            authorization:
              credentials_file: /etc/vm/secrets/victoria-metrics-k8s-stack-alertmanager-slack-app/token

The External Secrets Operator retrieves the Slack token from AWS Secrets Manager and stores it in a Kubernetes secret named victoria-metrics-k8s-stack-alertmanager-slack-app. This secret is then referenced in the Helm values to configure AlertManager's authentication (config.global.http_config.authorization.credentials_file).
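
For reference, here is a hedged sketch of what that ExternalSecret could look like, assuming a ClusterSecretStore backed by AWS Secrets Manager; the store name, namespace, and secret path are assumptions to adapt to your own setup:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: victoria-metrics-k8s-stack-alertmanager-slack-app
  namespace: observability # assumption: the namespace where AlertManager runs
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager # assumption: your store name may differ
  target:
    name: victoria-metrics-k8s-stack-alertmanager-slack-app
  data:
    - secretKey: token # key expected by the credentials_file path above
      remoteRef:
        key: observability/alertmanager-slack-app # assumption: path in AWS Secrets Manager
        property: token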

  2. Routing Explanation
        route:
          group_by:
            - cluster
            - alertname
            - severity
            - namespace
          group_interval: 5m
          group_wait: 30s
          repeat_interval: 3h
          receiver: "slack-monitoring"
          routes:
            - matchers:
                - alertname =~ "InfoInhibitor|Watchdog|KubeCPUOvercommit"
              receiver: "blackhole"
        receivers:
          - name: "blackhole"
          - name: "slack-monitoring"
  • Alert Grouping: Alert grouping is important to reduce noise and improve notification readability. Without grouping, each alert would be sent individually, which could quickly become unmanageable. The chosen grouping criteria allow logical organization:

    • group_by defines the labels to group alerts by
    • group_wait: 30s delay before initial notification to allow grouping
    • group_interval: 5m interval between notifications for the same group
    • repeat_interval: Alerts are only repeated every 3h to avoid spam
  • Receivers: Receivers are AlertManager components that define how and where to send alert notifications. They can be configured for different communication channels like Slack, Email, PagerDuty, etc. In our configuration:

    • slack-monitoring: Main receiver that sends alerts to a specific Slack channel with custom formatting
    • blackhole: Special receiver that "absorbs" alerts without transmitting them anywhere, useful for filtering non-relevant or purely technical alerts
Routing Example

Alert routing can be customized based on your team structure and needs. Here's a practical example:

Let's say your organization has an on-call team that needs to be notified immediately about urgent issues. You can route alerts to them when:

  • The alert comes from production or security environments
  • The issue requires immediate attention from the on-call team
        - matchers:
            - environment =~ "prod|security"
            - team = "oncall"
          receiver: "pagerduty"
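
This route assumes a matching receiver exists. A hedged sketch of what it could look like, using Alertmanager's PagerDuty integration (the routing key is a placeholder you would normally mount from a secret):

        receivers:
          - name: "pagerduty"
            pagerduty_configs:
              - routing_key: "<pagerduty-events-api-v2-integration-key>" # placeholder
                send_resolved: true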
  3. Custom Templates πŸ“

This configuration block defines a Slack receiver for AlertManager that uses Monzo templates. Monzo templates are a set of notification templates that allow formatting Slack alerts in an elegant and informative way.

    alertmanager:
      config:
        receivers:
          - name: "slack-monitoring"
            slack_configs:
              - channel: "#alerts"
                send_resolved: true
                title: '{{ template "slack.monzo.title" . }}'
                icon_emoji: '{{ template "slack.monzo.icon_emoji" . }}'
                color: '{{ template "slack.monzo.color" . }}'
                text: '{{ template "slack.monzo.text" . }}'
                actions:
                  - type: button
                    text: "Runbook :green_book:"
                    url: "{{ (index .Alerts 0).Annotations.runbook_url }}"
                  - type: button
                    text: "Query :mag:"
                    url: "{{ (index .Alerts 0).GeneratorURL }}"
                  - type: button
                    text: "Dashboard :grafana:"
                    url: "{{ (index .Alerts 0).Annotations.dashboard }}"
                  - type: button
                    text: "Silence :no_bell:"
                    url: '{{ template "__alert_silence_link" . }}'
                  - type: button
                    text: '{{ template "slack.monzo.link_button_text" . }}'
                    url: "{{ .CommonAnnotations.link_url }}"

The notification format shown below demonstrates how alerts can be enriched with interactive elements. Users can quickly access relevant information through action buttons that link to the Grafana dashboard πŸ“Š, view the associated runbook πŸ“š, or silence noisy alerts πŸ”• when needed.

Slack alert example

πŸ‘€ Visualizing and Interacting with Alerts

VictoriaMetrics and its ecosystem provide multiple interfaces for managing and viewing alerts. Here are the main options available:

Alertmanager: The Standard Solution

Alertmanager is the standard component that allows:

  • Viewing current alert state
  • Configuring notification routing
  • Managing silences (temporarily pausing alerts)
  • Consulting alert history
Alertmanager

VMUI: The Native VictoriaMetrics Interface

VMUI offers a simplified interface for:

  • Viewing active alerts
  • Visualizing alert rules
  • Displaying associated metrics
VMAlert

Grafana Alerting: A Complete Solution

Although we use Alertmanager for alert definition and routing, Grafana Alerting offers a complete alternative solution that allows:

  • Centralizing alert management
  • Viewing alerts in the context of dashboards
  • Configuring alert rules directly from the interface
  • Managing silences and notifications
Grafana Alerting
Choosing the Right Interface

The choice of interface depends on your specific needs:

  • Alertmanager is ideal for operational alert management
  • VMUI is perfect for a quick and simple view
  • Grafana Alerting is recommended if you want a solution integrated with your dashboards

🎯 Conclusion

Defining relevant alerts is a key element of any observability strategy. The VictoriaMetrics operator, with its Kubernetes custom resources like VMRule, greatly simplifies setting up an effective alerting system. Declarative configuration allows quickly defining complex alert rules while maintaining excellent code readability and maintainability.

However, the technical configuration of alerts, even with powerful tools like VictoriaMetrics, isn't sufficient on its own. An effective alerting strategy must integrate into a broader organizational framework:

  • Clear definition of on-call procedures
  • Identification of teams responsible for monitoring
  • Implementation of runbooks and incident response procedures
  • Adaptation of notification channels according to criticality and context
Going Further πŸš€

Discover how to integrate these alerts with other components of your observability stack in upcoming articles in this series, particularly correlation with logs and distributed tracing.
