Harness the Power of VictoriaMetrics and Grafana Operators for Metrics Management

Overview

How are our applications performing? 👁️

Once our application is deployed, it is essential to have indicators that help identify potential issues and track performance changes. Among these sources of information, metrics and logs play an essential role by providing valuable insights into the application's operation. Additionally, it is often useful to implement detailed tracing to accurately track all actions performed within the application.

In this series of blog posts, we will explore the various areas of application monitoring. The goal is to thoroughly analyze the state of our applications, in order to improve their availability and performance, while ensuring an optimal user experience.

This first article focuses on collecting and visualizing metrics. We will deploy a scalable and high-performance solution to forward these metrics to a reliable and durable storage system. Then, we will see how to visualize them for analysis purposes.

❓ What is a metric?

Definition

Before collecting this so-called "metric", let's first look at its definition and characteristics:
A metric is a measurable data point that helps track the status and performance of an application. These data points are typically collected at regular intervals, such as the number of requests, memory usage, or error rates.

When it comes to monitoring, it is hard to avoid hearing about Prometheus. This project has contributed to the emergence of a standard that defines how metrics are exposed, called OpenMetrics, which follows this format:

  • Time Series: A unique time series is the combination of the metric's name and its labels. For instance, request_total{code="200"} and request_total{code="500"} are considered two distinct time series.

  • Labels: Labels can be associated with a metric to provide more specific details. They are added after the metric's name using curly braces. Although optional, they are commonly used, especially in a Kubernetes context (pod, namespace, etc.).

  • Value: The value is the numeric data collected at a specific point in time for a given time series. Depending on the metric type, it represents a measured or counted value that tracks a metric's evolution over time.

  • Timestamp: Specifies when the data was collected (in epoch format to the millisecond). If not present, it is added when the metric is retrieved.
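
Putting these elements together, a scraped line combining a name, labels, a value, and a timestamp looks like the following (a made-up example with a hypothetical metric name and labels):

http_requests_total{code="200",method="GET"} 1027 1712345678000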

This full line is called a raw sample.

Watch out for cardinality!

The more labels you have, the more possible combinations of them exist, which leads to a combinatorial explosion in the number of time series. This total number of combinations is known as cardinality. For example, a metric with three labels that can take 10, 100, and 5 distinct values respectively can produce up to 10 × 100 × 5 = 5,000 time series. High cardinality can significantly affect performance, especially by increasing memory usage and storage demands.

High cardinality also occurs when new metrics are frequently created. This phenomenon, known as churn rate, reflects the rate at which metrics appear and disappear within a system. In the context of Kubernetes, where pods are regularly created and deleted, this churn rate can contribute to the rapid increase in cardinality.
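
As a side note, Prometheus-compatible TSDBs expose a status endpoint that reports, among other things, which metrics contribute the most time series. A quick way to check it (a sketch, assuming a Prometheus instance reachable on its default port and jq installed):

curl -s 'http://localhost:9090/api/v1/status/tsdb' | jq '.data.seriesCountByMetricName[:5]'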

A Glimpse of How Metrics Are Gathered

Now that we understand what a metric is, let's see how metrics are collected. Most modern solutions expose an endpoint that allows scraping metrics, meaning they are queried at regular intervals. For instance, using the Prometheus SDK, available in most programming languages, it's easy to expose such an endpoint in our applications.

It is worth noting that Prometheus generally uses a "Pull" model, where the server periodically queries targets to retrieve metrics via these exposed endpoints. This approach helps control the frequency of data collection and prevents overloading the systems.
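
To illustrate the pull model, here is a minimal scrape configuration in the Prometheus format; the job name and target are placeholders:

scrape_configs:
  - job_name: "my-app"            # hypothetical job name
    scrape_interval: 30s          # how often the target is pulled
    metrics_path: /metrics
    static_configs:
      - targets: ["my-app:8080"]  # hypothetical host:port exposing the metrics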

Let's take an example with an Nginx web server. The server is installed via the Helm chart with Prometheus support enabled. Here, the parameter metrics.enabled=true deploys a Prometheus exporter alongside Nginx and exposes its metrics endpoint.

helm install ogenki-nginx bitnami/nginx --set metrics.enabled=true

Then, we can retrieve a significant number of metrics with a simple HTTP call.

kubectl port-forward svc/ogenki-nginx metrics &
Forwarding from 127.0.0.1:9113 -> 9113

curl -s localhost:9113/metrics
...
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 257
...

The curl command was just an example. Generally speaking, metrics scraping is carried out by a system responsible for storing this data so it can later be used.

ℹ️ When using Prometheus, an additional component is required to Push metrics from applications: PushGateway.

In this article, I've chosen to introduce you to VictoriaMetrics.

✨ VictoriaMetrics: An Enhanced Prometheus?

Like Prometheus, VictoriaMetrics is a Time Series Database (TSDB). These databases are designed to track and store events that change over time. Although VictoriaMetrics appeared a few years after Prometheus, they share many similarities: both are open-source databases licensed under Apache 2.0, dedicated to handling time series data. VictoriaMetrics remains fully compatible with Prometheus, using the same metric format, OpenMetrics, and supporting the PromQL query language.

Both projects are also very active, with dynamic communities and regular contributions from various companies as you can see here.

Now, let's explore the key differences and reasons why one might choose VictoriaMetrics:

  • Efficient storage and compression: This is likely one of the major advantages, especially when dealing with large amounts of data or needing long-term retention. With Prometheus, an additional component like Thanos is needed for this purpose. VictoriaMetrics, on the other hand, has an optimized storage engine that batches and optimizes data before writing it to disk. Furthermore, it uses powerful compression algorithms, making disk space usage much more efficient compared to Prometheus.

  • Memory footprint: VictoriaMetrics is said to use up to 7 times less memory than a Prometheus-based solution. However, the available benchmarks are somewhat outdated, and Prometheus has since benefited from several memory optimizations.

  • MetricsQL: VictoriaMetrics extends the PromQL language with new functions. This language is also designed to be more performant, especially on large datasets.

  • Modular architecture: VictoriaMetrics can be deployed in two modes: "Single" or "Cluster". We'll explore this in more detail later in the article.

  • And much more...: The points above are the key reasons I highlighted, but there are others. VictoriaMetrics can also operate in Push mode, supports multitenancy, and offers additional features in the Enterprise version.

Case studies: what they say about it

On the VictoriaMetrics website, you'll find numerous testimonials and case studies from companies that have migrated from other systems (such as Thanos, InfluxDB, etc.). Some examples are particularly insightful, especially those from Roblox, Razorpay, and Criteo, which handle a very large volume of metrics.

🔎 A modular and scalable architecture

GitOps and Kubernetes Operators

The rest of this article is based on a set of configurations available in the Cloud Native Ref repository.
It makes use of several operators, notably those for VictoriaMetrics and Grafana.

The aim of this project is to quickly bootstrap a complete platform that follows best practices in terms of automation, monitoring, security, and more.
Comments and contributions are welcome 🙏

VictoriaMetrics can be deployed in various ways: The default mode is called Single, and as the name suggests, it involves deploying a single instance that handles read, write, and storage operations. It is recommended to start with this mode as it is optimized and meets most use cases, as explained in this section.

Single Mode

The deployment method chosen in this article makes use of the Helm chart victoria-metrics-k8s-stack, which configures multiple resources (VictoriaMetrics, Grafana, Alertmanager, some dashboards, etc.). Below is a snippet of a Flux configuration for the Single mode.

observability/base/victoria-metrics-k8s-stack/helmrelease-vmsingle.yaml

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: victoria-metrics-k8s-stack
  namespace: observability
spec:
  releaseName: victoria-metrics-k8s-stack
  chart:
    spec:
      chart: victoria-metrics-k8s-stack
      sourceRef:
        kind: HelmRepository
        name: victoria-metrics
        namespace: observability
      version: "0.25.15"
...
  values:
    vmsingle:
      spec:
        retentionPeriod: "1d" # Minimal retention, for tests only
        replicaCount: 1
        storage:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
        extraArgs:
          maxLabelsPerTimeseries: "50"

When all the Kubernetes manifests are applied, the resulting architecture looks like this:

  • 🔒 Private Access: Although not directly related to metric collection, I wanted to highlight how access to the various UIs is managed. I chose to use Gateway API, which I've been using for some time and have covered in previous blog posts. An alternative would be to use a VictoriaMetrics component, VMAuth, which can act as a proxy for authorization and routing HTTP requests, but I did not choose this option for now (a simplified example of such a route is shown after this list).

  • 👷 VMAgent: A very lightweight agent whose main function is to gather metrics and send them to a Prometheus-compatible database. Additionally, this agent can apply filters and transformations to metrics before forwarding them. If the destination is unavailable or there is insufficient memory, it can cache data on disk.
    VMAgent also has a web interface that lists the "Targets" being scraped.

  • 🔥 VMAlert & VMAlertManager: These components are responsible for sending notifications in case of issues (for instance when reaching a given threshold). I won't go into further detail here as this will be covered in a future article.

  • βš™οΈ VMsingle: This is the VictoriaMetrics database deployed as a single pod that handles all operations (reading, writing, and data persistence).

Once all pods are started, you can access the main VictoriaMetrics interface: VMUI. This UI provides access to a wide range of information, including the scraped metrics, the top queries, cardinality statistics, and much more.

High Availability

To ensure we never lose sight of what's happening with our applications, the monitoring platform must always remain up and running. All VictoriaMetrics components can be configured for high availability. Depending on the desired level of redundancy, several options are available.

A straightforward approach would be to send data to two Single instances, duplicating the data in two different locations. Additionally, these instances could be deployed in two different regions.

It's also recommended to deploy 2 VMAgents that scrape the same targets to ensure that no data is lost.
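
With the Helm chart used here, this can be as simple as increasing the number of VMAgent replicas in the values (a sketch, assuming the chart forwards this field to the VMAgent custom resource):

vmagent:
  spec:
    replicaCount: 2  # two agents scraping the same targets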

De-duplication setting

In such an architecture, since multiple VMAgents are sending data and scraping the same targets, we end up with duplicate metrics. The De-duplication feature in VictoriaMetrics ensures that only one version is retained when two raw samples are identical.
One parameter requires special attention: -dedup.minScrapeInterval. Only the most recent version is kept when identical raw samples are found within this time interval.

It is also recommended to:

  • Set this parameter to a value equal to the scrape_interval defined in the Prometheus configuration (see the sketch after this list).
  • Keep the scrape_interval value consistent across all scraped services.
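
For example, with a 30s scrape interval everywhere, the deduplication setting can be passed to the Single instance through the same extraArgs mechanism shown earlier (a sketch; adjust the value to your own scrape_interval):

vmsingle:
  spec:
    extraArgs:
      dedup.minScrapeInterval: "30s"  # same value as the targets' scrape_interval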

The diagram below shows one of the many possible combinations to ensure optimal availability.
⚠️ However, it's important to consider the additional costs, not only for storage and compute, but also for network transfers between zones/regions. Sometimes, having a solid backup and restore strategy is a smarter choice 😅.

Cluster Mode

As mentioned earlier, in most cases, the Single mode is more than sufficient. It has the advantage of being easy to maintain and, with vertical scaling, it can handle nearly all use cases. There is also a Cluster mode, but it is only relevant in two specific cases:

  • The need for multitenancy, for example, to isolate multiple teams or customers.
  • When the limits of vertical scaling are reached.

My configuration allows you to choose between either mode:

observability/base/victoria-metrics-k8s-stack/kustomization.yaml

resources:
...

  - vm-common-helm-values-configmap.yaml
  # Choose between single or cluster helm release

  # VM Single
  - helmrelease-vmsingle.yaml
  - httproute-vmsingle.yaml

  # VM Cluster
  # - helmrelease-vmcluster.yaml
  # - httproute-vmcluster.yaml

In this mode, the read, write, and storage functions are separated into three distinct deployments.

  • ✏️ VMInsert: Distributes the data across VMStorage instances using consistent hashing based on the time series (combination of the metric name and its labels).

  • 💾 VMStorage: Responsible for writing data to disk and returning the requested data to VMSelect.

  • 📖 VMSelect: For each query, it retrieves the data from the VMStorage instances.

The main benefit of this mode is the ability to adjust scaling according to needs. For example, if more write capacity is required, you can add more VMInsert replicas.
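
For instance, scaling each layer independently could look like this in the Helm values (a sketch; the replicaCount fields are assumed from the operator's VMCluster resource and should be checked against the chart you use):

vmcluster:
  spec:
    vminsert:
      replicaCount: 3   # absorb a higher ingestion rate
    vmselect:
      replicaCount: 2   # handle more concurrent queries
    vmstorage:
      replicaCount: 3   # spread the data across more nodes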

The initial parameter that ensures a minimum level of redundancy is replicationFactor set to 2.
Here is a snippet of the Helm values for the cluster mode.

observability/base/victoria-metrics-k8s-stack/helmrelease-vmcluster.yaml

    vmcluster:
      enabled: true
      spec:
        retentionPeriod: "10d"
        replicationFactor: 2
        vmstorage:
          storage:
            volumeClaimTemplate:
              storageClassName: "gp3"
              spec:
                resources:
                  requests:
                    storage: 10Gi
          resources:
            limits:
              cpu: "1"
              memory: 1500Mi
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: "app.kubernetes.io/name"
                        operator: In
                        values:
                          - "vmstorage"
                  topologyKey: "kubernetes.io/hostname"
          topologySpreadConstraints:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: vmstorage
              maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
        vmselect:
          storage:
            volumeClaimTemplate:
              storageClassName: "gp3"

ℹ️ It's worth noting that some of these parameters follow Kubernetes best practices, especially when using Karpenter: topologySpreadConstraints helps distribute pods across different zones, and podAntiAffinity ensures that two pods for the same service do not end up on the same node.

🛠️ Configuration

Alright, VictoriaMetrics is now deployed 👍. It's time to configure the monitoring for our applications, and for this, we'll rely on the Kubernetes operator pattern. In practice, this means declaring Custom Resources that will be consumed by the VictoriaMetrics Operator to configure and manage VictoriaMetrics.

The Helm chart we used doesn't directly deploy VictoriaMetrics, but instead primarily installs the operator. This operator is responsible for creating and managing custom resources such as VMSingle or VMCluster, which define how VictoriaMetrics is deployed and configured based on the needs.

The role of VMServiceScrape is to declare where to scrape metrics for a given service. It relies on Kubernetes labels to identify the proper service and port.

observability/base/victoria-metrics-k8s-stack/vmservicecrapes/karpenter.yaml

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      path: /metrics
  namespaceSelector:
    matchNames:
      - karpenter

We can verify that the parameters are correctly configured using kubectl.

kubectl get services -n karpenter --selector app.kubernetes.io/name=karpenter -o yaml | grep -A 4 ports
    ports:
    - name: http-metrics
      port: 8000
      protocol: TCP
      targetPort: http-metrics

Sometimes there is no service, in which case we can specify how to identify the pods directly using VMPodScrape.

observability/base/flux-config/observability/vmpodscrape.yaml

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
  name: flux-system
  namespace: flux-system
spec:
  namespaceSelector:
    matchNames:
      - flux-system
  selector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - helm-controller
          - source-controller
          - kustomize-controller
          - notification-controller
          - image-automation-controller
          - image-reflector-controller
  podMetricsEndpoints:
    - targetPort: http-prom

Not all of our applications are necessarily deployed on Kubernetes. The VMScrapeConfig resource in VictoriaMetrics allows the use of several "Service Discovery" methods. This resource offers flexibility in defining how to scrape targets via different discovery mechanisms, such as EC2 instances (AWS), cloud services, or other systems.
In the example below, we use the custom tag observability:node-exporter and apply label transformations, allowing us to collect metrics exposed by node-exporters installed on these instances.

observability/base/victoria-metrics-k8s-stack/vmscrapeconfigs/ec2.yaml

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMScrapeConfig
metadata:
  name: aws-ec2-node-exporter
  namespace: observability
spec:
  ec2SDConfigs:
    - region: ${region}
      port: 9100
      filters:
        - name: tag:observability:node-exporter
          values: ["true"]
  relabelConfigs:
    - action: replace
      source_labels: [__meta_ec2_tag_Name]
      target_label: ec2_name
    - action: replace
      source_labels: [__meta_ec2_tag_app]
      target_label: ec2_application
    - action: replace
      source_labels: [__meta_ec2_availability_zone]
      target_label: ec2_az
    - action: replace
      source_labels: [__meta_ec2_instance_id]
      target_label: ec2_id
    - action: replace
      source_labels: [__meta_ec2_region]
      target_label: ec2_region

ℹ️ If you were already using the Prometheus Operator, migrating to VictoriaMetrics is very simple because it is fully compatible with the CRDs defined by the Prometheus Operator.
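
For instance, an existing ServiceMonitor like the one below can be kept as-is; the VictoriaMetrics operator is able to convert such objects into its own resources (a simplified, hypothetical example; check the operator's documentation for how this conversion behaves in your setup):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app             # hypothetical application
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  endpoints:
    - port: http-metrics
      path: /metrics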

📈 Visualizing Metrics with the Grafana Operator

It's easy to guess what the Grafana Operator does: it uses Kubernetes resources to configure Grafana 😝. It allows you to deploy Grafana instances, add datasources, import dashboards from various sources (URL, JSON), organize them into folders, and more...
This offers an alternative to defining everything in the Helm chart or using configmaps, and in my opinion, provides better readability. In this example, I group all the resources related to monitoring Cilium.

tree infrastructure/base/cilium/
infrastructure/base/cilium/
├── grafana-dashboards.yaml
├── grafana-folder.yaml
├── httproute-hubble-ui.yaml
├── kustomization.yaml
├── vmrules.yaml
└── vmservicescrapes.yaml

Defining the Folder is super straightforward.

observability/base/infrastructure/cilium/grafana-folder.yaml

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaFolder
metadata:
  name: cilium
spec:
  allowCrossNamespaceImport: true
  instanceSelector:
    matchLabels:
      dashboards: "grafana"

Here is a Dashboard resource that fetches the configuration from an HTTP link. We can also use dashboards available from the Grafana website by specifying the appropriate ID, or simply provide the definition in JSON format.

observability/base/infrastructure/cilium/grafana-dashboards.yaml

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: cilium-cilium
spec:
  folderRef: "cilium"
  allowCrossNamespaceImport: true
  datasources:
    - inputName: "DS_PROMETHEUS"
      datasourceName: "VictoriaMetrics"
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  url: "https://raw.githubusercontent.com/cilium/cilium/main/install/kubernetes/cilium/files/cilium-agent/dashboards/cilium-dashboard.json"
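
As mentioned above, a dashboard published on the Grafana website can also be imported by its ID. A minimal sketch, assuming the operator's grafanaCom import field (the ID below is the public "Node Exporter Full" dashboard, used purely as an example):

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: node-exporter-full
spec:
  allowCrossNamespaceImport: true
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  grafanaCom:
    id: 1860  # dashboard ID on grafana.com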

Note that I chose not to use the Grafana Operator to deploy the instance, but to keep the one installed via the VictoriaMetrics Helm chart. Therefore, we have to tell the Grafana Operator where the credentials are so that it can apply changes to this instance.

observability/base/grafana-operator/grafana-victoriametrics.yaml

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana-victoriametrics
  labels:
    dashboards: "grafana"
spec:
  external:
    url: http://victoria-metrics-k8s-stack-grafana
    adminPassword:
      name: victoria-metrics-k8s-stack-grafana-admin
      key: admin-password
    adminUser:
      name: victoria-metrics-k8s-stack-grafana-admin
      key: admin-user
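
In the same spirit, the operator can also declare datasources. Here is a sketch of how the "VictoriaMetrics" datasource referenced by the dashboards could be defined (in this setup it is actually provisioned by the Helm chart, and the URL below is an assumption):

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: victoriametrics
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  datasource:
    name: VictoriaMetrics
    type: prometheus
    access: proxy
    url: http://vmsingle-victoria-metrics-k8s-stack:8429  # assumed Service name and port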

Finally, we can use Grafana and explore our various dashboards 🎉!

💭 Final Thoughts

Based on the various articles reviewed, one of the main reasons to migrate to or choose VictoriaMetrics is generally better performance. However, it's wise to remain cautious, as benchmark results depend on several factors and the specific goals in mind. This is why it's highly recommended to run your own tests. VictoriaMetrics provides a benchmarking tool that can be used on Prometheus-compatible TSDBs.

As you can see, today my preference is for VictoriaMetrics for metrics collection, as I appreciate the modular architecture with a variety of combinations depending on the evolving needs. However, a solution using the Prometheus Operator works perfectly fine in most cases and has the advantage of being governed by a foundation.

Additionally, it's important to note that some features are only available in the Enterprise version, such as downsampling, which is highly useful when wanting to retain a large amount of data over the long term.

In this article, we highlighted how easy it is to implement a solution that efficiently collects and visualizes metrics. This is done using the Kubernetes operator pattern, the "GitOps way", allowing the declaration of various resources through Custom Resources. For instance, a developer can easily include a VMServiceScrape and a VMRule in their manifests, thus embedding the observability culture within the application delivery processes.

Having metrics is great, but is it enough? We'll try to answer that in the upcoming articles...
