Harness the Power of VictoriaMetrics and Grafana Operators for Metrics Management
Overview
Once our application is deployed, it is essential to have indicators that help identify potential issues and track performance changes. Among these sources of information, metrics and logs play a key role by providing valuable insights into the application's operation. Additionally, it is often useful to implement detailed tracing to accurately track all actions performed within the application.
In this series of blog posts, we will explore the various areas of application monitoring. The goal is to thoroughly analyze the state of our applications, in order to improve their availability and performance, while ensuring an optimal user experience.
This first article focuses on collecting and visualizing metrics. We will deploy a scalable and high-performance solution to forward these metrics to a reliable and durable storage system. Then, we will see how to visualize them for analysis purposes.
❓ What is a metric?
Definition
Before collecting this so-called "metric", let's first look at its definition and characteristics: a metric is a measurable data point that helps track the status and performance of an application. These data points are typically collected at regular intervals, such as the number of requests, memory usage, or error rates.
When it comes to monitoring, it is hard to avoid hearing about Prometheus. This project has contributed to the emergence of a standard that defines how metrics are exposed, called OpenMetrics, which follows this format:
- Time Series: a unique time series is the combination of the metric's name and its labels. For instance, request_total{code="200"} and request_total{code="500"} are considered two distinct time series.
- Labels: labels can be associated with a metric to provide more specific details. They are added after the metric's name using curly braces. Although optional, they are commonly used, especially in a Kubernetes context (pod, namespace, etc.).
- Value: the value is the numeric data collected at a specific point in time for a given time series. Depending on the metric type, it represents a measured or counted value that tracks a metric's evolution over time.
- Timestamp: specifies when the data was collected (in epoch format, to the millisecond). If not present, it is added when the metric is retrieved.

This full line is called a raw sample.
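To make this concrete, here is what a single raw sample could look like in the exposition format described above, with the metric name, its labels, the value, and an optional millisecond timestamp. The values are made up for the example:

```
request_total{code="200",path="/api"} 1027 1711000000000
```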
The more labels you have, the more possible combinations of them exist, which leads to an exponential increase in the number of time series. This total number of combinations is known as cardinality. High cardinality can significantly affect performance, especially by increasing memory usage and storage demands.
High cardinality also occurs when new metrics are frequently created. This phenomenon, known as churn rate, reflects the rate at which metrics appear and disappear within a system. In the context of Kubernetes, where pods are regularly created and deleted, this churn rate can contribute to the rapid increase in cardinality.
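As a purely illustrative back-of-the-envelope example, a single metric name with a code label (5 values), a path label (20 values), and a pod label (50 pods) can produce up to 5 x 20 x 50 = 5000 distinct time series, and every pod rotation creates new ones:

```
request_total{code="200", path="/", pod="nginx-5d4f7c-abc12"}
request_total{code="200", path="/", pod="nginx-5d4f7c-xyz98"}  # new pod => new time series
```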
A Glimpse of How Metrics Are Gathered
Now that we understand what a metric is, let's see how metrics are collected. Most modern solutions expose an endpoint that allows scraping metrics, meaning they are queried at regular intervals. For instance, using the Prometheus SDK, available in most programming languages, it's easy to expose a metrics endpoint from our applications.
It is worth noting that Prometheus generally uses a "Pull" model, where the server periodically queries targets to retrieve metrics via these exposed endpoints. This approach helps control the frequency of data collection and prevents overloading the systems.
Let's take an example with an Nginx web server. The server is installed via the Helm chart with Prometheus support enabled. Here, the parameter metrics.enabled=true adds a path that exposes the metrics endpoint.
```sh
helm install ogenki-nginx bitnami/nginx --set metrics.enabled=true
```
Then, we can retrieve a significant number of metrics with a simple HTTP call.
```console
kubectl port-forward svc/ogenki-nginx metrics &
Forwarding from 127.0.0.1:9113 -> 9113

curl -s localhost:9113/metrics
...
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 257
...
```
The curl command was just an example. Generally speaking, metrics scraping is carried out by a system responsible for storing this data so it can later be used.
ℹ️ When using Prometheus, an additional component is required to push metrics from applications: the PushGateway.

In this article, I've chosen to introduce you to VictoriaMetrics.
✨ VictoriaMetrics: An Enhanced Prometheus?
Like Prometheus, VictoriaMetrics is a Time Series Database (TSDB). These databases are designed to track and store events that change over time. Although VictoriaMetrics appeared a few years after Prometheus, they share many similarities: both are open-source databases licensed under Apache 2.0, dedicated to handling time series data. VictoriaMetrics remains fully compatible with Prometheus, using the same metric format, OpenMetrics, and supporting the PromQL query language.
Both projects are also very active, with dynamic communities and regular contributions from various companies as you can see here.
Now, let's explore the key differences and reasons why one might choose VictoriaMetrics:
Efficient storage and compression: This is likely one of the major advantages, especially when dealing with large amounts of data or needing long-term retention. With Prometheus, an additional component such as Thanos is needed for this purpose. VictoriaMetrics, on the other hand, has an optimized storage engine that batches and optimizes data before writing it to disk. Furthermore, it uses powerful compression algorithms, making disk space usage much more efficient than Prometheus.
Memory footprint: VictoriaMetrics is said to use up to 7 times less memory than a Prometheus-based solution. However, the available benchmarks are somewhat outdated, and Prometheus has since benefited from several memory optimizations.
MetricsQL: VictoriaMetrics extends the PromQL language with new functions. This language is also designed to be more performant, especially on large datasets.
Modular architecture: VictoriaMetrics can be deployed in two modes: "Single" or "Cluster". We'll explore this in more detail later in the article.
And much more...: The points above are the key reasons I highlighted, but there are others. VictoriaMetrics can also operate in Push mode (a quick example follows this list), supports multitenancy, and offers additional features available in the Enterprise version.
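To illustrate the Push mode, here is a minimal sketch that pushes a single sample in Prometheus exposition format to a VictoriaMetrics instance. The metric name is hypothetical, and the host and port (8428, the single-node default) are assumptions to adapt to your deployment:

```sh
# Push one raw sample to VictoriaMetrics using the Prometheus exposition format
# (hypothetical metric name; endpoint assumed reachable on localhost:8428)
curl -d 'ogenki_demo_requests_total{env="test"} 42' \
  http://localhost:8428/api/v1/import/prometheus
```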
On the VictoriaMetrics website, you'll find numerous testimonials and case studies from companies that have migrated from other systems (such as Thanos, InfluxDB, etc.). Some examples are particularly insightful, especially those from Roblox, Razorpay, and Criteo, which handle a very large volume of metrics.
🔎 A Modular and Scalable Architecture
The remainder of this article is based on a set of configurations available in the Cloud Native Ref repository. It makes use of several operators, notably those for VictoriaMetrics and Grafana. The aim of this project is to quickly bootstrap a complete platform that follows best practices in terms of automation, monitoring, security, and more. Comments and contributions are welcome!
VictoriaMetrics can be deployed in various ways. The default mode is called Single and, as the name suggests, it involves deploying a single instance that handles read, write, and storage operations. It is recommended to start with this mode as it is optimized and meets most use cases, as explained in this section.
Single Mode
The deployment method chosen in this article makes use of the Helm chart victoria-metrics-k8s-stack, which configures multiple resources (VictoriaMetrics, Grafana, Alertmanager, some dashboards, etc.). Below is a snippet of a Flux configuration for the Single mode.
observability/base/victoria-metrics-k8s-stack/helmrelease-vmsingle.yaml
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: victoria-metrics-k8s-stack
  namespace: observability
spec:
  releaseName: victoria-metrics-k8s-stack
  chart:
    spec:
      chart: victoria-metrics-k8s-stack
      sourceRef:
        kind: HelmRepository
        name: victoria-metrics
        namespace: observability
      version: "0.25.15"
...
  values:
    vmsingle:
      spec:
        retentionPeriod: "1d" # Minimal retention, for tests only
        replicaCount: 1
        storage:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
        extraArgs:
          maxLabelsPerTimeseries: "50"
```
When all the Kubernetes manifests are applied, the resulting architecture looks like this:
🔒 Private Access: Although not directly related to metric collection, I wanted to highlight how access to the various UIs is managed. I chose to use Gateway API, which I've been using for some time and have covered in previous blog posts. An alternative would be to use a VictoriaMetrics component, VMAuth, which can act as a proxy for authorization and routing of HTTP requests, but I did not choose this option for now.
👷 VMAgent: A very lightweight agent whose main function is to gather metrics and send them to a Prometheus-compatible database. Additionally, this agent can apply filters and transformations to metrics before forwarding them. If the destination is unavailable or there is insufficient memory, it can cache data on disk. VMAgent also has a web interface that lists the "Targets" being scraped.
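The chart deploys this VMAgent for us, but to make its role more concrete, here is a minimal sketch of a VMAgent custom resource. The resource name, external label, and remote write URL are illustrative and would differ in a real deployment:

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: example                 # hypothetical name
  namespace: observability
spec:
  selectAllByDefault: true      # pick up every VMServiceScrape/VMPodScrape in the cluster
  scrapeInterval: 30s
  externalLabels:
    cluster: my-cluster         # hypothetical label added to every forwarded metric
  remoteWrite:
    # Illustrative URL: a Prometheus-compatible remote write endpoint of the storage backend
    - url: http://vmsingle-example.observability.svc:8428/api/v1/write
```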
🔥 VMAlert & VMAlertManager: These components are responsible for sending notifications in case of issues (for instance, when a given threshold is reached). I won't go into further detail here, as this will be covered in a future article.
⚙️ VMsingle: This is the VictoriaMetrics database, deployed as a single pod that handles all operations (reading, writing, and data persistence).
Once all pods are started, you can access the main VictoriaMetrics interface: VMUI. This UI provides access to a wide range of information, including the scraped metrics, the top queries, cardinality statistics, and much more.
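As an example, a simple way to reach VMUI locally is a port-forward. The service name below is an assumption based on the chart's naming and may differ in your cluster:

```sh
# Forward the VMSingle service locally (service name is illustrative)
kubectl port-forward -n observability svc/vmsingle-victoria-metrics-k8s-stack 8428:8428 &

# VMUI should then be reachable at http://localhost:8428/vmui
```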
High Availability
To ensure we never lose sight of what's happening with our applications, the monitoring platform must always remain up and running. All VictoriaMetrics components can be configured for high availability. Depending on the desired level of redundancy, several options are available.
A straightforward approach would be to send data to two Single instances, duplicating the data in two different locations. Additionally, these instances could be deployed in two different regions.

It's also recommended to deploy two VMAgents that scrape the same targets to ensure that no data is lost.
In such an architecture, since multiple VMAgents are sending data and scraping the same targets, we end up with duplicate metrics. The de-duplication feature in VictoriaMetrics ensures that only one version is retained when two raw samples are identical. One parameter requires special attention: -dedup.minScrapeInterval. Only the most recent version is kept when identical raw samples are found within this time interval (a configuration sketch follows the recommendations below).
It is also recommended to:
- Set this parameter to a value equal to the scrape_interval defined in the Prometheus configuration.
- Keep the scrape_interval value consistent across all scraped services.
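With the victoria-metrics-k8s-stack chart used above, this flag can be passed through extraArgs, just like maxLabelsPerTimeseries earlier. A minimal sketch, assuming a 30s scrape interval (adjust the value to your own configuration):

```yaml
vmsingle:
  spec:
    extraArgs:
      # Keep only one raw sample per 30s window when identical samples are received
      # (should match the scrape_interval used by the VMAgents)
      dedup.minScrapeInterval: "30s"
```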
The diagram below shows one of the many possible combinations to ensure optimal availability.

⚠️ However, it's important to consider the additional costs, not only for storage and compute, but also for network transfers between zones/regions. Sometimes, having a solid backup and restore strategy is a smarter choice.
Cluster Mode
As mentioned earlier, in most cases the Single mode is more than sufficient. It has the advantage of being easy to maintain and, with vertical scaling, it can handle nearly all use cases. There is also a Cluster mode, but it is only relevant in two specific cases:
- The need for multitenancy, for example, to isolate multiple teams or customers.
- When the limits of vertical scaling are reached.
My configuration allows you to choose between either mode:
observability/base/victoria-metrics-k8s-stack/kustomization.yaml
```yaml
resources:
...
  - vm-common-helm-values-configmap.yaml
  # Choose between single or cluster helm release

  # VM Single
  - helmrelease-vmsingle.yaml
  - httproute-vmsingle.yaml

  # VM Cluster
  # - helmrelease-vmcluster.yaml
  # - httproute-vmcluster.yaml
```
In this mode, the read, write, and storage functions are separated into three distinct deployments.
✍️ VMInsert: Distributes the data across VMStorage instances using consistent hashing based on the time series (combination of the metric name and its labels).
💾 VMStorage: Responsible for writing data to disk and returning the requested data to VMSelect.
🔍 VMSelect: For each query, it retrieves the data from the VMStorage instances.
The main benefit of this mode is the ability to adjust scaling according to needs. For example, if more write capacity is required, you can add more VMInsert replicas.
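As a sketch (field names taken from the VMCluster resource, replica counts purely illustrative), each component can be scaled independently in the Helm values:

```yaml
vmcluster:
  spec:
    vminsert:
      replicaCount: 3   # scale the write path
    vmselect:
      replicaCount: 2   # scale the read path
    vmstorage:
      replicaCount: 2   # each replica stores a share of the data
```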
The initial parameter that ensures a minimum level of redundancy is replicationFactor, set to 2. Here is a snippet of the Helm values for the Cluster mode.
observability/base/victoria-metrics-k8s-stack/helmrelease-vmcluster.yaml
```yaml
vmcluster:
  enabled: true
  spec:
    retentionPeriod: "10d"
    replicationFactor: 2
    vmstorage:
      storage:
        volumeClaimTemplate:
          storageClassName: "gp3"
          spec:
            resources:
              requests:
                storage: 10Gi
      resources:
        limits:
          cpu: "1"
          memory: 1500Mi
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app.kubernetes.io/name"
                    operator: In
                    values:
                      - "vmstorage"
              topologyKey: "kubernetes.io/hostname"
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: vmstorage
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
    vmselect:
      storage:
        volumeClaimTemplate:
          storageClassName: "gp3"
```
ℹ️ It's worth noting that some of these parameters follow Kubernetes best practices, especially when using Karpenter: topologySpreadConstraints helps distribute pods across different zones, and podAntiAffinity ensures that two pods for the same service do not end up on the same node.
🛠️ Configuration
Alright, VictoriaMetrics is now deployed. It's time to configure the monitoring for our applications, and for this we'll rely on the Kubernetes operator pattern. In practice, this means declaring Custom Resources that will be consumed by the VictoriaMetrics Operator to configure and manage VictoriaMetrics.
The Helm chart we used doesn't directly deploy VictoriaMetrics, but instead primarily installs the operator. This operator is responsible for creating and managing custom resources such as VMSingle or VMCluster, which define how VictoriaMetrics is deployed and configured as needed.
The role of VMServiceScrape is to declare where to scrape metrics for a given service. It relies on Kubernetes labels to identify the proper service and port.
observability/base/victoria-metrics-k8s-stack/vmservicecrapes/karpenter.yaml
```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      path: /metrics
  namespaceSelector:
    matchNames:
      - karpenter
```
We can verify that the parameters are correctly configured using kubectl.
```console
kubectl get services -n karpenter --selector app.kubernetes.io/name=karpenter -o yaml | grep -A 4 ports
    ports:
    - name: http-metrics
      port: 8000
      protocol: TCP
      targetPort: http-metrics
```
Sometimes there is no service, in which case we can specify how to identify the pods directly using VMPodScrape.
observability/base/flux-config/observability/vmpodscrape.yaml
```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
  name: flux-system
  namespace: flux-system
spec:
  namespaceSelector:
    matchNames:
      - flux-system
  selector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - helm-controller
          - source-controller
          - kustomize-controller
          - notification-controller
          - image-automation-controller
          - image-reflector-controller
  podMetricsEndpoints:
    - targetPort: http-prom
```
Not all of our applications are necessarily deployed on Kubernetes. The VMScrapeConfig resource in VictoriaMetrics allows the use of several "Service Discovery" methods. This resource offers flexibility in defining how to scrape targets via different discovery mechanisms, such as EC2 instances (AWS), cloud services, or other systems. In the example below, we use the custom tag observability:node-exporter and apply label transformations, allowing us to collect metrics exposed by the node-exporters installed on these instances.
observability/base/victoria-metrics-k8s-stack/vmscrapeconfigs/ec2.yaml
```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMScrapeConfig
metadata:
  name: aws-ec2-node-exporter
  namespace: observability
spec:
  ec2SDConfigs:
    - region: ${region}
      port: 9100
      filters:
        - name: tag:observability:node-exporter
          values: ["true"]
  relabelConfigs:
    - action: replace
      source_labels: [__meta_ec2_tag_Name]
      target_label: ec2_name
    - action: replace
      source_labels: [__meta_ec2_tag_app]
      target_label: ec2_application
    - action: replace
      source_labels: [__meta_ec2_availability_zone]
      target_label: ec2_az
    - action: replace
      source_labels: [__meta_ec2_instance_id]
      target_label: ec2_id
    - action: replace
      source_labels: [__meta_ec2_region]
      target_label: ec2_region
```
ℹ️ If you were already using the Prometheus Operator, migrating to VictoriaMetrics is very simple, as the VictoriaMetrics operator is fully compatible with the CRDs defined by the Prometheus Operator.
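For instance, depending on how the conversion is configured in the operator, an existing Prometheus Operator object such as the ServiceMonitor below can be picked up and translated into its VictoriaMetrics equivalent (here, the VMServiceScrape shown earlier). Treat this as an illustrative sketch rather than the project's actual migration path:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      path: /metrics
```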
📈 Visualizing Metrics with the Grafana Operator
It's easy to guess what the Grafana Operator does: it uses Kubernetes resources to configure Grafana. It allows you to deploy Grafana instances, add datasources, import dashboards from various sources (URL, JSON), organize them into folders, and more. This offers an alternative to defining everything in the Helm chart or using ConfigMaps, and in my opinion provides better readability. In this example, I group all the resources related to monitoring Cilium.
```console
tree infrastructure/base/cilium/
infrastructure/base/cilium/
├── grafana-dashboards.yaml
├── grafana-folder.yaml
├── httproute-hubble-ui.yaml
├── kustomization.yaml
├── vmrules.yaml
└── vmservicescrapes.yaml
```
Defining the Folder is super straightforward.
observability/base/infrastructure/cilium/grafana-folder.yaml
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaFolder
metadata:
  name: cilium
spec:
  allowCrossNamespaceImport: true
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
```
Here is a Dashboard resource that fetches the configuration from an HTTP link. We can also use dashboards available from the Grafana website by specifying the appropriate ID (an example follows the snippet below), or simply provide the definition in JSON format.
observability/base/infrastructure/cilium/grafana-dashboards.yaml
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: cilium-cilium
spec:
  folderRef: "cilium"
  allowCrossNamespaceImport: true
  datasources:
    - inputName: "DS_PROMETHEUS"
      datasourceName: "VictoriaMetrics"
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  url: "https://raw.githubusercontent.com/cilium/cilium/main/install/kubernetes/cilium/files/cilium-agent/dashboards/cilium-dashboard.json"
```
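For comparison, here is what a dashboard imported by its grafana.com ID could look like. The resource name and dashboard ID are illustrative choices, not part of the project's configuration:

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: node-exporter-full        # hypothetical resource name
spec:
  allowCrossNamespaceImport: true
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  grafanaCom:
    id: 1860                      # "Node Exporter Full" dashboard ID on grafana.com (illustrative)
```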
Note that I chose not to use the Grafana Operator to deploy the instance, but to keep the one installed via the VictoriaMetrics Helm chart. Therefore, we have to tell the Grafana Operator where the credentials are so it can apply changes to this instance.
observability/base/grafana-operator/grafana-victoriametrics.yaml
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana-victoriametrics
  labels:
    dashboards: "grafana"
spec:
  external:
    url: http://victoria-metrics-k8s-stack-grafana
    adminPassword:
      name: victoria-metrics-k8s-stack-grafana-admin
      key: admin-password
    adminUser:
      name: victoria-metrics-k8s-stack-grafana-admin
      key: admin-user
```
Finally, we can use Grafana and explore our various dashboards!
💭 Final Thoughts
Based on the various articles reviewed, one of the main reasons to migrate to or choose VictoriaMetrics is generally better performance. However, it's wise to remain cautious, as benchmark results depend on several factors and the specific goals in mind. This is why it's highly recommended to run your own tests. VictoriaMetrics provides a benchmarking tool that can be used on Prometheus-compatible TSDBs.
As you can see, today my preference is for VictoriaMetrics for metrics collection, as I appreciate the modular architecture with a variety of combinations depending on the evolving needs. However, a solution using the Prometheus Operator works perfectly fine in most cases and has the advantage of being governed by a foundation.
Additionally, it's important to note that some features are only available in the Enterprise version, such as downsampling, which is highly useful when wanting to retain a large amount of data over the long term.
In this article, we highlighted the ease of implementation to achieve a solution that efficiently collects and visualizes metrics. This is done using the Kubernetes operator pattern, the "GitOps way", allowing the declaration of various resources through Custom Resources. For instance, a developer can easily include a VMServiceScrape and a VMRule in their manifests (a hypothetical example is sketched below), thus embedding the observability culture within the application delivery process.
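As an illustration of this workflow, a VMRule shipped alongside an application's manifests could look like the sketch below. The namespace, metric name, and threshold are made up for the example, and alerting itself will be covered in a future article:

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: my-app                   # hypothetical application
  namespace: my-app
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppHighErrorRate
          # Hypothetical expression and threshold
          expr: rate(request_total{code="500"}[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High 5xx rate for my-app"
```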
Having metrics is great, but is it enough? We'll try to answer that in the upcoming articles...