PostgreSQL: From Metrics to Query Plan Analysis
Overview
Basic observability isn't optional when running PostgreSQL in production. Whether you're tracking the Golden Signals, monitoring slow queries, or analyzing connection patterns, doing so often means juggling multiple tools, configuring various exporters, and manually extracting queries for analysis.
❓ What if we could achieve complete production-grade monitoring with minimal configuration?
In this article, we'll explore how to enhance CloudNativePG's already robust monitoring with simple, effective, and easy-to-use query performance analysis capabilities—leveraging the power of tools at our disposal: Vector, VictoriaMetrics, and VictoriaLogs.
📊 Included with CloudNativePG
When you deploy a PostgreSQL cluster with CNPG, you get a wealth of metrics and comprehensive Grafana dashboards out of the box. The operator exposes metrics via a dedicated endpoint on each PostgreSQL instance with the following information:
- Database Operations: Transaction rates, queries per second, tuple statistics
- Replication Status: Lag, streaming state, synchronization metrics
- Resource Utilization: Connections, cache hit ratio, buffer statistics
- System Health: Instance status, failover events, backup states
> The examples in this article come from configurations available in the Cloud Native Ref repository. It leverages several operators, including CloudNativePG for PostgreSQL management, VictoriaMetrics for metrics collection, and VictoriaLogs for log collection. This project aims to quickly bootstrap a complete platform that follows best practices in automation, monitoring, security, and more. Comments and contributions are welcome 🙏
Collecting Metrics with VictoriaMetrics
CloudNativePG automatically exposes metrics on each PostgreSQL pod. To enable their collection, simply activate monitoring in the Helm chart:
```yaml
# CloudNativePG Helm chart
monitoring:
  podMonitorEnabled: true
```
This simple configuration creates a PodMonitor (Prometheus Operator resource) that is automatically converted by the VictoriaMetrics operator into a compatible native resource. Metrics from all PostgreSQL pods (primary and replicas) are thus collected and available in VictoriaMetrics.
The VictoriaMetrics operator automatically converts Prometheus Operator resources (PodMonitor, ServiceMonitor, etc.) into their VictoriaMetrics equivalents. This transparent conversion allows using CloudNativePG without modification, while benefiting from VictoriaMetrics as the storage backend.
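For reference, the converted resource might look roughly like the sketch below; the cluster name, namespace, and port name are illustrative assumptions based on CNPG's defaults:

```yaml
# Hypothetical VMPodScrape produced from the CNPG PodMonitor
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
  name: myapp-postgres            # assumed cluster name
  namespace: myapp                # assumed namespace
spec:
  selector:
    matchLabels:
      cnpg.io/cluster: myapp-postgres
  podMetricsEndpoints:
    - port: metrics               # CNPG exposes Prometheus metrics on this port (9187)
```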
Essential Metrics to Monitor
CloudNativePG exposes metrics that align perfectly with the Golden Signals methodology discussed in previous articles:
Latency ⏳
```promql
# Average query duration
rate(cnpg_backends_total_seconds_sum[5m]) / rate(cnpg_backends_total_seconds_count[5m])
```
Traffic 📶
```promql
# Transactions per second
rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m])
```
Errors ❌
```promql
# Connection failures and deadlocks
rate(pg_stat_database_deadlocks[5m])
rate(cnpg_pg_postmaster_start_time_seconds[5m])
```
Saturation 📈
```promql
# Connection pool usage
cnpg_backends_total / cnpg_pg_settings_max_connections

# Cache hit ratio (should be > 95%)
sum(rate(pg_stat_database_blks_hit[5m])) /
  (sum(rate(pg_stat_database_blks_hit[5m])) + sum(rate(pg_stat_database_blks_read[5m])))
```
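These expressions can also feed alerting rules. Since the VictoriaMetrics operator converts Prometheus Operator resources, a PrometheusRule works as-is; the rule below is a hedged sketch, with the threshold, duration, and labels chosen for illustration only:

```yaml
# Illustrative saturation alert built on the queries above
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-saturation
  namespace: observability        # assumed namespace
spec:
  groups:
    - name: cloudnative-pg
      rules:
        - alert: PostgreSQLConnectionsNearLimit
          expr: cnpg_backends_total / cnpg_pg_settings_max_connections > 0.8
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PostgreSQL connections above 80% of max_connections"
```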
Visualization with Grafana
Using the Grafana Operator explored in previous articles, we can deploy CNPG dashboards declaratively:
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: databases-cloudnative-pg
spec:
  allowCrossNamespaceImport: true
  folderRef: "databases"
  datasources:
    - inputName: "DS_PROMETHEUS"
      datasourceName: "VictoriaMetrics"
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  url: "https://grafana.com/api/dashboards/20417/revisions/4/download"
```
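The folderRef above assumes a GrafanaFolder named "databases" already exists; a minimal sketch with the Grafana Operator could look like this:

```yaml
# Assumed companion resource for the folderRef used above
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaFolder
metadata:
  name: databases
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  title: "databases"
```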
The dashboard provides comprehensive views of our PostgreSQL clusters, including replication lag, query performance, and resource utilization.

🔍 Understanding Query Performance
While metrics give us the "what" and "when," they don't always tell us the "why." This is where query performance analysis becomes essential. Knowing that queries are slow is useful; understanding why they're slow often leads us to optimization opportunities.
Traditional PostgreSQL query analysis requires manually running EXPLAIN and EXPLAIN ANALYZE commands. While powerful, this approach has limitations:
- Reactive: You only analyze queries you suspect are problematic
- Manual: Requires active investigation by DBAs
- Point-in-time: Captures current execution plan, not historical trends
- Incomplete: Difficult to correlate with production load patterns
Ideally, we need automatic and continuous query plan capture that allows us to:
- Automatically identify slow queries
- Track execution plan changes over time
- Correlate query performance with system metrics
- Debug performance issues without manually reproducing them
This is exactly what some managed solutions offer. But can we achieve the same on Kubernetes?
✨ Query Plan History: Implementation with Open Source Tools
The good news is that we can build a sophisticated query performance monitoring system that rivals commercial offerings with minimal configuration.
The Architecture

Our solution leverages two PostgreSQL extensions and a configuration parameter, integrating them with the VictoriaMetrics ecosystem:
PostgreSQL Extensions:
- pg_stat_statements: Aggregates query execution statistics
- auto_explain: Automatically captures execution plans for slow queries
PostgreSQL Configuration:
- compute_query_id: Parameter that generates unique identifiers for query correlation
Observability Stack:
- VictoriaMetrics: Stores query metrics from pg_stat_statements
- VictoriaLogs: Stores execution plans with query correlation
- Vector: Parses PostgreSQL logs and extracts execution plans
- Grafana: Visualizes performance data and enables plan history exploration
The essential link between all these elements is the correlation between metrics and logs using the query identifier. This allows us to:
- See that a query is slow (from metrics)
- Click to view its execution plan history (from logs)
- Identify plan changes that caused performance regressions
I've called this feature "Performance Insights". Any resemblance to an existing solution would be purely coincidental 😆.
Enabling Performance Insights
Thanks to CloudNativePG's "Managed Extensions" feature (available since v1.23), enabling comprehensive query monitoring is remarkably simple.
🏗️ Platform Engineering: The Right Level of Abstraction
One of the key principles of platform engineering is providing the right level of abstraction to application developers. They shouldn't need to understand PostgreSQL internals or memorize 15+ PostgreSQL-specific configuration parameters.
This is where Crossplane compositions excel. In the Cloud Native Ref project, we use Crossplane with KCL (Kubernetes Configuration Language) to create a higher-level abstraction called SQLInstance.
Without Composition (Raw CNPG Cluster):
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: myapp-postgres
spec:
  instances: 3
  postgresql:
    shared_preload_libraries:
      - pg_stat_statements
      - auto_explain
    parameters:
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
      pg_stat_statements.track_utility: "on"
      pg_stat_statements.track_planning: "on"
      pg_stat_statements.save: "on"
      auto_explain.log_format: json
      auto_explain.log_min_duration: "1000"
      auto_explain.log_analyze: "on"
      auto_explain.log_buffers: "on"
      auto_explain.log_timing: "off"
      auto_explain.log_triggers: "on"
      auto_explain.log_verbose: "on"
      auto_explain.log_nested_statements: "on"
      auto_explain.sample_rate: "0.2"
      compute_query_id: "on"
      track_activity_query_size: "2048"
      track_io_timing: "on"
      log_min_duration_statement: "1000"
      # ... and more
```
With Composition (Platform Engineering Approach):
```yaml
apiVersion: cloud.ogenki.io/v1alpha1
kind: App
metadata:
  name: myapp
spec:
  sqlInstance:
    enabled: true
    size: small
    storageSize: 20Gi
    instances: 3
    performanceInsights:
      enabled: true
      explain:
        sampleRate: 0.2      # 20% sampling (default: safe for production)
        minDuration: 1000    # Log queries > 1 second (default)
        logStatement: none   # Optional: none (default) / ddl / mod / all
```
The Crossplane composition SQLInstance handles all the complexity.
This composition approach provides several benefits:
- Developer Experience: Application developers don't need PostgreSQL expertise
- Consistency: Performance monitoring is configured uniformly across all databases
- Maintainability: Platform team controls monitoring configuration centrally
- Scalability: Easy to update monitoring parameters for all instances
- Discoverability: Developers can browse available options (e.g. `performanceInsights.enabled: true`) rather than memorizing parameter names
The best abstractions hide complexity without limiting power. Developers get performance insights with simple parameters, while the platform team retains the ability to fine-tune the underlying PostgreSQL configuration for advanced use cases.
Understanding the Configuration
Let's break down what each component does:
pg_stat_statements: This extension tracks execution statistics for all SQL statements executed by a server. It records:
- Total execution time and number of calls
- Rows processed and returned
- Buffer hits and reads
- Query planning time
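For the metrics side of the correlation described later, these statistics can be surfaced as Prometheus metrics through CNPG's user-defined metrics queries (a ConfigMap referenced by spec.monitoring.customQueriesConfigMap). The sketch below is illustrative; the ConfigMap name, selected columns, and limits are assumptions, not the exact configuration used in the repository:

```yaml
# Hypothetical custom-queries ConfigMap exporting pg_stat_statements
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgresql-monitoring-queries
  namespace: myapp
data:
  custom-queries: |
    pg_stat_statements:
      query: |
        SELECT queryid, calls, total_exec_time, mean_exec_time, rows
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 20
      metrics:
        - queryid:
            usage: "LABEL"
            description: "Query identifier (matches auto_explain logs)"
        - calls:
            usage: "COUNTER"
            description: "Number of executions"
        - total_exec_time:
            usage: "COUNTER"
            description: "Total execution time in milliseconds"
        - mean_exec_time:
            usage: "GAUGE"
            description: "Mean execution time in milliseconds"
        - rows:
            usage: "COUNTER"
            description: "Total rows retrieved or affected"
```

The important detail is that queryid is exported as a label, which is what makes the correlation with VictoriaLogs possible.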
auto_explain: Automatically logs execution plans for queries exceeding a duration threshold. Key parameters include:
- `log_format: json`: Structured output for parsing
- `log_min_duration: 1000`: Capture queries taking more than 1 second (default)
- `log_analyze: on`: Include actual row counts (ANALYZE data from the actual execution)
- `sample_rate: 0.2`: Sample 20% of slow queries to reduce overhead (default)
compute_query_id: The correlation key that ties everything together. This generates a unique identifier for each query that appears in both pg_stat_statements metrics and auto_explain logs.
By default, the composition uses production-safe values:
- `sampleRate: 0.2` → 20% sampling of slow queries
- `minDuration: 1000ms` → Capture queries taking more than 1 second
For debugging, increase these values:
- `sampleRate: 1.0` → 100% of slow queries
- `minDuration: 0` → All queries, even the fastest ones
Vector Log Pipeline Configuration
Here's what Vector does in practice: it transforms a raw PostgreSQL auto_explain log entry into an indexable event.
(Side-by-side comparison: raw CloudNativePG log vs. the parsed event after Vector processing.)
3-Step Pipeline
The Vector pipeline consists of 3 transforms and 2 sinks (the complete configuration is available in the Cloud Native Ref repository):
1. Parse CloudNativePG JSON logs
```vrl
if .kubernetes.container_name == "postgres" && exists(.kubernetes.pod_labels."cnpg.io/cluster") {
  .log = parse_json!(.message)
}
```
2. Filter for execution plans
```vrl
exists(.log.record.message) && contains(.log.record.message, "plan:")
```
3. Extract metadata and plan
```vrl
.query_id = to_string!(.log.record.query_id)  # Correlation key
.cluster_name = .kubernetes.pod_labels."cnpg.io/cluster"
.database = .log.record.database_name
# Parse plan JSON from "duration: X ms plan: {...}"
.plan_json = parse_json!(split(string!(.log.record.message), "plan:")[1])
```
Parsed events are sent to two sinks (a sketch of the VictoriaLogs sink follows below):
- Successful plans → VictoriaLogs, indexed on `cluster_name`, `namespace`, `database`, `query_id`
- Parse failures → a separate stream for debugging
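One way to ship the parsed events is Vector's elasticsearch sink pointed at VictoriaLogs' Elasticsearch-compatible ingestion endpoint. The snippet below is only a sketch: the input name, service URL, and field mapping are assumptions:

```yaml
# Sketch of a Vector sink towards VictoriaLogs
sinks:
  victorialogs_plans:
    type: elasticsearch
    inputs: ["extract_plan_metadata"]   # hypothetical name of the transform from step 3
    endpoints:
      - "http://victorialogs.observability.svc:9428/insert/elasticsearch/"  # assumed service URL
    mode: bulk
    api_version: v8
    healthcheck:
      enabled: false
    query:
      _msg_field: "message"
      _time_field: "timestamp"
      _stream_fields: "cluster_name,namespace,database,query_id"
```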
The Key: Correlation via query_id
The critical element is query_id which appears in both systems:
- VictoriaMetrics: `pg_stat_statements{queryid="8765432109876543210"}` (metrics)
- VictoriaLogs: `{query_id="8765432109876543210"}` (plans)
This correlation allows instant jumping from a performance metric to execution plan history.
🔬 Analyzing Query Performance in Action
Once we've identified a problematic query, we can view its execution plan history in VictoriaLogs. Using the query_id from pg_stat_statements, we can query VictoriaLogs:
```logsql
# Find all execution plans for a specific query
{cluster_name="myapp-postgres", query_id="1234567890"} | limit 50
```
This shows us:
- All captured execution plans for this query over time
- Plan variations (e.g., index scans vs. sequential scans)
- Actual row counts and execution times
- Buffer usage and I/O statistics
Understanding EXPLAIN Output
When auto_explain captures a plan, it provides detailed information:
```json
{
  "Query Text": "SELECT * FROM users WHERE email = ?",
  "Query Identifier": 1234567890,
  "Duration": 1567.234,
  "Plan": {
    "Node Type": "Seq Scan",
    "Relation Name": "users",
    "Actual Rows": 1,
    "Actual Loops": 1,
    "Actual Total Time": 1567.123,
    "Shared Hit Blocks": 0,
    "Shared Read Blocks": 54321
  }
}
```
Key insights from this plan:
- Sequential Scan: Scanning the entire table instead of using an index
- High block reads: 54,321 blocks read from disk (poor cache usage)
- Single row returned: Despite scanning the entire table
This immediately suggests the need for an index on the email column.
🎨 Visualizing Execution Plans with pev2
Understanding complex execution plans from logs can be challenging. This is where pev2 (PostgreSQL Explain Visualizer 2) becomes very useful. It's a web tool that transforms JSON execution plans into interactive, visual diagrams.
```yaml
apiVersion: cloud.ogenki.io/v1alpha1
kind: App
metadata:
  name: xplane-pev2
  namespace: tooling
spec:
  image:
    repository: ghcr.io/smana/pev2
    tag: "v1.17.0"

  resources:
    requests:
      cpu: "10m"        # Minimal CPU for static content
      memory: "32Mi"    # Small memory footprint
    limits:
      cpu: "300m"       # Cap to prevent runaway
      memory: "128Mi"   # Limit memory usage

  # Accessible only via Tailscale VPN at: https://pev2.priv.cloud.ogenki.io
  route:
    enabled: true
    hostname: "pev2"    # Results in: pev2.priv.cloud.ogenki.io
```
To ensure sensitive query data never leaves the network, pev2 is self-hosted in the cluster via the App composition. This once again demonstrates the platform abstraction level: deploying a static web tool uses the same declarative API as a complete application with a database.
Analyzing with Grafana
Grafana integration allows quick identification of problematic queries and navigation to their execution plans.
Performance Analysis Dashboard

This dashboard displays key metrics from pg_stat_statements: top queries by total duration, average latency, number of calls. Each query_id is clickable to explore details.
Correlation Dashboard

This dashboard correlates metrics (VictoriaMetrics) with execution plans (VictoriaLogs) for a specific query. It shows performance evolution and plan changes over time.
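These dashboards assume a VictoriaLogs datasource is registered in Grafana. With the Grafana Operator, that could be declared roughly as follows; the plugin type and URL are assumptions:

```yaml
# Hypothetical VictoriaLogs datasource managed by the Grafana Operator
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: victorialogs
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  datasource:
    name: VictoriaLogs
    type: victoriametrics-logs-datasource           # assumed plugin ID
    access: proxy
    url: http://victorialogs.observability.svc:9428 # assumed service URL
```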
Workflow: From Grafana to pev2
The video below shows the complete investigation workflow: from identifying a slow query in Grafana to visual plan analysis with pev2.
Workflow steps:
1. Identify the slow query in the Grafana dashboard (`pg_stat_statements` metrics)
2. Click on the query_id to view plan history in VictoriaLogs
3. Copy the JSON plan and open pev2 (https://pev2.priv.cloud.ogenki.io)
4. Paste the plan (Ctrl+V) to visualize the execution
Leveraging pev2

pev2 transforms JSON plans into interactive diagrams that instantly reveal:
- Bottlenecks: Larger nodes = higher execution time (orange/yellow badges = warnings)
- Planner estimates: Discrepancies between estimated and actual rows (e.g., "under estimated by 443×" visible on the main Hash Join)
- Inefficient sequential scans: Indicators on large tables suggesting missing indexes
- Join strategies: Hash Join, Nested Loop, Merge Join with their respective costs
- I/O patterns: Ratios of blocks read from disk vs. cache (buffer hits)
The interactive interface allows clicking on each node to view details (cost, timing, row count, buffers). Warning badges immediately signal potential issues (wrong estimates, inefficient scans).
For performance regressions, VictoriaLogs allows comparing plans before/after by filtering by time period (_time:[...]), revealing changes in PostgreSQL planner strategy.
💭 Final Thoughts
We've built in this article a complete PostgreSQL performance analysis system that combines metrics (pg_stat_statements), execution plans (auto_explain), and visualization (pev2). The key to this approach lies in correlation via query_id: from a Grafana dashboard showing a slow query, a few clicks are enough to navigate to its execution plan visualized in pev2, enabling performance analysis and optimization.
This is, once again, a demonstration of the power of available open source tools. CloudNativePG with added extensions, VictoriaMetrics and VictoriaLogs efficiently store metrics and logs, Vector parses and structures data, and Grafana offers unified visualization. This Kubernetes-native approach is portable and gives complete control.
The abstraction provided by Crossplane further amplifies this ease. Thanks to the App and SQLInstance compositions, enabling Performance Insights boils down to performanceInsights.enabled: true with a few tuning parameters (sampleRate, minDuration). Developers don't need to understand PostgreSQL internals or Vector—the platform masks the complexity. This same declarative API deploys both a complete database and a static web tool like pev2, demonstrating the consistency of the abstraction level.
The cloud-native-ref project brings all these pieces together and shows how Gateway API, Tailscale, Crossplane/KCL, and the VictoriaMetrics ecosystem assemble to create a complete observability platform.
Enabling Performance Insights involves a measured overhead of 3-4% CPU and ~200-250MB memory with default values:
- `pg_stat_statements`: ~1% CPU, 50-100MB RAM
- `auto_explain` (sample_rate=0.2, log_timing=off): ~1% CPU, 50-100MB RAM
- Vector parsing: <1% CPU, ~128MB RAM
The default values (sampleRate: 0.2, minDuration: 1000) are suitable for production. Adjust according to your needs:
- High-traffic production: `sampleRate: 0.1` (10%) + `minDuration: 3000` (>3s), reducing overhead to ~2-3% CPU (sketched below)
- Debugging/Development: `sampleRate: 1.0` (100%) + `minDuration: 0` + `log_timing: on`, roughly 5-7% CPU overhead for maximum capture
- Standard production: default values (`sampleRate: 0.2`, `minDuration: 1000`), an excellent balance at 3-4% CPU
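Applied through the App composition shown earlier, the high-traffic profile would look roughly like this (only the relevant fields are shown):

```yaml
# Sketch: high-traffic tuning via the SQLInstance abstraction
spec:
  sqlInstance:
    performanceInsights:
      enabled: true
      explain:
        sampleRate: 0.1    # 10% sampling to reduce overhead
        minDuration: 3000  # only log queries slower than 3 seconds
```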
The sampleRate, log_timing, and logStatement parameters allow fine-tuning performance impact.
🔖 References
CloudNativePG Documentation
PostgreSQL Extensions and Features
- pg_stat_statements - Query statistics extension
- auto_explain - Automatic execution plan logging
- compute_query_id - Query identifier generation
- EXPLAIN Documentation - Understanding query plans
VictoriaMetrics Ecosystem
Query Plan Visualization
- pev2 (PostgreSQL Explain Visualizer 2) - Interactive execution plan visualization
- pev2 Docker Image - Self-hosted deployment
Configuration and Implementation
- Vector VRL Configuration - PostgreSQL log parsing pipeline
- CloudNativePG Composition - SQLInstance abstraction with KCL
- PostgreSQL Monitoring Architecture - Complete architecture documentation
