Kubernetes v1.36 Adds Stable PSI Metrics for Better Resource Contention Visibility

Why this matters

Traditional resource monitoring in Kubernetes clusters often relies on simple CPU or memory utilization metrics. While useful, these metrics can miss the real story: how much time workloads spend stalled waiting for resources. A node might show moderate CPU use but still have workloads experiencing significant delays due to resource saturation. This disconnect leads to ineffective troubleshooting and inefficient resource allocation.

Pressure Stall Information, or PSI, addresses this gap by measuring the percentage of time tasks are stalled on CPU, memory, or I/O. PSI metrics quantify resource contention as time percentages over different windows, such as 10s, 60s, and 300s averages. This allows operators to distinguish between short-lived spikes and sustained pressure, offering a more nuanced view of cluster health.
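As a rough illustration of how those windows separate spikes from sustained pressure, the sketch below parses one line in the format the Linux kernel exposes under /proc/pressure (for example /proc/pressure/cpu). The PsiLine and classify names and the 10% threshold are hypothetical choices for this example, not part of Kubernetes or the kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PsiLine:
    kind: str       # "some" or "full"
    avg10: float    # percent of time stalled over the last 10 seconds
    avg60: float    # percent of time stalled over the last 60 seconds
    avg300: float   # percent of time stalled over the last 300 seconds
    total_us: int   # cumulative stall time in microseconds


def parse_psi_line(line: str) -> PsiLine:
    """Parse a line such as 'some avg10=1.23 avg60=0.50 avg300=0.10 total=12345'."""
    kind, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return PsiLine(
        kind=kind,
        avg10=float(fields["avg10"]),
        avg60=float(fields["avg60"]),
        avg300=float(fields["avg300"]),
        total_us=int(fields["total"]),
    )


def classify(psi: PsiLine, threshold: float = 10.0) -> str:
    """Hypothetical rule of thumb: compare the short and long windows."""
    if psi.avg300 >= threshold:
        return "sustained"   # pressure has persisted for minutes
    if psi.avg10 >= threshold:
        return "spike"       # high right now, but not over the long window
    return "calm"


line = "some avg10=12.50 avg60=3.10 avg300=0.80 total=4512345"
print(classify(parse_psi_line(line)))
```

Here the 10-second average is above the threshold while the 300-second average is not, so the sample reads as a short-lived spike rather than sustained pressure.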

Kubernetes v1.36's graduation of PSI metrics to general availability means organizations gain a stable, reliable interface to capture this high-fidelity data. With the ability to monitor resource pressure at the node, pod, and container levels, teams can make more informed decisions to optimize cloud spending, maintain application performance, and meet compliance requirements.

PSI metrics are especially valuable for teams running sensitive production workloads, such as SMBs and healthcare or professional services firms. They help prevent outages caused by hidden resource saturation that utilization metrics miss. By tracking actual stall time rather than utilization alone, teams can adjust capacity or tune workloads before users notice issues.

What usually goes wrong

Operators frequently rely on CPU and memory utilization alone to assess cluster health. However, these metrics often provide a false sense of security. A node might report 70% CPU use, which looks healthy, but if some workloads are stalled due to contention, the user experience degrades silently. This mismatch causes delays in identifying bottlenecks and can lead to reactive firefighting rather than proactive management.

Another common pitfall is alert fatigue caused by misleading metrics. Without PSI, alerts might fire on utilization thresholds that don't correlate with real performance issues. Conversely, some pressure conditions go unnoticed because utilization stays below thresholds despite significant task stalls. The result is a noisy monitoring environment that frustrates teams and wastes time.

Resource monitoring tools also sometimes struggle with overhead and scalability. Collecting detailed metrics can burden the system, especially in dense clusters with many pods. If enabling new telemetry causes performance degradation, teams hesitate to adopt it fully, missing out on critical insights.

In Kubernetes environments lacking pressure-aware metrics, capacity planning is often imprecise. Teams may overprovision to avoid issues or underprovision and suffer intermittent outages. Neither scenario is cost-effective or sustainable, particularly for SMBs juggling tight budgets and compliance.

A better Cloudain-style approach

Kubernetes v1.36 introduces PSI metrics with careful attention to performance impact and data fidelity. The integration between the Linux kernel's PSI feature and the kubelet is designed to keep overhead low. Performance tests show the kernel-level PSI bookkeeping consumes less than 3.2% of CPU capacity even under heavy load, while the kubelet's data collection remains below 6.3% CPU usage during metric aggregation bursts.

This efficient design means teams can enable PSI metrics without fearing disruption to existing workloads. The metrics expose cumulative stall times and moving averages to indicate both transient and sustained resource pressures, allowing teams to correlate stalls precisely with workload behavior. This clarity improves incident response and capacity decisions.
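To make the relationship between cumulative stall time and a percentage concrete, here is a minimal sketch that turns two readings of a cumulative stall counter into the share of the interval spent stalled. The function name and the counter-reset handling are assumptions made for illustration. In a Prometheus setup, a rate() query over the counter does the same job.

```python
def stall_percent(prev_total_s: float, curr_total_s: float, interval_s: float) -> float:
    """Percent of the interval spent stalled, from two cumulative readings in seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_total_s - prev_total_s
    if delta < 0:
        # Assumption: a lower reading means the counter restarted from zero
        # (for example after a container restart), so count from zero.
        delta = curr_total_s
    return min(100.0, 100.0 * delta / interval_s)


# A container accumulated 3 seconds of stall time during a 30 second scrape interval.
print(stall_percent(prev_total_s=12.0, curr_total_s=15.0, interval_s=30.0))
```

Comparing this value across consecutive scrape intervals is one way to line up stalls with deploys, traffic bursts, or batch jobs.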

The Kubelet also now detects whether the OS kernel supports PSI before emitting metrics, preventing false alerts from zero-valued data when PSI is unavailable. This refinement reduces noise in monitoring systems and ensures alerts are actionable.

With PSI metrics available at the container and pod granularity, operators can pinpoint troublesome workloads causing or suffering resource contention. This insight enables targeted optimization, such as tuning resource requests, adjusting quality of service tiers, or scheduling workloads more intelligently.

Embracing PSI metrics aligns with Cloudain’s philosophy of practical, architecture-aware cloud engineering. It provides technical teams with transparent, actionable data while respecting operational constraints and compliance demands. The stable release signals readiness for production use, encouraging adoption in sensitive environments where performance and reliability are paramount.

A simple next step

To start benefiting from PSI metrics, verify that your Kubernetes nodes run Linux kernel version 4.20 or later with cgroup v2 enabled and PSI support compiled in (CONFIG_PSI=y). Avoid boot parameters that disable PSI (e.g., psi=0). Kubernetes 1.36 and later enable the feature by default, removing the need to toggle experimental flags.
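The prerequisites above can be checked on a node with a short script. This is a sketch under stated assumptions: it treats the presence of /sys/fs/cgroup/cgroup.controllers as the sign of a cgroup v2 unified hierarchy, and it treats a readable /proc/pressure/cpu as the sign that PSI is compiled in and not disabled at boot. The function names and the root parameter are hypothetical helpers for this example.

```python
import os
import platform
import re


def kernel_at_least(release: str, minimum: tuple = (4, 20)) -> bool:
    """Compare the major.minor part of a kernel release string with a minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def psi_readable(path: str) -> bool:
    """True if the PSI file can be read and looks like PSI output."""
    try:
        with open(path) as handle:
            return handle.read().startswith("some")
    except OSError:
        # Missing file, or the kernel refused the read because PSI is disabled.
        return False


def psi_prerequisites(release: str = "", root: str = "/") -> dict:
    """Report each prerequisite separately so a failure is easy to locate."""
    release = release or platform.release()
    return {
        "kernel_4_20_or_later": kernel_at_least(release),
        "cgroup_v2": os.path.exists(os.path.join(root, "sys/fs/cgroup/cgroup.controllers")),
        "psi_enabled": psi_readable(os.path.join(root, "proc/pressure/cpu")),
    }


if __name__ == "__main__":
    for check, ok in psi_prerequisites().items():
        print(f"{check}: {'ok' if ok else 'MISSING'}")
```

Running it on each node image before an upgrade gives a quick readiness answer without touching the cluster itself.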

Once prerequisites are met, teams can scrape PSI metrics from the /metrics/cadvisor endpoint using Prometheus-compatible monitoring tools or query live data via the Summary API. This facilitates integration into existing observability pipelines without complex rework.
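For teams that want to inspect a scrape by hand before wiring up dashboards, the sketch below pulls pressure samples out of Prometheus text-format output. The metric names in EXAMPLE follow the container_pressure_* pattern used by cAdvisor, but the exact names, labels, and values shown here are illustrative assumptions; confirm them against a real response from your own /metrics/cadvisor endpoint.

```python
import re

# Matches lines like: name{label="value",...} 12.5 1700000000000
SAMPLE_RE = re.compile(r'^(container_pressure_\w+)\{(.*)\}\s+(\S+)(?:\s+\d+)?$')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Illustrative scrape output; values and label sets are made up for this example.
EXAMPLE = """\
# TYPE container_pressure_cpu_waiting_seconds_total counter
container_pressure_cpu_waiting_seconds_total{container="app",pod="web-0"} 12.5 1700000000000
container_pressure_memory_stalled_seconds_total{container="app",pod="web-0"} 0.75
container_cpu_usage_seconds_total{container="app",pod="web-0"} 431.2
"""


def psi_samples(text: str) -> list:
    """Return (metric name, labels dict, value) for every pressure sample in the text."""
    samples = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and non-PSI metrics are skipped
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


for name, labels, value in psi_samples(EXAMPLE):
    print(labels.get("pod"), labels.get("container"), name, value)
```

In production a Prometheus server or an existing client library would handle parsing; the point here is only to show which samples carry the pressure data.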

Administrators should ensure appropriate permissions when accessing the kubelet's HTTP API for live data. Requests proxied through the API server require RBAC access to the nodes/proxy subresource, which is a privileged permission. Security best practices dictate limiting this to trusted personnel or automation tools.

Careful interpretation of PSI data is key. Start by establishing baseline stall percentages during normal operation, then create alerts for deviations indicating emerging contention. Combine PSI metrics with traditional utilization and latency measures for a well-rounded monitoring strategy.
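One simple way to express "baseline, then alert on deviation" is sketched below. The three-sigma rule and the 1% floor are assumptions chosen for illustration, not recommended values; the floor only keeps a near-zero baseline from flagging trivial fluctuations.

```python
from statistics import mean, pstdev


def baseline(samples: list) -> tuple:
    """Mean and population standard deviation of stall percentages from normal operation."""
    return mean(samples), pstdev(samples)


def is_deviation(value: float, base_mean: float, base_std: float,
                 sigmas: float = 3.0, floor: float = 1.0) -> bool:
    """Flag a stall percentage that exceeds the baseline by a clear margin."""
    threshold = max(base_mean + sigmas * base_std, floor)
    return value > threshold


# Hypothetical avg60 readings (percent) collected during a quiet week.
normal = [0.5, 0.7, 0.4, 0.6, 0.8]
base_mean, base_std = baseline(normal)

print(is_deviation(0.9, base_mean, base_std))  # within normal variation
print(is_deviation(5.0, base_mean, base_std))  # emerging contention
```

The same comparison can be written as an alerting rule in whichever monitoring system already holds the utilization and latency signals, so that all three are evaluated together.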

Additionally, consider workload tuning guided by PSI insights. For example, if memory pressure stall times rise consistently on specific nodes, evaluate pod resource requests and limits or explore vertical scaling options. This proactive optimization can improve overall cluster efficiency and user experience.

How Cloudain can help

Cloudain specializes in cloud and platform engineering tailored to SMBs handling critical production workloads. Addressing resource contention challenges with Kubernetes PSI metrics aligns closely with Cloudain’s expertise in observability and cost-effective cloud architecture.

Cloudain can assist teams in assessing cluster readiness for PSI adoption, integrating PSI data into existing monitoring frameworks, and interpreting metrics to inform actionable improvements. This includes secure access setup, alert tuning, and capacity planning informed by real stall-time data.

For businesses balancing performance, compliance, and cloud spend, Cloudain offers pragmatic guidance to leverage Kubernetes v1.36 PSI features effectively. This helps reduce downtime risk and optimize infrastructure with confidence.

Engaging Cloudain means getting practical, clear advice on implementing sophisticated observability features like PSI metrics without overcomplicating operations or increasing overhead. Cloudain’s approach ensures technology decisions serve business goals reliably and transparently.


Cloudain

Unite your teams behind measurable transformation outcomes.
