
A subtle but consequential Linux kernel bug in the Mellanox/MLX5 driver has been assigned CVE‑2025‑40250. When request_irq failed because IRQ vectors were exhausted, mlx5_irq_alloc could free the entire IRQ mapping (rmap) rather than only the mapping it had just added, potentially triggering general protection faults and kernel crashes. Upstream maintainers fixed the behavior by changing the cleanup to remove only the newly added IRQ glue on failure.
Background
The mlx5 driver (net/mlx5) supports Mellanox/NVIDIA network and InfiniBand devices and contains IRQ allocation logic that maps hardware IRQ vectors to internal “glue” objects used by the driver. Under normal conditions, request_irq allocates an IRQ vector and the driver records a corresponding mapping. If request_irq fails (for example, because the system has exhausted available IRQ vectors), the driver must undo only the partial state that relates to the allocation that failed. The defective behavior observed in vulnerable kernels instead removed the entire rmap, invalidating mappings that other threads might still be using and prompting crashes such as a general protection fault in free_irq_cpu_rmap.
This is not an exotic attack surface: the failure mode is triggered by IRQ allocation exhaustion and specific driver configuration options (reports indicate the problem can appear when both fwctl and rds configurations are enabled). The practical attack consequences are primarily availability-focused (kernel oopses or crashes) rather than immediate remote code execution. Public vulnerability trackers and the Debian/OSV records reflect that characterization and link the CVE to a small upstream cleanup change.
Why this bug matters (short technical summary)
- The bug arises during a failure path: when request_irq fails, the driver attempted to roll back, but its rollback logic was too broad and removed mappings it should not have touched.
- Removing unrelated mappings can leave other code paths referencing freed objects — a classic recipe for race windows, invalid dereferences, and kernel panics.
- The immediate symptom in reported logs is a failure to request IRQ (err = -28) followed by a general protection fault in free_irq_cpu_rmap or other IRQ/freeing code paths.
Technical anatomy: what went wrong, in developer terms
The actors: mlx5_irq_alloc, request_irq, rmap and free paths
- mlx5_irq_alloc: the driver helper that prepares IRQ glue objects and calls request_irq to bind a handler to a vector.
- request_irq: the kernel API that assigns an IRQ vector and registers the handler; it can fail for multiple reasons (including vector exhaustion).
- rmap (IRQ mapping table): driver-managed mappings between allocated vectors and driver glue objects.
- free_irq_cpu_rmap / other free paths: kernel mechanics used to remove IRQ mappings and free associated resources.
The failure sequence, in order:
- mlx5_irq_alloc creates/updates mapping structures and attempts request_irq.
- request_irq fails (e.g., -28 / ENOSPC when IRQ vectors are exhausted).
- Cleanup code intended to revert that failed allocation instead removed the whole rmap rather than only the entry that was being added.
- Concurrent threads that still referenced other mappings now point to freed/invalid state; later accesses cause a GPF / oops.
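The difference between the over-broad rollback and a targeted one can be illustrated with a toy model. This is a minimal sketch in Python, not the driver's C code: `ToyRmap`, `toy_request_irq`, `alloc_irq`, and `run` are hypothetical names invented for illustration and do not correspond to kernel APIs. It only shows why freeing the whole map on one failed allocation destroys entries that belong to earlier, successful allocations.

```python
import errno


class ToyRmap:
    """Hypothetical stand-in for an IRQ reverse map (not the kernel's cpu_rmap)."""

    def __init__(self):
        self.glue = {}       # irq number -> glue object
        self.freed = False   # True once the whole map has been torn down

    def add(self, irq):
        self.glue[irq] = {"irq": irq}

    def remove(self, irq):
        # Targeted cleanup: drop only the entry for this irq.
        self.glue.pop(irq, None)

    def free_all(self):
        # Over-broad cleanup: every entry disappears, including valid ones.
        self.glue.clear()
        self.freed = True


def toy_request_irq(vectors_left):
    """Pretend vector allocation: fails with -ENOSPC when nothing is left."""
    return 0 if vectors_left > 0 else -errno.ENOSPC


def alloc_irq(rmap, irq, vectors_left, targeted_cleanup):
    """Add glue first, then request the IRQ; roll back on failure."""
    rmap.add(irq)
    err = toy_request_irq(vectors_left)
    if err:
        if targeted_cleanup:
            rmap.remove(irq)   # undo only what this call created
        else:
            rmap.free_all()    # the defective pattern: undo everything
    return err


def run(targeted_cleanup):
    rmap = ToyRmap()
    alloc_irq(rmap, 10, vectors_left=1, targeted_cleanup=targeted_cleanup)
    err = alloc_irq(rmap, 11, vectors_left=0, targeted_cleanup=targeted_cleanup)
    # Return the error, the surviving irq entries, and whether the map was freed.
    return err, sorted(rmap.glue), rmap.freed
```

With `targeted_cleanup=False`, the entry for IRQ 10 is lost even though its allocation succeeded; with `targeted_cleanup=True`, only the entry for the failed IRQ 11 is removed. In the real kernel the lost entries are freed memory that other threads may still dereference, which is what turns an allocation failure into a crash.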
The fix (what maintainers changed)
Upstream changed the cleanup path to ensure that, on a failed request_irq, the driver removes only the mapping that was just created for the failed allocation. This prevents accidental removal of valid mappings and preserves invariants other threads expect when they touch rmap entries. The commit(s) implementing this change are the canonical fix referenced by CVE trackers.
Cross‑verification and evidence
Independent vulnerability trackers and security databases record the same problem description and point to upstream stable commits as the remedy. The NVD entry summarizes the failure mode and the general protection fault symptom that appears in stack traces. Other aggregators and OSV/Distribution trackers echo the same root cause and remediation guidance. These independent records provide corroboration that the defect and the intended code change are being tracked consistently across the ecosystem.
Where traces or log excerpts are publicly shown, they typically include messages like “Failed to request irq. err = -28” preceding an oops, which matches the upstream description and the commit rationale. That alignment between log evidence, patch commentary, and CVE metadata increases confidence in the technical narrative.
Cautionary note: the canonical kernel commit(s) implementing the minimal cleanup change are the authoritative technical artefacts; where a vendor or distribution publishes a patched kernel package, verify the package changelog references the upstream stable commit ID(s) before treating the host as remediated. Public trackers frequently link to those commits for verification.
Affected systems and exposure model
- Affected component: Linux kernel net/mlx5 driver (Mellanox/NVIDIA mlx5 cores used for modern NIC/InfiniBand devices).
- Typical exposure vectors: local or operational — the bug is triggered when the kernel’s IRQ vector pool is exhausted for the device and the mlx5 driver hits the failed request_irq path; some configuration combinations (for example, fwctl + rds enabled) are called out in public descriptions.
- Practical impact: availability-first — kernel oopses, potential panics, or hard crashes; not a straightforward unauthenticated remote RCE.
Environments most likely to be affected:
- High‑performance networking or InfiniBand clusters using mlx5 hardware.
- Virtualization and cloud hosts that expose Mellanox NICs or rely on accelerated network stacks.
- Vendors and embedded appliances that include vendor‑forked kernels with delayed backports.
- Any multi‑tenant hosts where kernel crashes produce outsized operational impact.
Exploitability and risk analysis
- Immediate exploitability: low for remote unauthenticated actors. The flaw is a robustness error that converts a local failure into a crash, not an obvious path to arbitrary code execution.
- Realistic risk: local denial‑of‑service and potential for a crash during heavy allocation pressure or misconfiguration, which is significant for multi‑tenant and production hosts.
- Chaining risk: as with many kernel faults, availability faults can be a staging ground for more complex exploitation when combined with other primitives. Treat local kernel instability as a high‑value primitive for attackers with a foothold.
Detection and hunting guidance
Because the primary symptom is availability loss rather than data leakage, detection relies on kernel telemetry and crash traces:
- Kernel logs to watch for:
  - “Failed to request irq. err = -28” or similar request_irq failure messages from mlx5_core.
  - General protection faults referencing free_irq_cpu_rmap or similar IRQ free paths in the backtrace.
  - Repeated mlx5-related oopses or panics correlated with IRQ allocation events.
- Practical commands and triage steps:
  - Inspect dmesg or journalctl -k for mlx5_core and mlx5_ib messages around times of instability.
  - Correlate host-level IRQ usage and vector exhaustion indicators (platform-specific tooling and kernel reports).
  - If available, capture vmcore or kernel crash dumps for forensic analysis to preserve the backtrace and timestamp.
- Hunting signals:
  - Frequent kernel oopses following allocation attempts for NIC/InfiniBand vectors.
  - Service interruptions on hosts known to run Mellanox devices during times of high allocation pressure.
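The log signatures above lend themselves to a simple scanner. The following is a minimal sketch, assuming kernel log lines are already available as text (for example, saved output of `dmesg` or `journalctl -k`). The function names and the sample lines are illustrative assumptions; only the “Failed to request irq. err = -28” message and the free_irq_cpu_rmap symbol come from the public descriptions, and real log formatting may differ between kernel versions.

```python
import re

# Patterns based on the symptoms described in the text.
IRQ_FAIL = re.compile(r"Failed to request irq\. err = (-?\d+)")
GPF = re.compile(r"general protection fault", re.IGNORECASE)
RMAP_FRAME = re.compile(r"free_irq_cpu_rmap")


def scan_kernel_log(lines):
    """Collect 1-based line numbers of the three signatures of interest."""
    findings = {"irq_request_failures": [], "gpf_lines": [], "rmap_frames": []}
    for number, line in enumerate(lines, start=1):
        match = IRQ_FAIL.search(line)
        if match and "mlx5" in line:
            findings["irq_request_failures"].append((number, int(match.group(1))))
        if GPF.search(line):
            findings["gpf_lines"].append(number)
        if RMAP_FRAME.search(line):
            findings["rmap_frames"].append(number)
    return findings


def crash_follows_irq_failure(findings):
    """True when a GPF or rmap frame appears after the first IRQ request failure."""
    if not findings["irq_request_failures"]:
        return False
    first_failure = findings["irq_request_failures"][0][0]
    crash_lines = findings["gpf_lines"] + findings["rmap_frames"]
    return any(number > first_failure for number in crash_lines)


# Illustrative sample only; addresses and offsets are made up.
SAMPLE_LOG = [
    "mlx5_core 0000:02:00.0: Failed to request irq. err = -28",
    "general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] SMP",
    " free_irq_cpu_rmap+0x23/0x7d",
]
```

A match from `crash_follows_irq_failure` is a triage hint, not proof of this CVE: other bugs can produce similar traces, so the full backtrace and kernel version still need review.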
Remediation: patching, verification, and mitigation
Definitive remediation
- Apply a kernel package from your vendor or distribution that explicitly lists the upstream stable commit(s) addressing the mlx5 cleanup logic. Distributions will either ship a package that includes the stable commit or provide advisory mappings. Verify the package changelog references the upstream commit ID(s) before declaring a host remediated.
- Reboot into the patched kernel.
Recommended rollout
- Inventory: identify hosts with Mellanox devices (lspci, ethtool -i, uname -r, and kernel config checks).
- Pilot: deploy the patched kernel to a non‑production representative host and stress-test with workload patterns that previously triggered IRQ allocation or heavy device initialization.
- Roll out in waves: pilot → staging → production, monitoring kernel logs for two weeks after each wave for regressions.
Interim mitigations (if you cannot patch immediately)
- Reduce IRQ vector pressure where possible (device configuration tuning, disable unused features that increase vector allocation).
- Limit access to hosts with Mellanox devices or restrict untrusted workloads that may provoke heavy device initialization or firmware control paths.
- If vendor images are patched but a full rollout isn’t possible, move critical tenants to patched hosts if feasible. These are stopgaps and not substitutes for applying the upstream fix.
Verification
- Confirm kernel package changelog or vendor advisory lists CVE‑2025‑40250 or the exact upstream commit IDs.
- Reboot and reproduce previously observed failure traces (if you have a lab test); absence of oopses during the same workload is a strong signal of remediation.
- Continue monitoring kernel logs for related messages for a monitoring window appropriate to your environment.
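The changelog check can be partly automated. The sketch below assumes the changelog text has already been exported to a string (for instance, from `rpm -q --changelog` on RPM-based systems); `changelog_mentions_fix` is a hypothetical helper, and the commit IDs must be supplied by the reader from the vendor advisory or CVE tracker, since none are given here. It normalizes typographic hyphens because advisories copied from web pages often contain them.

```python
# Map common Unicode hyphen variants to ASCII "-" so copied text still matches.
_HYPHENS = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013"), "-")


def changelog_mentions_fix(changelog_text, cve_id="CVE-2025-40250", commit_ids=()):
    """Return True if the changelog names the CVE or any supplied commit ID.

    commit_ids: full upstream commit hashes taken from an advisory (assumption:
    the changelog quotes at least the first 12 hex characters of the hash).
    """
    text = changelog_text.translate(_HYPHENS).lower()
    if cve_id.lower() in text:
        return True
    for commit in commit_ids:
        if len(commit) >= 12 and commit.lower()[:12] in text:
            return True
    return False
```

A positive result only shows that the package claims to include the fix; booting the patched kernel and re-running the workload that previously failed remains the stronger signal.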
Operational prioritization (who should patch first)
- Multi‑tenant hypervisor and cloud compute nodes exposing Mellanox NICs (highest priority).
- InfiniBand/HPC clusters and RDMA-enabled storage hosts used in production.
- Virtualization hosts and appliances where an unexpected kernel panic could cascade through automation.
- Single‑tenant servers or developer workstations (lower, but still necessary, priority).
Why the upstream approach is correct — strengths and remaining risks
Strengths
- The upstream fix is minimal and targeted, addressing only the incorrect cleanup scope. Minimal changes reduce regression risk and make backporting into stable trees easier for distributions and vendors.
- The fix preserves normal behavior while closing the crash-inducing path, which is the preferred pattern for kernel correctness patches.
Remaining risks
- Vendor‑forked kernels and embedded images may lag on backports; such devices remain at risk until vendors ship patched images.
- Detection gaps: automated reboots and suppressed panic logs can hide incidents. Centralized telemetry and crash‑dump preservation are essential.
- Chaining: while the patch closes the immediate crash, availability primitives in the kernel remain attractive for attackers with local access; prioritize patching in multi‑tenant environments.
Concrete incident playbook (for responders)
- If you observe an mlx5-related kernel oops or panic:
  - Preserve dmesg and vmcore, collect journalctl -k output, and snapshot /proc/interrupts and /proc/irq/* entries.
  - Correlate crash time to device actions and workload traces (device probe, driver reloads, container/VM migrations, etc.).
  - If remediation is not immediately available, isolate impacted hosts from production pools and move critical workloads to patched or alternative hosts.
  - Apply the upstream or vendor patch, reboot, and validate using representative workloads that previously triggered the condition.
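For the evidence-preservation step, a small helper can copy /proc/interrupts into a timestamped file and count how many IRQ lines belong to mlx5. This is a minimal sketch: `count_mlx5_irqs` and `snapshot` are hypothetical names, the `evidence` directory is an arbitrary choice, and the sample text assumes that mlx5 interrupt names contain the string "mlx5", which should be checked against the actual host.

```python
import time
from pathlib import Path


def count_mlx5_irqs(interrupts_text):
    """Count numbered IRQ rows in /proc/interrupts-style text that mention mlx5."""
    count = 0
    for line in interrupts_text.splitlines():
        fields = line.split()
        # Numbered IRQ rows start with "<number>:"; skip the CPU header
        # and summary rows such as "NMI:" or "LOC:".
        if not fields or not fields[0].endswith(":"):
            continue
        if not fields[0][:-1].isdigit():
            continue
        if any("mlx5" in field for field in fields[1:]):
            count += 1
    return count


def snapshot(src="/proc/interrupts", dest_dir="evidence"):
    """Copy src into dest_dir under a timestamped name and return the new path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"interrupts-{time.strftime('%Y%m%dT%H%M%S')}.txt"
    target.write_text(Path(src).read_text())
    return target


# Illustrative sample only; real hosts will show different columns and names.
SAMPLE = """\
           CPU0       CPU1
  1:          9          0   IO-APIC   1-edge      i8042
 45:       1200        340   PCI-MSI 1048576-edge  mlx5_async0@pci:0000:02:00.0
 46:      88000      91000   PCI-MSI 1048577-edge  mlx5_comp0@pci:0000:02:00.0
NMI:          3          2   Non-maskable interrupts
"""
```

Taking the snapshot before any reboot matters because /proc/interrupts is rebuilt at boot; comparing counts across snapshots gives a rough view of vector pressure on the device.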
Final assessment and recommendations
CVE‑2025‑40250 is a typical example of a small, correctness‑focused kernel vulnerability that has outsized operational impact when left unpatched: a single failed request_irq should never break other mappings or invalidate state used by concurrent threads. Upstream maintainers implemented a precise and minimal cleanup change to ensure failed IRQ allocations remove only the intended mapping, which preserves invariants and restores robust behavior.
Recommendations:
- Treat this CVE as a high priority for hosts that run Mellanox/NVIDIA mlx5 hardware in production, especially multi‑tenant, virtualization, or RDMA/HPC environments.
- Verify vendor kernel packages reference the upstream stable commit(s) and perform a staged rollout with post‑patch monitoring.
- Ensure kernel crash telemetry (dmesg, vmcore, centralized logging) is enabled so allocation failures and oopses are visible and preserved for triage.
The kernel patch restores a key safety invariant: error paths must clean up only what they created. That simple rule prevents local allocation failures from cascading into system‑wide crashes, and it’s precisely the kind of minimal, high‑value fix kernel maintainers prefer to apply quickly to stable trees.
Source: MSRC Security Update Guide - Microsoft Security Response Center