CVE-2025-40250: Linux mlx5 IRQ cleanup bug fix stops kernel crashes

A subtle but consequential Linux kernel bug in the Mellanox/NVIDIA mlx5 driver has been assigned CVE‑2025‑40250: when request_irq fails because IRQ vectors are exhausted, mlx5_irq_alloc could free the entire IRQ reverse map (rmap) rather than only the mapping that failed, potentially triggering general protection faults and kernel crashes. Upstream maintainers have fixed the behavior by changing the cleanup path to remove only the newly added IRQ glue on failure.

Background

The mlx5 driver (net/mlx5) supports Mellanox/NVIDIA network and InfiniBand devices and contains IRQ allocation logic that maps hardware IRQ vectors to internal “glue” objects used by the driver. Under normal conditions, the driver obtains an IRQ vector, binds a handler to it with request_irq, and records a corresponding mapping. If request_irq fails — for example because the system has exhausted available IRQ vectors — the driver must undo only the partial state that relates to the allocation that failed. The defective behavior observed in vulnerable kernels instead removed the entire rmap, invalidating mappings that other threads might still be using and prompting crashes such as a general protection fault in free_irq_cpu_rmap.
This is not an exotic attack surface: the failure mode is triggered by IRQ allocation exhaustion and specific driver configuration options (reports indicate the problem can appear when both fwctl and rds configurations are enabled). The practical consequences are primarily availability-focused — kernel oopses or crashes — rather than immediate remote code execution. Public vulnerability trackers and the Debian/OSV records reflect that characterization and link the CVE to a small upstream cleanup change.

Why this bug matters (short technical summary)

  • The bug arises during a failure path: when request_irq fails, the driver attempted to roll back, but its rollback logic was too broad and removed mappings it should not have touched.
  • Removing unrelated mappings can leave other code paths referencing freed objects — a classic recipe for race windows, invalid dereferences, and kernel panics.
  • The immediate symptom in reported logs is a failure to request IRQ (err = -28) followed by a general protection fault in free_irq_cpu_rmap or other IRQ/freeing code paths.
This combination — a failure during IRQ allocation followed by an overzealous cleanup routine — converts what should be a recoverable allocation failure into a system‑affecting kernel crash. The fix is surgical: only remove the specific IRQ glue added for the failed allocation rather than deleting the whole rmap. That conservative patch minimizes behavioral change while removing the crash trigger.

Technical anatomy: what went wrong, in developer terms

The actors: mlx5_irq_alloc, request_irq, rmap and free paths

  • mlx5_irq_alloc: the driver helper that prepares IRQ glue objects and calls request_irq to bind a handler to a vector.
  • request_irq: the kernel API that assigns an IRQ vector and registers the handler; it can fail for multiple reasons (including vector exhaustion).
  • rmap (IRQ mapping table): driver-managed mappings between allocated vectors and driver glue objects.
  • free_irq_cpu_rmap / other free paths: kernel mechanics used to remove IRQ mappings and free associated resources.
The vulnerable sequence (sketched in C after this list):
  1. mlx5_irq_alloc creates/updates mapping structures and attempts request_irq.
  2. request_irq fails (e.g., -28 / ENOSPC when IRQ vectors are exhausted).
  3. Cleanup code intended to revert that failed allocation instead removed the whole rmap rather than only the entry that was being added.
  4. Concurrent threads that still referenced other mappings now point to freed/invalid state; later accesses cause a GPF / oops.
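To make the cleanup‑scope error concrete, here is a short, illustrative C sketch of the pre‑fix error path. It is not the upstream mlx5 code: request_irq, irq_cpu_rmap_add and free_irq_cpu_rmap are real kernel APIs, but the function example_irq_alloc, the "example" handler name and the overall shape are invented for exposition — the point is what the failure branch tears down, not the exact code.

    #include <linux/interrupt.h>
    #include <linux/cpu_rmap.h>

    /* Illustrative sketch only -- not the upstream mlx5 code. */
    static int example_irq_alloc(struct cpu_rmap *rmap, unsigned int irqn,
                                 irq_handler_t handler, void *ctx)
    {
        int err;

        /* Record the new vector in the shared reverse map before binding a handler. */
        err = irq_cpu_rmap_add(rmap, irqn);
        if (err)
            return err;

        err = request_irq(irqn, handler, 0, "example", ctx);
        if (err) {
            /* Pre-fix behaviour: tear down the WHOLE reverse map. Entries that
             * other, still-live vectors rely on are invalidated too, so later
             * teardown or rmap users can dereference freed state and fault. */
            free_irq_cpu_rmap(rmap);
            return err; /* e.g. -28 (ENOSPC) when IRQ vectors are exhausted */
        }
        return 0;
    }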

The fix (what maintainers changed)

Upstream changed the cleanup path to ensure that, on a failed request_irq, the driver removes only the mapping that was just created for the failed allocation. This prevents accidental removal of valid mappings and preserves invariants other threads expect when they touch rmap entries. The commit(s) implementing this change are the canonical fix referenced by CVE trackers.
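In the same illustrative terms, the corrected pattern narrows the rollback to the single entry the failing call just added. Recent kernels provide irq_cpu_rmap_remove as the counterpart to irq_cpu_rmap_add; whether the upstream fix uses that helper or equivalent internal logic should be confirmed against the commit itself, so treat this as a sketch of the pattern rather than the literal diff. Only the failure branch changes relative to the sketch above:

        err = request_irq(irqn, handler, 0, "example", ctx);
        if (err) {
            /* Fixed pattern: undo only the glue added for THIS vector and
             * leave the rest of the shared reverse map untouched. */
            irq_cpu_rmap_remove(rmap, irqn);
            return err;
        }
        return 0;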

Cross‑verification and evidence

Independent vulnerability trackers and security databases record the same problem description and point to upstream stable commits as the remedy. The NVD entry summarizes the failure mode and the general protection fault symptom that appears in stack traces. Other aggregators and OSV/distribution trackers echo the same root cause and remediation guidance. These independent records provide corroboration that the defect and the intended code change are being tracked consistently across the ecosystem. Where traces or log excerpts are publicly shown, they typically include messages like “Failed to request irq. err = -28” preceding an oops, which matches the upstream description and the commit rationale. That alignment between log evidence, patch commentary, and CVE metadata increases confidence in the technical narrative.
Cautionary note: the canonical kernel commit(s) implementing the minimal cleanup change are the authoritative technical artifacts; where a vendor or distribution publishes a patched kernel package, verify the package changelog references the upstream stable commit ID(s) before treating the host as remediated. Public trackers frequently link to those commits for verification.

Affected systems and exposure model

  • Affected component: Linux kernel net/mlx5 driver (Mellanox/NVIDIA mlx5 cores used for modern NIC/InfiniBand devices).
  • Typical exposure vectors: local or operational — the bug is triggered when the kernel’s IRQ vector pool is exhausted for the device and the mlx5 driver hits the failed request_irq path; some configuration combinations (for example, fwctl + rds enabled) are called out in public descriptions.
  • Practical impact: availability-first — kernel oopses, potential panics, or hard crashes; not a straightforward unauthenticated remote RCE.
Who should care most:
  • High‑performance networking or InfiniBand clusters using mlx5 hardware.
  • Virtualization and cloud hosts that expose Mellanox NICs or rely on accelerated network stacks.
  • Vendors and embedded appliances that include vendor‑forked kernels with delayed backports.
  • Any multi‑tenant hosts where kernel crashes produce outsized operational impact.
Practical exposure depends on whether your kernel contains the vulnerable commit(s). Upstream stable commits were released to address the issue; distributions will backport those commits into vendor kernel packages — verify package changelogs or vendor advisories to map the fix to your kernel version.

Exploitability and risk analysis

  • Immediate exploitability: low for remote unauthenticated actors. The flaw is a robustness error that converts a local failure into a crash, not an obvious path to arbitrary code execution.
  • Realistic risk: local denial‑of‑service and potential for a crash during heavy allocation pressure or misconfiguration, which is significant for multi‑tenant and production hosts.
  • Chaining risk: as with many kernel faults, availability faults can be staging ground for more complex exploitation when combined with other primitives. Treat local kernel instability as a high‑value primitive for attackers with a foothold.
Public trackers and advisories do not document in‑the‑wild exploit chains for CVE‑2025‑40250 at time of publication; absence of a public PoC does not remove the operational urgency for patching in sensitive environments.

Detection and hunting guidance

Because the primary symptom is availability loss rather than data leakage, detection relies on kernel telemetry and crash traces:
  • Kernel logs to watch for:
    • “Failed to request irq. err = -28” or similar request_irq failure messages from mlx5_core.
    • General protection faults referencing free_irq_cpu_rmap or similar IRQ free paths in the backtrace.
    • Repeated mlx5-related oopses or panics correlated with IRQ allocation events.
  • Practical commands and triage steps:
    1. Inspect dmesg or journalctl -k for mlx5_core and mlx5_ib messages around times of instability.
    2. Correlate host-level IRQ usage and vector exhaustion indicators (platform-specific tooling and kernel reports).
    3. If available, capture vmcore or kernel crash dumps for forensic analysis to preserve the backtrace and timestamp.
  • Hunting signals:
    • Frequent kernel oopses following allocation attempts for NIC/InfiniBand vectors.
    • Service interruptions on hosts known to run Mellanox devices during times of high allocation pressure.
Operational teams should centralize kernel logs and store vmcore artifacts where possible; automatic reboot policies can destroy vital forensic evidence. Early collection of dmesg and crash dumps increases the chance of root-cause identification. This approach aligns with prior kernel CVE triage guidance for similar driver failures.
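As a worked example of the log review above, the following small user‑space C program scans a saved kernel log (for example, journalctl -k redirected to a file) for the two signature strings. The default file name klog.txt and the exact substrings are assumptions for illustration; adapt them to the messages your kernels actually emit.

    /* triage_mlx5_log.c -- minimal sketch: flag mlx5 IRQ-failure signatures in a
     * saved kernel log (e.g. `journalctl -k > klog.txt`). */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        /* Substrings taken from the symptoms described above; adjust as needed. */
        const char *needles[] = {
            "Failed to request irq",      /* request_irq failure from mlx5_core */
            "general protection fault",   /* subsequent GPF, e.g. in free_irq_cpu_rmap */
        };
        const char *path = argc > 1 ? argv[1] : "klog.txt";
        char line[4096];
        unsigned long lineno = 0, hits = 0;
        FILE *f = fopen(path, "r");

        if (!f) {
            perror(path);
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            lineno++;
            for (size_t i = 0; i < sizeof(needles) / sizeof(needles[0]); i++) {
                if (strstr(line, needles[i])) {
                    printf("line %lu: %s", lineno, line);
                    hits++;
                    break;
                }
            }
        }
        fclose(f);
        printf("%lu matching line(s)\n", hits);
        return hits ? 2 : 0; /* non-zero exit when signatures are present */
    }

A non‑zero exit status signals that at least one signature was found, which makes the check easy to wire into an existing monitoring or cron job.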

Remediation: patching, verification, and mitigation

Definitive remediation
  1. Apply a kernel package from your vendor or distribution that explicitly lists the upstream stable commit(s) addressing the mlx5 cleanup logic. Distributions will either ship a package that includes the stable commit or provide advisory mappings. Verify the package changelog references the upstream commit ID(s) before declaring a host remediated.
  2. Reboot into the patched kernel.
Staged rollout guidance
  1. Inventory: identify hosts with Mellanox devices (lspci, ethtool -i, uname -r, and kernel config checks).
  2. Pilot: deploy the patched kernel to a non‑production representative host and stress-test with workload patterns that previously triggered IRQ allocation or heavy device initialization.
  3. Roll out in waves: pilot → staging → production, monitoring kernel logs for two weeks after each wave for regressions.
Short‑term mitigations (if you cannot patch immediately)
  • Reduce IRQ vector pressure where possible (device configuration tuning, disable unused features that increase vector allocation).
  • Limit access to hosts with Mellanox devices or restrict untrusted workloads that may provoke heavy device initialization or firmware control paths.
  • If vendor images are patched but a full rollout isn’t possible, move critical tenants to patched hosts if feasible.
These are stopgaps and not substitutes for applying the upstream fix.
Validation checklist after patching
  • Confirm kernel package changelog or vendor advisory lists CVE‑2025‑40250 or the exact upstream commit IDs.
  • Reboot and reproduce previously observed failure traces (if you have a lab test); absence of oopses during the same workload is a strong signal of remediation.
  • Continue monitoring kernel logs for related messages for a monitoring window appropriate to your environment.

Operational prioritization (who should patch first)

  1. Multi‑tenant hypervisor and cloud compute nodes exposing Mellanox NICs (highest priority).
  2. InfiniBand/HPC clusters and RDMA-enabled storage hosts used in production.
  3. Virtualization hosts and appliances where an unexpected kernel panic could cascade through automation.
  4. Single‑tenant servers or developer workstations (lower, but still necessary, priority).
Kernel-level fixes carry reboot requirements; coordinate with platform owners and follow staged deployment practices that include rollback plans and validation windows.

Why the upstream approach is correct — strengths and remaining risks

Strengths
  • The upstream fix is minimal and targeted, addressing only the incorrect cleanup scope. Minimal changes reduce regression risk and make backporting into stable trees easier for distributions and vendors.
  • The fix preserves normal behavior while closing the crash-inducing path, which is the preferred pattern for kernel correctness patches.
Potential residual risks
  • Vendor‑forked kernels and embedded images may lag on backports; such devices remain at risk until vendors ship patched images.
  • Detection gaps: automated reboots and suppressed panic logs can hide incidents. Centralized telemetry and crash‑dump preservation are essential.
  • Chaining: while the patch closes the immediate crash, availability primitives in the kernel remain attractive for attackers with local access; prioritize patching in multi‑tenant environments.

Concrete incident playbook (for responders)

  1. If you observe an mlx5-related kernel oops or panic:
    • Preserve dmesg and vmcore, collect journalctl -k output, and snapshot /proc/interrupts and /proc/irq/* entries (a small snapshot helper is sketched after this list).
  2. Correlate crash time to device actions and workload traces (device probe, driver reloads, container/VM migrations, etc.).
  3. If remediation is not immediately available, isolate impacted hosts from production pools and move critical workloads to patched or alternative hosts.
  4. Apply the upstream or vendor patch, reboot, and validate using representative workloads that previously triggered the condition.
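For step 1, evidence preservation can be as simple as copying the relevant /proc files somewhere that survives a reboot before automation recycles the host. The sketch below copies /proc/interrupts to a timestamped file; the /var/tmp destination and file naming are arbitrary choices for illustration, and most teams will fold this into their existing collection tooling.

    /* snapshot_interrupts.c -- minimal sketch: preserve /proc/interrupts in a
     * timestamped file so IRQ state survives a crash-triggered reboot. */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        char dest[128];
        char buf[4096];
        size_t n;
        time_t now = time(NULL);
        struct tm *tm = localtime(&now);
        FILE *in, *out;

        /* /var/tmp is an arbitrary destination chosen for this example. */
        strftime(dest, sizeof(dest), "/var/tmp/interrupts-%Y%m%d-%H%M%S.txt", tm);

        in = fopen("/proc/interrupts", "r");
        if (!in) {
            perror("/proc/interrupts");
            return 1;
        }
        out = fopen(dest, "w");
        if (!out) {
            perror(dest);
            fclose(in);
            return 1;
        }
        while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
            fwrite(buf, 1, n, out);  /* straight byte-for-byte copy */
        fclose(in);
        fclose(out);
        printf("saved %s\n", dest);
        return 0;
    }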

Final assessment and recommendations

CVE‑2025‑40250 is a typical example of a small, correctness‑focused kernel vulnerability that has outsized operational impact when left unpatched — a single failed request_irq should never break other mappings or invalidate state used by concurrent threads. Upstream maintainers implemented a precise and minimal cleanup change to ensure failed IRQ allocations remove only the intended mapping, which preserves invariants and restores robust behavior. Recommendations:
  • Treat this CVE as a high priority for hosts that run Mellanox/NVIDIA mlx5 hardware in production, especially multi‑tenant, virtualization, or RDMA/HPC environments.
  • Verify vendor kernel packages reference the upstream stable commit(s) and perform a staged rollout with post‑patch monitoring.
  • Ensure kernel crash telemetry (dmesg, vmcore, centralized logging) is enabled so allocation failures and oopses are visible and preserved for triage.
Caveat: public trackers and vendor advisories are the authoritative mapping of which packaged kernel releases include the fix. Confirm your vendor or distribution’s security tracker or package changelog for the exact kernel version that fixes CVE‑2025‑40250 before closing your remediation ticket.
The kernel patch restores a key safety invariant: error paths must clean up only what they created. That simple rule prevents local allocation failures from cascading into system‑wide crashes, and it’s precisely the kind of minimal, high‑value fix kernel maintainers prefer to apply quickly to stable trees.
Source: MSRC Security Update Guide - Microsoft Security Response Center