Linux NVMe-FC CVE-2025-40261: Fix for I/O error workqueue race

A subtle ordering bug in the Linux kernel's NVMe over Fibre Channel (nvme‑fc) driver has been assigned CVE‑2025‑40261 and fixed by a small but critical change: ensure the work item used to report I/O errors (->ioerr_work) is cancelled only after the controller's transport association is fully torn down. The defect can permit a workqueue task to run against a freed controller object, triggering list corruption and kernel oopses; the practical result is an unstable system that can crash under specific I/O error conditions. This article explains what happened, why the fix works, who is affected, how to detect the problem, and practical mitigation and remediation strategies for production environments that use NVMe‑FC storage.

Background​

NVMe over Fibre Channel (nvme‑fc) integrates the NVMe storage protocol with Fibre Channel transports. That code lives inside the Linux kernel and must coordinate asynchronous events: I/O completion callbacks, transport events from the Fibre Channel stack, and deferred work queued to kernel workqueues. When the kernel needs to report an I/O error or clean up a controller, it frequently does so via workqueue tasks (work_struct / delayed_work). Proper ordering and synchronization around these tasks is essential — canceling, flushing and waiting for outstanding work is a routine but brittle operation if done in the wrong sequence.
The bug at the root of CVE‑2025‑40261 is a race between two parts of the driver:
  • nvme_fc_delete_association — the routine that tears down the transport "association" and waits for pending I/O to complete; and
  • ->ioerr_work — a workqueue item used to handle I/O error reporting and cleanup.
If the driver calls cancel_work_sync too early, and then nvme_fc_delete_association later causes ->ioerr_work to be queued (because an in‑flight I/O encountered an error during association deletion), the work can run after cancel_work_sync returned but while the nvme_fc_ctrl object is being freed. Running a work handler against freed memory corrupts kernel lists, resulting in list_del failures, lib/list_debug assertions and a kernel oops — that is, an immediate crash or panic. The practical symptom observed was list_del corruption and kernel BUG messages originating in list_debug.c during a kworker execution context.

What the patch changes (technical overview)​

The fix is intentionally small and surgical: the call to cancel_work_sync is moved so it executes after nvme_fc_delete_association completes. Conceptually this guarantees that the association teardown — and any I/O error reporting that may queue ->ioerr_work as part of that teardown — has already happened before the driver finalizes cancellation of the work item.
Why this matters:
  • cancel_work_sync waits for any already-queued instance of the work to finish, but its guarantee covers only work queued before the call returns; it cannot stop the same work item from being queued again afterwards.
  • If another path can queue the same work after cancel_work_sync returns, the guarantee fails and the work may run while the object has been freed.
  • Moving cancel_work_sync after the blocking association deletion removes the window where a new work item could be scheduled after cancellation returned.
This is a canonical ordering/race fix: ensure teardown that can produce new work happens before the cancellation and final freeing.
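The ordering argument above can be made concrete with a tiny, purely illustrative simulation — this is shell, not kernel code, and the function names are stand-ins that mirror the kernel concepts (queueing, draining, teardown, freeing), nothing more:

```shell
#!/bin/sh
# Illustrative model only: "work" is a flag, "freeing" is a flag, and a
# use-after-free is detected rather than crashing. Names are ours.
queued=0; freed=0; uaf=0

queue_ioerr_work()   { queued=1; }        # e.g. an I/O error hit during teardown
delete_association() { queue_ioerr_work; } # teardown's error path may queue work
run_worker() {
  # The kworker executes whatever is queued; touching a freed object is the bug.
  if [ "$queued" = 1 ]; then
    queued=0
    if [ "$freed" = 1 ]; then uaf=1; fi
  fi
}
cancel_work_sync() { run_worker; }        # drains work pending *now* only

buggy_order() {
  queued=0; freed=0; uaf=0
  cancel_work_sync      # cancel first: nothing queued yet, so a no-op
  delete_association    # ...but teardown queues ioerr_work afterwards
  freed=1               # controller object freed
  run_worker            # queued work now runs against freed memory
  echo "$uaf"
}
fixed_order() {
  queued=0; freed=0; uaf=0
  delete_association    # teardown (and any work it queues) happens first
  cancel_work_sync      # then cancellation drains it while the object is live
  freed=1
  run_worker            # nothing left to run
  echo "$uaf"
}
```

Running `buggy_order` reports a use-after-free (prints 1); `fixed_order` does not (prints 0) — the same reordering the patch applies.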

Impact: who is at risk and what it looks like in practice​

  • Affected component: the NVMe over Fibre Channel (nvme‑fc) driver within the upstream Linux kernel. Systems that rely on the kernel's nvme‑fc implementation are potentially impacted.
  • Practical impact: denial of service — kernel oops, crash, or panic stemming from list corruption caused by a workqueue handler accessing freed memory. In the field this manifests as unpredictable server reboots or hung systems that require manual recovery.
  • Likelihood of remote exploitation: there is no public indication this is remotely exploitable as a typical network‑style vulnerability. The condition requires NVMe‑FC I/O activity and error conditions during controller/association teardown. Systems not running NVMe‑FC, or systems where the driver module is not loaded, are not affected.
  • Common scenarios that can trigger the bug:
    • Active storage I/O workloads that hit transport failures or device errors while a controller association is being deleted.
    • Host‑initiated controller removal under I/O stress.
    • Unmanaged module unloads in the presence of concurrent I/O activity.
  • Severity in operational terms: a crash on a production host carrying storage (and therefore critical services) can lead to data unavailability, failover events, or application restarts. While the bug is not a remote arbitrary‑code exploit, its ability to crash a host running I/O workloads elevates it to a significant operational risk for storage servers and hypervisors using NVMe‑FC.

How the bug manifests (log messages and symptoms)​

Operators should watch for the following log patterns which indicate this exact failure class:
  • dmesg or kernel log entries showing list_del corruption or a kernel BUG in lib/list_debug.c.
  • Oops / invalid opcode messages in the kernel log with a workqueue backtrace (kworker threads) that mention the NVMe or NVME‑FC code path.
  • NVMe‑FC related messages preceding an Oops, e.g. I/O timeouts, "io timeout", "transport association event: io timeout abort failed" and subsequent controller resets.
  • Repeated panics or reboots that correlate with high I/O or storage error conditions.
Detection is straightforward by scanning kernel logs; example detection steps are detailed in the detection section below.

Quick detection checklist​

  • Look for the literal phrase "kernel BUG" followed by references to lib/list_debug.c in the kernel log.
  • Grep dmesg for NVMe‑FC tokens (use ASCII hyphens in the patterns): NVME-FC, nvme-fc, nvme (device name), "io timeout", "transport association" and "list_del corruption".
  • If you see kworker backtraces and NVMe‑FC context near the crash, treat that as a strong indicator of this bug class.
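The checklist above can be folded into one helper. This is a sketch of our own (`scan_kernel_log` is not a standard tool), and the patterns are illustrative — exact message wording varies by kernel version:

```shell
#!/bin/sh
# Scan a captured kernel log for both signal groups: list-corruption BUGs
# and NVMe-FC context. Both must be present to flag the log.
scan_kernel_log() {
  logfile="$1"
  grep -qiE 'kernel BUG|list_del corruption|list_debug' "$logfile" || return 1
  grep -qiE 'nvme[-_]?fc|io timeout|transport association' "$logfile" || return 1
  echo "possible CVE-2025-40261 crash signature in $logfile"
}
# Typical use on a live host:
#   dmesg > /tmp/kern.log && scan_kernel_log /tmp/kern.log
```

Treat a hit as an indicator of this bug class, not proof — confirm against the full backtrace as described below.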

Verification and timeline (what we can confirm)​

  • The issue was reported and patched upstream; the corrective change moves the cancel_work_sync call so that ioerr_work cannot be queued after cancellation.
  • The fix was authored and posted to the kernel stable patch queue by a maintainer; the patch description and mailing list entries confirm the root cause (an ordering race between cancel_work_sync and nvme_fc_delete_association) and reproduce the observed kernel oops trace used to justify the change.
  • The public vulnerability record was published in early December 2025, with upstream patch activity occurring in November 2025. The timeline shows the patch landed before public vulnerability announcements, which is the normal responsible disclosure path for upstream kernel fixes.
Note: specific kernel releases that include the fix will vary by distribution and stable kernel branch. Distributions that maintain long‑term kernel branches will backport the patch into their supported kernels on variable schedules, so the precise fixed package name and version should be confirmed against your distribution's security advisory.

Detection: practical commands and checks​

Use the following steps to detect whether you have seen or are seeing the failure:
  1. Search kernel logs for list corruption or kernel BUG:
    • dmesg | grep -iE "kernel BUG|list_del corruption|lib/list_debug.c"
  2. Search for NVMe‑FC events and I/O timeouts:
    • dmesg | grep -iE "NVME-FC|nvme.*io timeout|transport association"
  3. If you suspect an issue but haven't recorded logs, enable persistent logging (rsyslog/journald persistent storage) and reproduce under controlled, low‑risk conditions (maintenance window).
  4. Check loaded modules and whether nvme‑fc is active:
    • lsmod | grep nvme
    • modinfo nvme-fc
  5. Correlate with workload I/O patterns and storage fabric events (switch logs, FC fabric alerts, SAN controller logs).
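Step 4 can be scripted. The sketch below is ours: note that the module shows up in lsmod as nvme_fc (underscore), and each NVMe controller exposes its transport in sysfs; the sysfs root is a parameter so the check can be rehearsed against a fabricated tree:

```shell
#!/bin/sh
# List NVMe controllers whose transport is Fibre Channel. Default root is
# the live sysfs path; pass another root for testing.
list_fc_controllers() {
  root="${1:-/sys/class/nvme}"
  for c in "$root"/nvme*; do
    [ -f "$c/transport" ] || continue
    if [ "$(cat "$c/transport")" = "fc" ]; then
      echo "${c##*/}"
    fi
  done
}
# On a live host:
#   lsmod | grep -q '^nvme_fc' && echo "nvme_fc module loaded"
#   list_fc_controllers
```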
Collect any oops output and stack traces for vendor support — the crash backtrace and associated dmesg lines help maintainers and distributors map the problem to the CVE and the underlying patch.

Mitigation and remediation guidance​

Priority remediation is to update the kernel to a version that contains the upstream patch or a distribution backport. Because the bug can crash a host, the fix should be treated as high priority in production NVMe‑FC deployments.
Recommended steps:
  1. Identify your kernel version and distribution kernel package:
    • uname -r
    • cat /etc/os-release (or check distribution package manager for kernel packages)
  2. Check your vendor/distribution security advisories for a fixed kernel package that addresses this NVMe‑FC issue. Apply the official kernel update from your distribution as the first option.
  3. If an immediate kernel update is not possible:
    • Temporary mitigation 1: Avoid controller/association deletions and module unloads during live high I/O workloads. Schedule maintenance windows for any change that will remove NVMe‑FC associations.
    • Temporary mitigation 2: If the system does not actively use NVMe‑FC, consider removing or blacklisting the module to prevent it from loading:
      • echo "blacklist nvme-fc" > /etc/modprobe.d/blacklist-nvme-fc.conf
      • Note: unloading the module from a running system that is actively using NVMe‑FC can itself trigger races and errors — avoid force removal in production.
    • Temporary mitigation 3: Reduce or quiesce I/O to NVMe‑FC devices before performing controller removal; ensure queues are drained and applications cleanly close handles before any driver/module teardown.
  4. After installing the kernel update, reboot the host to move to the fixed kernel image and verify dmesg no longer reports the previous failure patterns.
  5. Validate the fix in a test environment that reproduces your normal NVMe‑FC workload profile; stress tests that create controller/association delete events under load are the best way to validate the ordering fix.

Practical notes about module removal​

  • Removing the nvme‑fc module as a mitigation is only safe if no NVMe‑FC controller data paths are active. Attempting to unload the module while controllers or associations exist can hang or fail and might worsen the situation. Blacklisting to prevent module load on boot is safer when the host is not supposed to use nvme‑fc.
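A guarded version of the blacklist mitigation follows that spirit: refuse to write the blacklist file while any controller on the fc transport is present (presence is used here as a conservative proxy for "active"). This is our own sketch — both paths are parameters so it can be rehearsed outside /etc:

```shell
#!/bin/sh
# Write the modprobe blacklist entry only when no fc-transport controller
# exists. Defaults target the live system; override both for dry runs.
blacklist_nvme_fc() {
  sysroot="${1:-/sys/class/nvme}"
  conf="${2:-/etc/modprobe.d/blacklist-nvme-fc.conf}"
  for c in "$sysroot"/nvme*; do
    [ -f "$c/transport" ] || continue
    if [ "$(cat "$c/transport")" = "fc" ]; then
      echo "refusing: ${c##*/} is an nvme-fc controller; quiesce it first" >&2
      return 1
    fi
  done
  echo "blacklist nvme-fc" > "$conf"
}
```

The blacklist takes effect at the next boot; it does not unload an already-loaded module.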

Remediation checklist (operational playbook)​

  1. Inventory: list hosts using NVMe‑FC and rank by criticality.
  2. Confirm vulnerability: search dmesg / journalctl for the crash pattern described earlier.
  3. Prioritize patching: schedule kernel updates for high‑priority hosts first.
  4. Patch and reboot: apply distribution kernel update and reboot during maintenance windows.
  5. Validate: run I/O tests that create association teardown to confirm the bug no longer appears.
  6. Monitor: keep enhanced kernel logging for a week to detect any residual regressions or related failures.

Risk analysis and critical commentary​

Strengths of the fix:
  • The patch is minimal and focused: adjusting the order of synchronization steps is a low‑risk change that directly removes the race window.
  • The fix addresses the root cause (ordering) rather than masking symptoms, and it preserves the existing cancellation semantics (cancel_work_sync still used) but in the correct sequence.
  • Because the change is local and small, it can be backported into multiple stable kernel branches with low likelihood of introducing unrelated regressions.
Residual risks and cautions:
  • Kernel workqueue races are notoriously subtle. While this particular ordering fix closes the documented window, other code paths elsewhere in the driver could still interact unexpectedly under abnormal I/O or transport failure conditions.
  • Distributions and vendors that backport fixes must ensure the patch is applied consistently across all supported kernel branches — mismatches can leave some hosts vulnerable even after vendor updates appear in other branches.
  • Operators may be tempted to force module unloads or other manual remedies; such actions can trigger or worsen races. Conservative, documented maintenance procedures are essential.
  • The fix prevents the specific crash described; it does not eliminate every possible NVMe‑FC stability issue (e.g., unrelated I/O timeouts, adapter firmware bugs, or firmware/driver mismatches). NVMe‑FC deployments require integrated validation across host drivers, HBA firmware, SAN fabric and storage controllers.
Attack and exploitation considerations:
  • There is no public evidence that this vulnerability leads to arbitrary code execution or allows remote attackers to gain control of a host. The practical effect is a local crash caused by kernel memory corruption when asynchronous work runs against freed objects.
  • Because storage commands and controller events are the trigger, a malicious actor would need access to the storage fabric or a compromised machine that can generate NVMe‑FC I/O to provoke the condition. In most deployments, that means the attacker would have to be on the storage fabric or have access to privileged I/O operations — a higher bar than pure network‑exposed services.

Testing guidance for administrators and QA teams​

To validate remediation in a safe environment:
  1. Recreate the environment: use the same kernel package, HBA driver version, and SAN configuration found in production.
  2. Reproduce the pre‑fix behavior: generate sustained I/O to NVMe‑FC devices while simulating transport or controller failures that trigger association teardown (this may require coordination with storage teams or test harnesses).
  3. Observe kernel logs: before the fix, you should be able to reproduce list_del corruption or Oops traces in the kernel log under the stress conditions. After the fix, those traces should no longer appear.
  4. Run long‑duration soak tests: keep I/O running and periodically instruct controller reattach/remove to exercise the association cleanup paths.
  5. Monitor for regressions: after applying the fix in production, keep a heightened monitoring window to capture any unexpected side effects early.

Practical business considerations​

  • For virtualization hosts or storage servers, schedule kernel updates as part of regular maintenance cycles, but accelerate updates for systems that rely on NVMe‑FC.
  • For environments running vendor kernels (vendor‑supplied enterprise distributions), wait for the official vendor advisory and use that kernel package, because vendor kernels often include additional vendor backports and integration tests.
  • Consider implementing infrastructure-level mitigations such as high‑availability clusters and automated failover for service continuity while hosts are being patched.
  • Ensure your change control process captures the dependency chain (HBA firmware, kernel module versions, distribution kernel packages) so that kernel updates don't unexpectedly break vendor-specific behavior.

Conclusion​

CVE‑2025‑40261 is an instructive example of how a tiny ordering mistake in kernel teardown logic can produce a serious, system‑level failure. The underlying defect is not exotic — it's a race between cancellation of a workqueue item and later code that can schedule that same work — but the consequences can be severe in production storage servers: kernel oops, list corruption and host crashes.
The remedy is straightforward: apply the upstream patch or vendor backport that moves cancel_work_sync to after the association deletion, test it under your workload, and deploy the fixed kernel into production. In high‑availability environments where NVMe‑FC is in use, prioritize the update and apply conservative operational controls (quiesce I/O, schedule maintenance windows) while you remediate. Finally, treat this as a reminder that asynchronous teardown and workqueue interactions are sensitive areas of kernel code and require careful sequencing and comprehensive testing in real workload conditions.

Source: MSRC Security Update Guide - Microsoft Security Response Center