The Linux kernel fix for CVE-2025-38211 closes a subtle but dangerous lifetime-management bug in the RDMA iWCM (iWARP Connection Manager) stack: work objects allocated per cm_id could be used after they were freed, causing kernel memory corruption and deterministic crashes that reproduce with the nvme/061 blktests case.
Background
RDMA connection management in the kernel relies on paired objects: a public-facing cm_id and an internal cm_id_private structure that holds per-connection state and per-cm_id work items. A change made years ago to simplify cm_id lifetime handling—freeing cm_id when the last reference was dropped—introduced a race where a work item running an event handler could decrement the last reference while the destroy path freed the cm_id_private and its associated work entries. The result is a classic use-after-free in kernel context: a worker executes code that dereferences memory that has just been freed, often triggering a KASAN slab-use-after-free and a kernel oops.
Why operators should care: this is not a minor leak. The observable failure mode is deterministic availability loss—kernel oopses, KASAN reports and process crashes—triggerable from local contexts that can exercise the RDMA stacks (for example, during NVMe over RDMA tests), and therefore it’s an operationally serious kernel defect for any host that loads the RDMA/iWCM modules or exposes RDMA services.
What exactly went wrong — technical anatomy
At a high level the bug is a lifecycle/race problem between (a) event handler work executing on a cm_id, and (b) cm_id destruction that frees per-cm work-object allocations.
- Historically, cm_id resources were freed when the last reference was removed. That sounds correct in principle, but the code paths that remove references include event handler work items themselves.
- One earlier fix forced a flush of pending work when destroying a cm_id, which prevented some use-after-free windows. However, separate per-cm work objects are allocated at cm_id creation (via alloc_work_entries()) and freed in dealloc_work_entries() when the cm_id’s last reference is removed. If the last reference is dropped from inside the worker that owns the work object, the work object can be freed while the worker still needs it — producing a use-after-free during the workqueue activation path. The kernel’s KASAN diagnostics exposed the failure as slab-use-after-free in __pwq_activate_work.
Concretely, the sequence looks like this:
- cm_id is created and per-cm work objects are allocated.
- An event is queued to the iWCM workqueue.
- The work handler runs, and as part of its processing it decrements the cm_id reference count to zero.
- The reference-drop path frees the cm_id_private and its per-cm work objects.
- The still-running work handler attempts to access the now-freed work objects, producing UAF and kernel oops.
The stable upstream remedy is surgical: ensure that the last decrement of the cm_id reference does not happen from inside the work context that owns the work object; instead, move iwcm_deref_id() out of destroy_cm_id() and call it from the contexts that call destroy_cm_id() after flushing pending work. In short: flush pending works and ensure the last deref happens outside the work’s execution context so the work object cannot be freed while it is still executing.
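The ordering problem and its remedy can be sketched with a toy user-space model. This is not kernel code: ToyCmId, buggy_order and fixed_order are hypothetical names invented for illustration, the steps run sequentially rather than concurrently, and the only thing modeled is which reference drop happens last.

```python
class UseAfterFree(Exception):
    """Raised by the toy model when freed work entries are touched."""


class ToyCmId:
    """Hypothetical stand-in for cm_id_private; a model, not kernel code."""

    def __init__(self):
        self.refcount = 1        # reference held by the creator
        self.work_freed = False  # True once the per-cm work entries are freed

    def ref(self):
        self.refcount += 1

    def deref(self):
        # Models the last put freeing the per-cm work entries.
        self.refcount -= 1
        if self.refcount == 0:
            self.work_freed = True

    def touch_work(self):
        # Models the workqueue still using the work object after the handler.
        if self.work_freed:
            raise UseAfterFree("work entry accessed after free")


def buggy_order(cm):
    cm.ref()         # event queued: the work item holds a reference
    cm.deref()       # creator's reference is dropped before the work finishes
    cm.deref()       # handler drops its reference: this is the last one
    cm.touch_work()  # workqueue touches the freed work object


def fixed_order(cm):
    cm.ref()         # event queued: the work item holds a reference
    cm.deref()       # flush: handler runs and drops its own reference
    cm.touch_work()  # safe, the destroy context still holds a reference
    cm.deref()       # last deref happens after the flush, outside the work
```

In this model buggy_order(ToyCmId()) raises UseAfterFree, while fixed_order completes and frees the work entries only at the final step.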
Affected kernels, commits and patch status
The Linux kernel team assigned CVE-2025-38211 and mapped the introduction and upstream fixes to concrete commits and stable-tree backports. The vulnerability was introduced by commit 59c68ac31e15 (which simplified cm_id lifetime handling) and fixed in multiple stable releases with targeted commits. Public upstream announcements list the following representative fix commits and stable versions:
- Fixes landed in stable-tree series including 5.15.186 (commit 3b4a50d7…), 6.1.142 (commit 78381dc8…), 6.6.95 (commit 23a707bb…), and 6.12.35 (commit 764c9f69…).
Vendor advisories and distribution trackers incorporated those upstream changes into packaged kernel updates. Ubuntu’s CVE entry and AWS’s ALAS advisory both list CVE-2025-38211 and map the fix into their patch timelines; the public NVD summary mirrors the same actionable description. Operators should assume that kernels earlier than the vendor’s fixed package are vulnerable until the vendor-supplied kernel update is installed and the host is rebooted into it (or the vendor provides a livepatch).
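For hosts that run unmodified upstream stable kernels, a release string can be compared against the fixed versions listed above. The following is a minimal sketch under that assumption; the function name and status strings are hypothetical, and the result is not meaningful for distribution kernels, which use their own version numbering and backport fixes independently.

```python
import re

# Fixed upstream stable versions as listed in this article, keyed by series.
FIXED_STABLE = {(5, 15): 186, (6, 1): 142, (6, 6): 95, (6, 12): 35}


def upstream_fix_status(release):
    """Classify an upstream-style release string such as '6.6.94'.

    Returns 'at-or-after-upstream-fix', 'before-upstream-fix', or 'unknown'
    when the string does not parse or the series is not in FIXED_STABLE.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if not match:
        return "unknown"
    major, minor, patch = (int(part) for part in match.groups())
    fixed_patch = FIXED_STABLE.get((major, minor))
    if fixed_patch is None:
        return "unknown"
    if patch >= fixed_patch:
        return "at-or-after-upstream-fix"
    return "before-upstream-fix"
```

A string from `uname -r` can be passed in directly; treat an 'unknown' or vendor-numbered result as a prompt to consult the vendor advisory rather than as a verdict.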
How this was reproduced and where it shows up
The maintainers documented a reproducible test case: repeating the blktests "nvme/061" test against the rdma transport and the SIW (software iWARP) driver triggers the use-after-free. This gives developers a reliable way to exercise the call paths that lead to the race described above, and it was used to validate the fix. The bug’s signature in logs is a KASAN slab-use-after-free trace pointing to __pwq_activate_work, together with worker thread stack traces for the iWCM workqueue.
Why that matters operationally: many RDMA setups will not run blktests, but the presence of any workload that uses RDMA verbs and that can cause event-driven cm_id destruction creates a sensitive exposure. User-space NVMe-over-RDMA clients, iSER or other stacks that interact with the iWCM code are the realistic vectors to trigger the event sequences in practice.
Severity, exploitability and threat model
- Scope: Local-only. The iWCM stack is reachable from local processes that can open or use RDMA connections or from tenants in multi-tenant clouds that are granted RDMA access. This is not a pure remote network attack unless RDMA services are exposed to untrusted workloads.
- Impact: Primary impact is availability. The defect reliably produces kernel oops/kernel crashes (KASAN evidence), which leads to service disruption or forced reboots. Some trackers also classify the defect as high because of the potential for kernel memory corruption, which in theory could be escalated, although a reliable RCE chain is not demonstrated publicly.
- CVSS and scoring: Public trackers and vendor feeds rated the availability impact as high; for example, some feeds record elevated CVSS scores reflecting AV:L (local attack vector) and A:H (high availability impact). Operators should treat hosts running RDMA stacks as high-priority candidates for patching.
- Exploitability: Low-to-moderate difficulty for DoS (straightforward to trigger locally); considerably harder for remote code execution absent additional kernel vulnerabilities and environment-specific heap layouts. There are no widely circulated public PoCs or active exploitation reports at disclosure time; nonetheless, the deterministic crash behavior makes denial-of-service a pressing operational concern.
Detection — what to look for in your telemetry
Practical indicators to detect attempted or successful triggering of this bug include:
- KASAN slab-use-after-free messages that reference __pwq_activate_work, the iWCM workqueue (named iw_cm_wq in current kernels), or related iWCM functions in kernel logs (dmesg/journal).
- Worker thread crashes or frequent kernel oops traces originating from kworker threads handling iwcm work items.
- Failures or test-case reproductions in blktests nvme/061 when executed against RDMA transports in a test lab.
- Elevated kernel panics or unexpected reboots in systems that provide RDMA services (iSER/NVMe over RDMA/iWARP), particularly during teardown or connection-close events.
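The log indicators above can be turned into a simple text scan. This is a rough sketch, assuming kernel log text has already been collected (for example from dmesg or the journal); the function name, the hint list, and the 30-line window are illustrative choices, not a vetted detection rule.

```python
import re

# Marker and hints derived from the indicators listed above.
UAF_MARKER = "KASAN: slab-use-after-free"
IWCM_HINTS = re.compile(r"__pwq_activate_work|iw_cm|iwcm")


def find_iwcm_uaf_reports(log_text, window=30):
    """Return (line_number, line) for KASAN UAF reports that mention iWCM hints.

    A report counts as a hit if the marker line, or any of the `window`
    lines that follow it, contains one of the hints.
    """
    lines = log_text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if UAF_MARKER not in line:
            continue
        report = lines[index:index + window + 1]
        if any(IWCM_HINTS.search(entry) for entry in report):
            hits.append((index + 1, line.strip()))
    return hits
```

Because the window is a fixed number of lines, two unrelated reports that sit close together can be attributed to each other; any hit should be read in full before acting on it.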
Operational detection steps:
- Check whether RDMA modules are present and loaded: examine lsmod or /proc/modules for iw_cm, rdma_cm, siw, or iWARP-capable RNIC drivers (for example iw_cxgb4 or irdma). If iw_cm is neither loaded nor built into the kernel, the kernel is not executing the iWCM paths in question.
- Inspect dmesg/journal for KASAN or refcount underflow warnings mentioning iw_cm or workqueue activation traces.
- If you operate test harnesses, run blktests nvme/061 in a controlled environment to reproduce and validate fixes before rolling into production.
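The module check in the first step can be scripted against the contents of /proc/modules, where the first whitespace-separated field of each line is the module name. A minimal sketch follows; the function names and the default watch list (iw_cm and siw) are assumptions to adjust for the RNIC drivers in your environment, and code built into the kernel does not appear in /proc/modules at all.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # first field is the module name
    return names


def rdma_modules_present(proc_modules_text, watch=("iw_cm", "siw")):
    """Return the watched module names that are currently loaded, sorted."""
    return sorted(loaded_modules(proc_modules_text) & set(watch))
```

On a live host the text can be read with open("/proc/modules").read(); an empty result means none of the watched modules are loaded as modules.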
Mitigations and remediation recommendations
The only correct remediation is to install vendor-supplied kernel packages that include the upstream backported fixes and then reboot into the patched kernel (or apply an official vendor livepatch where available). Vendor advisories from mainstream distributions (Ubuntu, Debian, AWS/ALAS and others) mapped the upstream fixes into stable kernel packages—installing those packages is the supported path.
If an immediate kernel upgrade is impossible, consider the following stopgaps (temporary and imperfect):
- Restrict RDMA access: prevent untrusted users/containers from creating or destroying RDMA connections; remove device access (unbind RDMA devices from untrusted namespaces) so tenant code cannot drive cm_id lifecycle events. This reduces attack surface but may be impractical for production RDMA services.
- Isolate RDMA hosts: place RDMA-capable nodes into management VLANs or behind strict access control until they can be patched.
- Prevent workloads from triggering teardown flows: limit or throttle test workloads or guest operations that create and destroy many RDMA sessions rapidly. Use cgroups or orchestration constraints where feasible.
Short term steps (0–72 hours):
- Inventory: list hosts with RDMA stack loaded (lsmod, lspci to detect RNICs).
- Prioritize: schedule patching windows for hosts that are both vulnerable and exposed to untrusted workloads.
- Patch: apply vendor kernel updates; validate in staging prior to mass rollout.
Longer term:
- Integrate kernel CVE tracking into image build and validation pipelines.
- Where uptime windows are constrained, prefer vendor livepatch offerings vetted for kernel stability and compatibility for your RNICs.
Why the upstream fix is sensible — and what residual risk remains
Strengths of the fix:
- The upstream correction is narrowly scoped and addresses the root cause by changing the order of dereferences and ensuring flushing of pending work prior to the last deref. That makes the fix low-risk to accept into stable kernel trees and straightforward to backport.
- The fix preserves normal performance characteristics because it doesn’t fundamentally change the workqueue architecture; it simply enforces correct ordering for resource releases.
Residual operational risks:
- Backport variability: vendors backport fixes differently and ship them at different times and in different kernel package names. Operators must rely on vendor advisories and package changelogs, not only upstream commit IDs, to validate remediation.
- Long-tail devices: embedded appliances, vendor kernels, and marketplace images that do not regularly update kernels can remain vulnerable for extended periods. These endpoints require inventory controls and vendor engagement.
- Exposure via RDMA in multi-tenant clouds: if a cloud or hosting environment exposes RDMA to tenants, untrusted tenants may be able to exercise the local attack vector. Isolation and strict device access controls remain crucial until patches are deployed.
A general kernel engineering note: this kind of fix follows a recurring correctness pattern in kernel device driver work, namely to always cancel or flush pending work before freeing the objects the work might reference. That discipline appears across multiple unrelated fixes in driver code; operators should treat similar patterns in other drivers as potential hazard points.
Practical checklist for administrators
- Inventory: identify hosts that load iWCM or that physically present RNICs (check lsmod, lspci and vendor tools). Mark those hosts as high-priority.
- Consult vendor advisories: check your distribution’s CVE advisory for CVE-2025-38211 and map to the package name and fixed version. Do not rely on generic CVE text alone—install the vendor-supplied package.
- Stage and test: validate the vendor kernel update in a staging environment that exercises RDMA workloads and the NVMe-over-RDMA testcases where applicable. Run blktests nvme/061 repeatedly in staging to confirm that the patched kernel no longer reproduces the crash.
- Rollout: schedule patch windows and apply updates; reboot hosts into the patched kernel. If vendor livepatches are offered and vetted, follow vendor guidance for critical hosts where reboots are costly.
- Post-patch verification: confirm the kernel package includes the relevant upstream commit or that vendor changelogs list CVE-2025-38211. Check dmesg/journal for absence of KASAN UAF traces after applying the patch.
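The changelog part of post-patch verification can be automated with a small text check. This sketch assumes the changelog text has already been obtained (for instance from `rpm -q --changelog` on RPM-based systems or from the packaged Debian changelog); the function name is hypothetical and the kernel package name differs by distribution.

```python
import re


def changelog_mentions_cve(changelog_text, cve_id="CVE-2025-38211"):
    """Return True if the changelog text mentions the exact CVE identifier.

    The lookarounds stop a longer identifier such as CVE-2025-382110
    from being counted as a match.
    """
    pattern = re.compile(
        r"(?<![\w-])" + re.escape(cve_id) + r"(?!\d)",
        re.IGNORECASE,
    )
    return bool(pattern.search(changelog_text))
```

A False result does not prove the kernel is unpatched, since some vendors reference only the upstream commit or a bug number; in that case fall back to the vendor advisory.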
Critical assessment and final takeaway
CVE-2025-38211 is a textbook kernel lifecycle race that produced a reliable use-after-free in the RDMA iWCM subsystem. The vulnerability’s danger lies less in exotic exploitation and more in its deterministic capacity to remove availability from hosts that run RDMA stacks, hosts that often play critical roles in storage clusters, HPC, and certain cloud service profiles.
The upstream fix is technically correct and conservative: ensure flushing of pending work and move the last dereference out of contexts where the work itself can cause the last deref. Because the change is narrowly scoped, vendors successfully backported it into multiple stable kernels and published advisories. Operators must respond in kind: inventory, stage, patch, and verify.
Two practical lessons emerge:
- Maintain an accurate runtime inventory of kernel modules and device drivers across your estate; knowing which hosts load iWCM or have RNICs is the single biggest triage gain for this CVE.
- Treat UAF fixes in kernel drivers as high-priority when they affect networking or storage datapaths; availability failures at the kernel level are operationally disruptive and can ripple across clusters and tenants.
Apply the vendor-supplied kernel updates for CVE-2025-38211 as soon as feasible, validate in staging against RDMA workloads and the NVMe/RDMA testcases, and if necessary, temporarily reduce RDMA exposure until the patches are in place. The fix is available; the operational work is inventory, testing, and rollout.
Conclusion: CVE-2025-38211 underscores the recurring kernel-dev principle that work must not outlive the objects it references. The Linux community patched the iWCM lifetime bug with focused, low-risk commits; system operators now bear the responsibility to apply those patches and to harden RDMA-exposed hosts until fully remediated.
Source: MSRC
Security Update Guide - Microsoft Security Response Center