CVE-2025-40146: Linux blk-mq nr_requests deadlock fix for availability

Figure: Diagram of the blk-mq flow from nr_requests to the scheduler, illustrating the CVE-2025-40146 upstream fix.
A recent upstream Linux kernel patch closes CVE-2025-40146, a subtle but practical concurrency bug in the block multi-queue (blk-mq) layer that could deadlock I/O when the sysfs attribute nr_requests is increased. Administrators and cloud operators should treat this as an availability‑first risk and apply vendor kernel updates or backports as soon as they are available.

Background

What blk-mq and nr_requests do

The blk-mq (block multi-queue) subsystem is the modern Linux block layer designed to scale high-throughput storage I/O by splitting work across multiple hardware queues and minimizing global locking. The I/O scheduler abstraction (the elevator) attaches scheduler state to queues and exposes tunables—among them nr_requests—via sysfs so administrators can change queue depth at runtime to tune latency and throughput. Code that changes queue depth must carefully coordinate memory allocation, object ownership, and queue freeze/unfreeze semantics to avoid races and deadlocks.
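
As a rough illustration of the tunable described above, the sketch below prints the current nr_requests value and the scheduler line for one block device. The function name show_queue_settings and its optional second argument (an alternate sysfs root, handy for dry runs against a scratch directory) are assumptions made for this example; queue/nr_requests and queue/scheduler are the standard sysfs attribute locations.

```shell
#!/bin/sh
# Print the queue depth and scheduler line for one block device.
# show_queue_settings is a hypothetical helper name, not a kernel or distro tool.
# $1 = device name (for example sda); $2 = sysfs block root (defaults to /sys/block).
show_queue_settings() {
    dev="$1"
    root="${2:-/sys/block}"
    queue="$root/$dev/queue"
    if [ ! -r "$queue/nr_requests" ]; then
        echo "no readable nr_requests for $dev" >&2
        return 1
    fi
    # nr_requests holds a single integer: the request depth for the queue.
    echo "$dev nr_requests=$(cat "$queue/nr_requests")"
    # scheduler lists the available elevators; the active one is in [brackets].
    if [ -r "$queue/scheduler" ]; then
        echo "$dev scheduler=$(cat "$queue/scheduler")"
    fi
}

# Example (read-only, safe on production hosts):
#   show_queue_settings sda
```

Reading these attributes is harmless. Writing a larger value to nr_requests is the operation that exercises the code path discussed in this article, so that is best left to a lab system until the fix is confirmed.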

How a runtime change can become a deadlock

Changing nr_requests at runtime causes the block layer to allocate or free the per-hardware-queue (hctx) scheduler tag structures, commonly called sched_tags. If that allocation or freeing happens while the queue is frozen, and the freeze/unfreeze control paths are not ordered to guarantee safe lifetime transitions, the paths that allocate or free tags can end up waiting on each other, or on a context that cannot make progress, and the result is a hang or kernel deadlock. The CVE description summarizes the fix succinctly: allocate memory before freezing the queue, and free memory after the queue has been unfrozen.

What CVE-2025-40146 actually is

Technical summary

CVE-2025-40146 is catalogued as "blk-mq: fix potential deadlock while nr_requests grown." The defect arises when tags are allocated and freed around operations that freeze the request queue; because the code performed allocation/free while the queue could be in a frozen state, certain timing windows allowed two or more code paths to block on each other. Upstream's corrective approach is defensive: ensure allocations happen before the queue freeze, and the corresponding frees are deferred until after the queue has been unfrozen. This converts a timing-dependent deadlock into a deterministic and safe sequence.

Evidence in the public trackers and upstream trees

Multiple vulnerability trackers and open-source vulnerability feeds list the CVE and reference the upstream stable kernel commits that implement the fix. OSV and other aggregators map the introducing and fixing commits for the affected git tree; those commit IDs are the authoritative record that packagers and administrators should use when mapping their kernel builds to the remediation. Note that the commit pages on git.kernel.org may not be reachable from every environment; confirming that the commit appears in your vendor’s package changelog is the most reliable verification method.

Deep technical analysis

Where the race lives

The root cause is a classic ordering and lifetime problem in concurrent kernel code that manages data structures across multiple execution contexts. When the administrator increases nr_requests via sysfs, the kernel path responsible for updating queue depth may perform one or more of these actions:
  • allocate new sched_tags to support the larger request depth,
  • swap pointers inside elevator state that refer to tag structures,
  • freeze the queue to perform structural modifications,
  • free old tag structures after swapping.
If a path allocates or frees tags while the queue is frozen or while other CPU contexts hold complementary locks, a second thread executing a complementary path may try to allocate or free the same resources, or wait on locks that cannot be released until the first path completes. If the first path is in turn waiting on the second, neither can proceed, and the result is a deadlock. The minimal, correct approach is to allocate any needed resources before entering the frozen state and to defer frees until after unfreeze, so that allocation and freeing are never entangled with the freeze lifecycle.

Why this is an availability bug, not a confidentiality or integrity problem

The observable consequence is denial of service: blocked block-layer workqueues, hung I/O, and potentially system-wide stalls that require manual intervention or a reboot. There is no published evidence that this race directly leads to arbitrary code execution or data exposure; the impact is availability-first. That said, availability bugs in shared or multi-tenant environments are high-risk operationally: a malicious or misbehaving local user or tenant can deliberately trigger the condition to take down services. Several vulnerability feeds classify the attack vector as local and the primary impact as high on availability.

The upstream fix pattern

Upstream made a small, surgical change: allocate new tag structures before freezing the queue (so that allocation cannot block while the queue is frozen) and free any old tag structures only after the queue is unfrozen. This is intentionally conservative: it preserves existing semantics for the common paths but closes the narrow timing window that allowed a deadlock. Because the change is minimal and localized, it should backport cleanly to stable trees with a low risk of regressions, which is consistent with maintainers applying it to the stable branches referenced in the CVE metadata.

Who is at risk and how to prioritize remediation

High-priority environments

  • Cloud hypervisors and multi-tenant hosts where untrusted guest or container workloads share a host. A local tenant with access to device nodes or the ability to cause sysfs updates can trigger the race.
  • Storage servers and IO-heavy production systems that rely on predictable I/O latency and queue performance; a blocked blk-mq path can cause cascading service failure.
  • CI/CD runners, build servers, and developer workstations that allow execution of untrusted or third‑party code by multiple users.

Lower-priority environments

  • Single-tenant workstations where local privilege is tightly controlled and the attack surface for triggering nr_requests changes is limited.
  • Devices and images that never load the blk-mq elevator code or that run custom kernels without the affected code paths. Be cautious with this assumption, though: many vendor kernels reuse this code.

Exploitability and attacker prerequisites

Public trackers list the attack vector as local, with low required privileges in some configurations, primarily because some systems expose sysfs controls or device interfaces that let unprivileged or lightly privileged users exercise the affected code paths. On a default system the nr_requests attribute is writable only by root, so the practical exposure depends on how sysfs write access is delegated. There are no credible reports of remote, unauthenticated exploitation that bypasses local control requirements. Nonetheless, in shared infrastructure the impact can be effectively remote, because a tenant who is able to trigger the condition on a shared host disrupts services for everyone else on that host.

Detection, hunting, and validation

Quick telemetry checks (search across hosts)

  • Check kernel logs for signs of blocked block layer workers or repeated stack traces referencing blk-mq and elevator code paths:
    • journalctl -k | grep -Ei 'blk_mq|elevator|nr_requests|sched_tags'
    • dmesg | grep -Ei 'blk_mq|elevator|nr_requests'
  • Search for repeated reboots or crashes correlated with block device operations in your SIEM logs and monitoring dashboards.
  • If you have kdump or vmcore collection enabled, preserve the memory image for forensic analysis; kernel oops traces that exist only in the in-memory log buffer are lost on reboot.
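
The log checks above can be wrapped in a small reusable filter, sketched here under a few assumptions. The function names scan_blk_mq_signs and count_blk_mq_signs are invented for this example, and the extra "blocked for more than" pattern only helps on hosts where the kernel hung-task detector is enabled. A match is a hint worth triaging, not proof that this CVE was hit.

```shell
#!/bin/sh
# Patterns taken from the checks above, plus the hung-task detector message.
# The pattern list is an assumption for this sketch; tune it for your fleet.
BLK_MQ_PATTERN='blk_mq|elevator|nr_requests|sched_tags|blocked for more than'

# Print kernel log lines from stdin that may indicate a stuck blk-mq path.
scan_blk_mq_signs() {
    grep -Ei "$BLK_MQ_PATTERN"
}

# Print only the number of matching lines (0 when nothing matches).
count_blk_mq_signs() {
    # grep -c exits non-zero on zero matches; "|| true" keeps the status clean.
    grep -Eic "$BLK_MQ_PATTERN" || true
}

# Example:
#   journalctl -k --no-pager | scan_blk_mq_signs
#   dmesg | count_blk_mq_signs
```

Keeping the filter on stdin means the same function works against live logs, archived journal exports, or text pulled from a SIEM.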

Repro and validation in test labs

  • Build or run kernels that map to the upstream commit IDs cited by vulnerability trackers and attempt a controlled nr_requests increase while exercising heavy concurrent I/O on the target queue.
  • After applying the vendor kernel update, reproduce the same workload and confirm no deadlock or long hangs occur; verify dmesg and journalctl no longer contain the previous stack signatures.

Verifying the patch is present in your kernel package

  1. Identify the kernel version: uname -r
  2. Check your distribution package changelog for the CVE or commit ID (for example, rpm -q --changelog kernel | grep -i 'CVE-2025-40146' or apt changelog equivalents).
  3. For source-built kernels, search the tree for the upstream commit ID(s) referenced by OSV/NVD and inspect the diff to confirm the allocate-before-freeze, free-after-unfreeze ordering. The public git.kernel.org pages may not be reachable from every environment, so for packaged kernels the vendor changelog is the more reliable source of assurance.
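
The changelog check in step 2 can be scripted so that it gives a clear yes/no answer per host. The following is a minimal sketch: changelog_has_cve is a hypothetical helper name, and it only searches whatever changelog text is piped into it. A miss does not prove the kernel is vulnerable, since some vendors reference a bug or commit ID instead of the CVE number.

```shell
#!/bin/sh
# Report whether changelog text on stdin mentions a given CVE identifier.
# changelog_has_cve is a hypothetical helper name used only for this sketch.
# $1 = CVE identifier (defaults to CVE-2025-40146).
changelog_has_cve() {
    cve="${1:-CVE-2025-40146}"
    # -F treats the ID as a literal string; -i tolerates lowercase entries.
    if grep -Fiq -- "$cve"; then
        echo "$cve: mentioned in changelog"
    else
        echo "$cve: not found in changelog"
        return 1
    fi
}

# Example (RPM-based and Debian-based hosts respectively):
#   rpm -q --changelog kernel | changelog_has_cve
#   apt changelog "linux-image-$(uname -r)" | changelog_has_cve
```

The non-zero return status on a miss makes the function easy to use from fleet tooling that collects per-host results.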

Remediation and mitigation: step by step

  1. Inventory: Identify hosts that run kernels built from trees that include the blk‑mq code (uname -r; map kernel package -> upstream commits). Use endpoint management or orchestration tools to collect kernel versions fleet-wide.
  2. Prioritize: Tag multi‑tenant and I/O-critical hosts as high priority for patching.
  3. Patch: Install vendor/distribution kernel updates that include the upstream commit(s) fixing CVE‑2025‑40146. If your distribution has not yet published a package, request a vendor advisory or backport timeline. Distributors and security feeds (OSV, NVD, vendor advisories) list the stable commit IDs you can use to verify remediation.
  4. Reboot or livepatch: Kernel patches require either a reboot into the updated kernel or application of a vendor-supplied livepatch if available. Schedule reboots in maintenance windows with staged rollouts.
  5. Validate: After the update, rerun the test cases that previously reproduced the issue and monitor kernel logs for at least 7 to 14 days for any recurrence.
  6. Short-term mitigations if patching is delayed:
    • Restrict untrusted local code execution and shell access on high-risk hosts.
    • Limit who can write to sysfs attributes controlling queue depth (use SELinux, AppArmor, or other LSM controls where possible).
    • Isolate critical block devices or place them on dedicated hosts.
    • Increase kernel logging collection frequency and preserve logs immediately on OOPS/warn events for triage.
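
For the short-term mitigation of limiting who can write queue-depth attributes, a quick audit of file permissions is a reasonable first step. The sketch below lists any nr_requests attribute with a group- or other-write bit set. The helper name audit_nr_requests_perms is an assumption, the -perm /022 test is GNU find syntax, and the check looks only at mode bits: it does not account for privileged containers or other processes that effectively run as root.

```shell
#!/bin/sh
# List nr_requests attributes that are writable by group or other.
# audit_nr_requests_perms is a hypothetical helper name for this sketch.
# $1 = sysfs block root (defaults to /sys/block).
audit_nr_requests_perms() {
    root="${1:-/sys/block}"
    for attr in "$root"/*/queue/nr_requests; do
        # Skip the unexpanded glob when no block devices are present.
        [ -e "$attr" ] || continue
        # Print the path only if a group- or other-write bit is set (GNU find).
        find "$attr" -maxdepth 0 -perm /022 -print
    done
}

# Example: no output means every attribute is writable by its owner only.
#   audit_nr_requests_perms
```

Empty output on a host is the expected result for stock permissions; any path that is printed deserves a look at what changed the mode and why.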

Vendor coordination and long-tail risk

Embedded, appliance, and vendor-forked kernels

Small, surgical upstream fixes like this are straightforward to backport, but the long tail of embedded appliances, vendor-managed firmware, and special-purpose distributions can lag. If you rely on vendor-supplied images (NAS appliances, network gear, specialized appliances), open a support ticket and request explicit confirmation that the fix has been applied or is scheduled for backport. Treat the absence of vendor confirmation as continued exposure.

Livepatch considerations

If you operate high-availability systems where reboots are expensive and your vendor supplies livepatch capability, check whether the vendor’s livepatch catalog includes a fix for CVE‑2025‑40146. Livepatches are not universally available for every stable kernel release, and they vary by vendor support contract. When in doubt, plan for a staged reboot with rollback plans.

Why the fix is low-risk but urgent

  • Low-risk: The upstream change is intentionally minimal: an ordering/lifecycle fix that allocates before freeze and frees after unfreeze. Small diffs are easy to review, test, and backport, which reduces regression probability.
  • Urgent: When triggered, the bug becomes an operational denial of service. In multi-tenant and cloud environments the practical impact is severe even though the attack vector is local. The recommended response is therefore rapid patch deployment in prioritized environments.

What to tell stakeholders (executive summary)

  • Nature of the problem: A timing/race bug in the Linux block I/O layer (blk‑mq) that can cause a kernel deadlock when queue depth (nr_requests) is increased.
  • Impact: Service availability and I/O reliability; there is currently no evidence of remote code execution or data leakage.
  • Risk posture: High for multi‑tenant and I/O-critical hosts; lower for single-tenant desktops with tightly controlled access.
  • Recommended action: Patch kernels from your vendor or apply backports; if you cannot patch immediately, tighten local access, isolate affected hosts, and increase kernel crash monitoring.

Critical appraisal: strengths, residual risks, and open questions

Strengths of the upstream fix

  • The change follows kernel maintainers’ preferred pattern of surgical, defensive modifications: small changes that eliminate narrow timing windows without broad rewrites.
  • Small commits are easy for distributors to backport into stable releases, accelerating vendor patch delivery to end users.

Residual risks and operational caveats

  • Vendor lag: appliances and OEM kernels may remain vulnerable for extended periods; operators must insist on vendor backports or plan replacement images.
  • Detection gaps: kernel OOPS traces and transient deadlocks may be missed if central logging or kdump is not in place; reboots erase transient evidence.
  • Compositional risk: similar allocation/freeze lifecycles elsewhere in the block layer or in adjacent subsystems may harbor additional timing bugs; this fix addresses a single, clearly identified window but not the broader class.

Unverifiable or uncertain claims

  • The upstream diffs on kernel.org could not be inspected directly in every retrieval attempt because of access or fetch restrictions, so the description of the fix here relies on the CVE text and on the commit IDs listed by vulnerability trackers and OSV. Administrators should verify that those specific commit IDs appear in their vendor package changelog rather than relying solely on upstream links, which may not be accessible.

Appendix: practical commands and quick checks

  • Identify kernel and package metadata:
    1. uname -r
    2. rpm -q --changelog kernel | grep -Ei 'CVE-2025-40146|blk-mq|nr_requests'
    3. apt changelog linux-image-$(uname -r) | grep -Ei 'CVE-2025-40146|blk-mq|nr_requests'
  • Search kernel logs for signs:
    • journalctl -k | grep -E 'blk_mq|nr_requests|sched_tags|NULL pointer dereference|deadlock'
    • dmesg | grep -i 'blk_mq'
  • Repro test sketch (lab only):
    1. Create a controlled block device or loopback image.
    2. Drive heavy concurrent I/O workloads (fio or dd in parallel).
    3. Change nr_requests via sysfs while workload runs and observe system behavior.
    4. Repeat on patched vs. unpatched kernels to validate remediation.
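
Step 3 of the repro sketch can be automated with a small loop that alternates nr_requests between a low and a high value while the I/O workload runs. This is a lab-only sketch under stated assumptions: toggle_nr_requests is a hypothetical helper name, the device (loop0) and the values 64 and 256 are placeholders that the kernel may reject depending on the device and scheduler, and an I/O scheduler such as mq-deadline is assumed to be attached because the bug involves scheduler tags.

```shell
#!/bin/sh
# Alternate nr_requests between two values to exercise the depth-update path.
# toggle_nr_requests is a hypothetical helper name. LAB USE ONLY: on an
# unpatched kernel this may hang I/O on the target device.
# $1 = queue directory (for example /sys/block/loop0/queue), $2 = low value,
# $3 = high value, $4 = number of low/high cycles.
toggle_nr_requests() {
    queue="$1"; low="$2"; high="$3"; cycles="$4"
    attr="$queue/nr_requests"
    [ -w "$attr" ] || { echo "cannot write $attr" >&2; return 1; }
    i=0
    while [ "$i" -lt "$cycles" ]; do
        # Shrink first, then grow: the CVE concerns the grow direction.
        echo "$low" > "$attr" || return 1
        echo "$high" > "$attr" || return 1
        i=$((i + 1))
    done
    echo "completed $cycles cycles on $attr"
}

# Example (as root, on a disposable lab machine; device and values are placeholders):
#   fio --name=stress --filename=/dev/loop0 --rw=randrw --bs=4k --iodepth=32 \
#       --ioengine=libaio --direct=1 --numjobs=4 --time_based --runtime=60 &
#   toggle_nr_requests /sys/block/loop0/queue 64 256 1000
```

If the loop stops making progress on an unpatched kernel while the same run completes on a patched one, that supports the remediation; because the bug is timing-dependent, a clean run on an unpatched kernel does not prove it is unaffected.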

Conclusion

CVE-2025-40146 exposes a narrow timing and lifecycle bug in the Linux blk-mq subsystem: allocations and frees of scheduler tag structures that cross queue freeze/unfreeze boundaries can deadlock when nr_requests is increased. The upstream remedy is conservative and easy to backport (allocate before you freeze, free after you unfreeze), but the practical risk is immediate for multi-tenant and I/O-critical hosts. Administrators should treat this as a priority patch for affected kernels: inventory your fleet, confirm that vendor packages include the upstream commits referenced in public trackers, schedule staged rollouts (or livepatch where supported), and harden detection and local-access controls until the update is widely deployed. The fix closes a long-standing, subtle corner case that left hosts vulnerable to availability disruption; deploying it promptly reduces operational risk and preserves I/O reliability for production services.
Source: MSRC Security Update Guide - Microsoft Security Response Center
 
