Linux UFS Hang Fix CVE-2025-38119: Stable Patch for Availability

The Linux kernel received a targeted, low-level fix for a hang in the UFS (Universal Flash Storage) SCSI error handler, a bug that can cause a persistent loss of availability by deadlocking kernel threads during device error recovery. The change is small and surgical at the code level, but its operational consequences are tangible for systems that load the UFS host controller driver: administrators should treat this as an availability (DoS) issue, verify whether their kernels include the fixed commits, and apply vendor-supplied kernel updates or backports as soon as practical.

Background / Overview

Universal Flash Storage (UFS) is a high-performance storage interface commonly used in mobile devices and embedded platforms, and its Linux host controller driver (ufshcd) is part of the kernel's SCSI subsystem, under drivers/ufs/. The recently assigned CVE-2025-38119 describes a hang in the UFS error-handling path: a runtime power-management resume is blocked by a driver state flag, producing a deadlock that prevents error recovery from completing.
The kernel community published the advisory and stable-tree backports to address this precise timing/ordering bug; the fix changes the point at which the driver marks error‑handling as “in progress” so that resumption routines can actually complete instead of being blocked by the very flag intended to mark recovery activity. The Linux kernel CVE announcement and vendor advisories summarize the problem and point operators to the exact stable commits containing the correction.

What happened (technical summary)​

The core defect in plain language​

At the heart of CVE-2025-38119 is a classic state-ordering problem in driver error recovery. The ufshcd error handler sets a flag, UFSHCD_EH_IN_PROGRESS, to indicate that the device is undergoing error handling. Its preparation routine, ufshcd_err_handling_prepare(), synchronously resumes the device from runtime PM (power management) by calling ufshcd_rpm_get_sync(). That call can only succeed while UFSHCD_EH_IN_PROGRESS is clear, because resuming submits a SCSI command through ufshcd_queuecommand(), and ufshcd_queuecommand() returns SCSI_MLQUEUE_HOST_BUSY whenever the flag is set.
In the buggy sequence the flag is set before ufshcd_rpm_get_sync() is called, so the resume command is never accepted, the resume never completes, and the caller waits on it indefinitely: a deadlock. The correct approach is to perform the synchronous resume first and set the in-progress flag afterwards. The upstream patch implements precisely this ordering change.
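The ordering problem can be reproduced in a few lines of user-space code. The following is a toy model only, not kernel code: the class ToyHost, its methods, and the max_attempts cutoff are hypothetical stand-ins introduced for illustration, and the real resume path waits rather than giving up after a fixed number of tries.

```python
from enum import Enum


class QueueResult(Enum):
    OK = 0
    HOST_BUSY = 1  # stands in for SCSI_MLQUEUE_HOST_BUSY


class ToyHost:
    """Toy stand-in for host controller state; not the kernel's data structure."""

    def __init__(self):
        self.eh_in_progress = False  # models UFSHCD_EH_IN_PROGRESS
        self.resumed = False

    def queuecommand(self):
        # Models the gate in ufshcd_queuecommand(): refuse new commands
        # while error handling is marked as in progress.
        if self.eh_in_progress:
            return QueueResult.HOST_BUSY
        return QueueResult.OK

    def rpm_get_sync(self, max_attempts=5):
        # Models ufshcd_rpm_get_sync(): the resume needs one SCSI command
        # to be accepted. The toy gives up after max_attempts (an
        # assumption made here) so that the script terminates.
        for _ in range(max_attempts):
            if self.queuecommand() is QueueResult.OK:
                self.resumed = True
                return True
        return False


def error_handler_buggy(host):
    host.eh_in_progress = True   # flag first ...
    return host.rpm_get_sync()   # ... so the resume is always refused


def error_handler_fixed(host):
    resumed = host.rpm_get_sync()  # resume first, while the flag is clear
    host.eh_in_progress = True     # then mark error handling in progress
    return resumed


if __name__ == "__main__":
    print("buggy order resumes:", error_handler_buggy(ToyHost()))
    print("fixed order resumes:", error_handler_fixed(ToyHost()))
```

In the buggy order every attempt to queue the resume command is rejected, which in the kernel shows up as an indefinite wait; in the fixed order the command is accepted before the flag goes up.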

Why this results in a hang (backtrace context)​

Public advisories and the upstream commit include a representative backtrace showing a kernel worker thread blocked inside wait/schedule primitives: it is attempting a runtime resume and power-mode change and stalls waiting on a SCSI START/STOP operation. The trace ends inside the ufshcd error handler, and the system cannot complete device recovery without external intervention (reboot or manual remediation). This is an availability issue rather than memory corruption or privilege escalation.

The upstream fix — what changed and why it works​

The code-level remedy​

The change is intentionally minimal and localized to drivers/ufs/core/ufshcd.c: rather than marking the driver as "error handling in progress" before calling ufshcd_rpm_get_sync(), the code now calls ufshcd_rpm_get_sync() first. The UFSHCD_EH_IN_PROGRESS flag is set only after that call has returned.
This reordering prevents the resume routine from seeing its own blocker. Because the flag is still clear during the resume, ufshcd_queuecommand() accepts the SCSI command the resume depends on, and the resume can complete instead of waiting forever. Kernel stable-tree commits containing this change were merged and backported across several supported branches.

Why a tiny reorder is the right fix​

  • It preserves proper driver state semantics: in‑progress should mean “we have committed to recovery and have active resources that will complete,” not “we’re about to try to resume.”
  • It avoids turning the resume path into a victim of its own flag-based gating logic.
  • It is narrowly scoped and low‑risk compared with larger structural changes — making it suitable for stable-tree backports and vendor errata.
This pattern — ordering a state change before an operation the state will block — is a frequent source of kernel deadlocks and has historically yielded similarly surgical fixes. The underlying class of bug is well understood in kernel engineering: state must not prevent a needed operation from making forward progress. For a conceptual discussion of these deadlock patterns in device drivers, see the kernel driver recovery explainers used by maintainers.

Scope and exposure: who is affected?​

Affected component​

  • The bug resides in the Linux kernel’s SCSI UFS host controller driver (ufshcd), file path drivers/ufs/core/ufshcd.c.

Attack surface and exploitability​

  • Attack vector: Local (AV:L). The issue is triggered by device error/recovery and runtime‑PM interactions, not by unauthenticated remote traffic. That means an attacker would need local code execution or the ability to provoke device error/recovery sequences (for example, by causing the device to unexpectedly drop from the bus or by power cycling hardware).
  • Practical exposure depends on whether the host loads the ufshcd driver and uses UFS controllers. This is common on mobile and embedded Linux platforms and also appears in some OEM or appliance kernels for specialized hardware. Standard server-class x86 installations rarely include UFS controllers by default, but vendors ship many platform‑specific kernels where the driver is present.

Severity and CVSS​

  • Public tracking lists the issue as Medium with a CVSS v3 base score commonly reported around 5.5, reflecting a Local attack vector and Availability‑only impact. The risk is real for affected hosts because the kernel can deadlock, blocking I/O and potentially holding up clean shutdown or suspend/resume operations.
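
The 5.5 figure can be reproduced from the CVSS v3.1 base-score equations. The sketch below assumes the vector AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H. Only the Local attack vector and the availability-only impact come from the advisories summarized here; the remaining metrics are an assumption chosen because they are consistent with the reported score, so check the vector your vendor publishes.

```python
import math

# CVSS v3.1 metric weights for the ASSUMED vector
# AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
AV_LOCAL = 0.55
AC_LOW = 0.77
PR_LOW_SCOPE_UNCHANGED = 0.62
UI_NONE = 0.85
CONF_NONE = 0.0
INTEG_NONE = 0.0
AVAIL_HIGH = 0.56


def roundup(value):
    """Round up to one decimal place, as defined in the CVSS v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged(av, ac, pr, ui, c, i, a):
    """CVSS v3.1 base score for vectors whose Scope is Unchanged."""
    iss = 1 - (1 - c) * (1 - i) * (1 - a)      # impact sub-score
    impact = 6.42 * iss
    exploitability = 8.22 * av * ac * pr * ui
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


score = base_score_scope_unchanged(
    AV_LOCAL, AC_LOW, PR_LOW_SCOPE_UNCHANGED, UI_NONE,
    CONF_NONE, INTEG_NONE, AVAIL_HIGH,
)
print(score)  # 5.5
```

The impact term is about 3.60 and the exploitability term about 1.83; their sum, roughly 5.43, rounds up to 5.5 under the specification's round-up rule.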

Evidence, patches and vendor response​

Upstream commits and stable backports​

The Linux kernel CVE announcement and stable‑tree mailing list provide the canonical explanation and link to the exact stable commits implementing the fix. Kernel maintainers merged the change and then propagated it into supported stable branches; maintainers list the affected file and provide commit IDs for packagers to reference when backporting.

Distribution and vendor advisories​

  • Ubuntu published a security advisory listing CVE‑2025‑38119 with the Medium priority and CVSS details, advising administrators to update to fixed kernel packages.
  • Oracle Linux and other vendor CVE trackers have indexed the same issue and show it as resolved in patched kernel trees; Snyk and other vulnerability databases also reflect the vendor mappings.
Where vendors manage custom kernel trees (OEMs, embedded vendors, appliance makers), they may either backport the single commit or include it as part of a point release. Operators should rely on their vendor’s kernel updates and errata rather than cherry‑picking upstream commits unless they are experienced with kernel maintenance and willing to accept the testing burden.

Detection and forensics — how to spot the bug in the wild​

Log and behavior indicators​

  • Kernel logs (dmesg / journal) showing worker threads stuck in ufshcd error handler routines or traces that match the published backtrace are the clearest signals. The public backtrace in advisories highlights waits inside schedule/io_schedule_timeout/blk_execute_rq followed by ufshcd_set_dev_pwr_mode and ufshcd_wl_runtime_resume sequences.
  • Symptoms: I/O hangs on the relevant device, failed runtime resume attempts in the kernel log, unresponsive worker threads, and possibly blocked shutdown or suspend operations that do not complete until a reboot. These are availability signals rather than integrity or confidentiality breaches.
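
A first-pass log check can be automated. The sketch below looks for the function names mentioned above in the usual "symbol+0xoffset/0xlength" stack-frame format. The matching rule (at least one wait frame plus both UFS resume frames) is an assumption made for this example rather than a signature published by the kernel team, so treat a match as a prompt for manual review.

```python
import re

# Function names taken from the advisory summary above; a real trace
# may differ in detail between kernel versions and architectures.
UFS_SYMBOLS = ("ufshcd_set_dev_pwr_mode", "ufshcd_wl_runtime_resume")
WAIT_SYMBOLS = ("io_schedule_timeout", "blk_execute_rq")

# Matches stack frames such as "blk_execute_rq+0x84/0xf4".
_FRAME_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def symbols_in_trace(log_text):
    """Return the set of function names that appear as stack frames."""
    return set(_FRAME_RE.findall(log_text))


def looks_like_ufs_eh_hang(log_text):
    """True if the log has a wait frame and both UFS resume frames."""
    found = symbols_in_trace(log_text)
    has_wait = any(sym in found for sym in WAIT_SYMBOLS)
    has_ufs = all(sym in found for sym in UFS_SYMBOLS)
    return has_wait and has_ufs
```

Pass the function the text captured from dmesg or journalctl -k. It inspects the whole input at once, so split long logs per trace if you need to know which trace matched.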

Forensics checklist​

  • Preserve full kernel logs and any oops/panic dumps if the hang escalates to a kernel panic.
  • Capture the output of dmesg and journalctl -k and save a copy of /proc/interrupts and lsmod output to document module load state.
  • If possible, reproduce in a lab with identical hardware and kernel build to capture a deterministic trace before applying fixes.
  • Correlate device attach/detach events with the time of the hang to determine whether device faults or power transitions preceded the deadlock.

Mitigation and remediation guidance​

Immediate actions (0–48 hours)​

  • Inventory: Identify hosts that have UFS controllers and load the ufshcd driver: run lsmod | grep ufshcd. Because lsmod lists only loadable modules, also check the kernel config (CONFIG_SCSI_UFSHCD) and device lists on kernels that build the driver in. If you use device inventory tools, query for modules built from drivers/ufs.
  • Logs: Scan kernel logs for traces that match the published backtrace or for repeated runtime‑PM resume failures tied to UFS devices.
  • Prioritize: For devices that are production-critical or located in multi‑tenant environments, schedule immediate remediation windows.
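
The inventory step can be scripted for a fleet. The sketch below parses lsmod output and lists matching entries under /sys/module. The helper names are hypothetical, and the /sys/module check is a best-effort heuristic: built-in drivers usually appear there only when they export parameters, so an empty result does not prove the driver is absent.

```python
import os


def ufs_modules_from_lsmod(lsmod_text):
    """Return module names from `lsmod` output whose name contains 'ufshcd'."""
    names = []
    for line in lsmod_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and "ufshcd" in fields[0]:
            names.append(fields[0])
    return names


def ufs_entries_in_sys_module(sys_module_dir="/sys/module"):
    """List entries under /sys/module whose name contains 'ufshcd'."""
    try:
        return sorted(n for n in os.listdir(sys_module_dir) if "ufshcd" in n)
    except OSError:
        # Directory missing or unreadable (for example, not a Linux host).
        return []
```

To use it on a live host, capture the text with subprocess.run(["lsmod"], capture_output=True, text=True).stdout and pass it to ufs_modules_from_lsmod(). A non-empty result from either function means the host belongs in the remediation list.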

Recommended remediation​

  • Apply vendor-supplied kernel updates that include the stable‑tree commit for CVE‑2025‑38119 and reboot into the updated kernel. This is the preferred, supported option.
  • If you cannot immediately update kernels, contact the hardware/vendor support team to request an appliance/factory image patch or guidance. Embedded vendors often provide vendor-specific maintenance images.
  • As a temporary mitigation, if the hardware allows it, consider disabling the UFS controller driver (blacklist the module) for hosts where UFS devices are not required; note this will remove the device’s functionality. Use this only where functional impact is acceptable.

On cherry-picks and livepatches​

  • Cherry-picking the upstream commit into a vendor kernel is technically possible but not recommended unless your team can run thorough regression testing. The kernel CVE announcement notes that individual changes are never tested alone, only as part of a full kernel release, and that cherry-picking individual commits is not recommended or supported by the kernel community.
  • Livepatch services may or may not cover this particular fix — verify with your vendor whether a livepatch or ksplice-style binary patch is available and whether it is tested for your kernel ABI.

Operational analysis: why this matters beyond the single bug​

Availability as a first-class risk​

This CVE exemplifies how correctness issues in teardown and error‑recovery code produce outsized operational pain. A single thread deadlock in an error handler may be invisible until it occurs under a rare device failure scenario, but when it happens the result can be system‑wide service disruption, blocked shutdowns, or appliances that refuse to suspend or reboot cleanly.
For fleets and embedded appliances, the business impact is real: uncontrolled reboots, difficulty in automated recovery, and potential SLA breaches. The local attack vector reduces the odds of broad remote exploitation, but the vulnerability is valuable to an attacker who already has local presence or who can manipulate device redirection in virtualized environments.

The long tail: vendor kernels and debug builds​

  • Many enterprises run vendor-supplied or OEM kernels that are not updated on the same cadence as distribution kernels. Appliance maintainers and hardware OEMs often manage individual trees; they must be asked explicitly whether their images include the upstream fix.
  • Embedded devices and mobile platforms frequently use UFS; operators of specialized hardware must coordinate with vendors for firmware/kernel OTA updates.

Why small fixes are high-value​

The upstream change is deliberately small, making it easier for packagers to backport and test. Historically, the kernel community’s preference for surgical corrections in stable trees has ensured higher uptake of fixes like this one. The operational priority therefore becomes: identify affected hosts and install vendor updates promptly rather than attempting invasive engineering changes in production.

Practical checklist for administrators​

  • Inventory: find hosts with UFS controllers and ufshcd present.
  • Audit logs: search for runtime-PM/ufshcd resume failures and the supplied backtrace pattern.
  • Patch: install vendor kernel packages that map to the stable commits referenced by the kernel CVE announcement.
  • Test: in a staging environment, validate device attach/detach, suspend/resume, and error recovery after patching.
  • Hardening: restrict who can cause device resets or power-cycle operations in multi-tenant environments; for virtualized workloads, limit device passthrough/redirection that could let guests trigger host-side device state changes.
  • Monitor: add SIEM rules to detect kernel worker threads stuck during ufshcd error handling and alert for repeated runtime resume failures.

Cross‑validation and sources​

The technical description above is corroborated by multiple authoritative sources: the NVD entry for CVE‑2025‑38119 contains the same diagnosis and backtrace details published by the kernel CVE team, the official Linux kernel CVE announcement (linux‑cve‑announce) documents the affected file and provides the stable commit references, and vendor advisories (for example Ubuntu and Oracle’s CVE trackers) map the fix into distribution releases and list the CVSS assessment used to prioritize remediation. These independent confirmations make the diagnosis and remediation approach reliable for operators.
Note: the upstream commit and stable backport links referenced in the kernel CVE announcement are the canonical technical artifacts packagers should consult when mapping fixes to vendor kernels.

Risk summary and final recommendations​

  • Risk profile: Local, availability‑only. An attacker with local access or the ability to provoke device error/recovery can cause a kernel hang in ufshcd error handling. CVSS: Medium (≈5.5) as documented in vendor advisories.
  • Urgency: Moderate to high for hosts that actually load the UFS driver or run UFS hardware (embedded/mobile platforms, appliances). For general-purpose servers that do not expose UFS hardware, priority is lower.
  • Remediation: Apply vendor-supplied kernel updates that include the upstream stable commits, reboot, and validate recovery scenarios. If immediate patching is impossible, mitigate by removing or blacklisting the driver where feasible, restricting local device control, and scheduling a fast update window.

CVE‑2025‑38119 is a reminder that correctness in error and power‑management paths matters as much as correctness in fast I/O paths; a single ordering mistake can transform a transient device fault into a systemwide hang. The fix is targeted and low‑risk, and vendors have already folded it into stable kernels — the operational task now is straightforward: identify carriers in your inventory, apply vendor updates, and validate that your recovery and suspend/resume scenarios work cleanly after the patch.

Source: MSRC Security Update Guide - Microsoft Security Response Center
 
