CVE-2025-38091: Linux AMD DRM DML21 NULL Plane ID Guard Fix

A subtle missing check in the Linux kernel’s AMD DRM display code has been cataloged as CVE-2025-38091 and corrected upstream. The defect can produce kernel warnings and, in some circumstances, a local denial of service when the display stack hits an oops while querying a plane identifier through the DML21 wrapper.

Background

The Linux Direct Rendering Manager (DRM) is the kernel subsystem that mediates GPU access for mode-setting, buffer management and hardware-accelerated rendering. Within that system the AMD display driver (commonly referenced as the amdgpu DRM stack) contains a family of hardware-specific helpers and wrappers — DML21 among them — used to map internal display constructs to hardware registers and orchestration code.
CVE-2025-38091 is the identifier assigned to a defect in that code path: a missing or insufficient check when the DRM/DML21 wrapper attempts to derive a plane identifier (plane_id) from a supplied stream context. The symptom reported in kernel logs is a WARNING trace emitted from the dml2_map_dc_pipes call during a mode1 reset or recovery operation. In practice the warning appears when operators run the amdgpu recovery helper (for example, cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover) and the driver executes a path that expects valid per-plane state where none exists. This behavior is described in the Linux CVE announcement and mirrored across multiple distributor trackers.

What exactly went wrong (technical summary)

At the heart of the problem is an assumption in the AMD display resource mapping logic: when the code iterates display pipelines (often represented as struct pipe_ctx), it expects an associated plane_state pointer to be present when certain flags or conditions indicate an active plane. If that pointer is NULL and the code proceeds to read fields from plane_state (for example, to decide on scaling or to resolve plane identifiers), the kernel ends up dereferencing a NULL pointer. In the kernel that leads to an oops/warning and may require a reboot or module reload to restore normal display functionality.
Multiple independent summaries and patch notes describe the same root cause and the exact remediation: maintainers added defensive checks so code paths that call resource_build_scaling_params or that try to pull a plane_id from a DML wrapper now validate that the relevant plane_state (or stream id) is present before dereferencing. The upstream fix is deliberately small and surgical — the project’s guidance is to update kernels rather than attempt cherry-picks unless you maintain a tested backport pipeline.
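The guard pattern described above can be sketched in a few lines. This is a simplified model written in Python, not the kernel's C code: the PipeCtx and PlaneState classes, their fields, and the resolve_plane_id helper are hypothetical stand-ins chosen for illustration, and the real driver structures are considerably richer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlaneState:
    """Hypothetical stand-in for per-plane state; not the kernel struct."""
    plane_id: int


@dataclass
class PipeCtx:
    """Hypothetical stand-in for a display pipe; plane_state may be absent."""
    stream_id: Optional[int]
    plane_state: Optional[PlaneState]


def resolve_plane_id(pipe: PipeCtx, wanted_stream_id: int) -> Optional[int]:
    """Return the plane id for a pipe, or None when state is missing."""
    # Guard 1: the pipe must belong to the stream being queried.
    if pipe.stream_id is None or pipe.stream_id != wanted_stream_id:
        return None
    # Guard 2: never read fields from an absent plane_state.
    if pipe.plane_state is None:
        return None
    return pipe.plane_state.plane_id


def collect_plane_ids(pipes: List[PipeCtx], wanted_stream_id: int) -> List[int]:
    """Collect plane ids, skipping pipes that lack valid state."""
    ids = []
    for pipe in pipes:
        plane_id = resolve_plane_id(pipe, wanted_stream_id)
        if plane_id is not None:
            ids.append(plane_id)
    return ids
```

The point of the sketch is the ordering: both checks run before any field of the per-plane state is read, so a pipe left in a transient state during a reset is skipped rather than dereferenced.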

The DML21 wrapper and plane_id: why the name matters

DML21 is one of AMD’s driver-side wrappers that converts the kernel/driver’s abstract display model into the timing and pipe configuration used by the hardware. The wrapper contains helper functions that look up or compute per-plane identifiers. Because display pipelines are inherently asynchronous (hotplug events, mode changes, or resets can occur while the driver is working), the wrapper must guard against missing or transient state. The bug was a missing guard in the stream-id/plane-id resolution logic that, when exercised by a mode1 reset, surfaced as a warning and, under the right conditions, an availability-impacting kernel oops. This is the sequence that the kernel announcement and multiple advisories describe.

Evidence, provenance and cross-verification

This CVE was published on July 2, 2025 by the Linux kernel CVE announcement stream and has since been mirrored in the National Vulnerability Database and multiple distribution trackers. The linux-cve-announce mail entry provides the authoritative short description and points to the stable-tree commits that carry the defensive checks, with commit references you can use when mapping vendor patches to upstream code. That announcement also includes the exact kernel warning message that operators observed during testing.
The NVD entry and vendor trackers echo the same findings: a missing validation when querying plane_id in DML21 that can produce a false positive warning or an oops during mode resets. Distributors including Ubuntu, Debian and cloud-vendor advisories subsequently mapped the upstream commits into distribution package updates. In short, the vulnerability is well-documented and the fix is present in the stable kernel trees and in vendor kernels where the maintainers have applied the cherry-picks.
One point of friction in the public record is worth noting: high-level trackers differ on scoring. Some sources present a CVSS base score around 5.5 (a medium score aligned with an availability-focused, local vector) while others list a higher score of 7.8. These differences reflect divergent assessor choices for impact and scope rather than contradictory technical facts; the underlying technical reality remains a local availability issue fixed upstream. Operators should therefore treat distribution advisories and upstream commits as the authoritative sources for remediation, and read scoring variations as context-dependent.
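To see how two assessors can land on 5.5 and 7.8 for the same bug, it helps to run the CVSS v3.1 base-score arithmetic. The sketch below assumes two vectors that are not stated in the advisories quoted here: AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H (availability only) and AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H (full impact). The metric weights and the rounding rule come from the CVSS v3.1 specification; the function names are illustrative.

```python
import math

# CVSS v3.1 metric weights for a scope-unchanged vector.
AV_LOCAL = 0.55
AC_LOW = 0.77
PR_LOW = 0.62
UI_NONE = 0.85
HIGH = 0.56
NONE = 0.0


def roundup(value: float) -> float:
    """CVSS v3.1 Roundup: smallest one-decimal number >= value."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(conf: float, integ: float, avail: float) -> float:
    """Base score for AV:L/AC:L/PR:L/UI:N/S:U with the given C/I/A weights."""
    iss = 1 - (1 - conf) * (1 - integ) * (1 - avail)
    impact = 6.42 * iss
    exploitability = 8.22 * AV_LOCAL * AC_LOW * PR_LOW * UI_NONE
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score(NONE, NONE, HIGH))  # availability-only vector: 5.5
print(base_score(HIGH, HIGH, HIGH))  # full-impact vector: 7.8
```

Under these assumed vectors the exploitability term is identical in both cases, so the whole gap comes from whether the assessor credits confidentiality and integrity impact in addition to availability.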

Impact and exploitability: who should worry most

  • Attack vector: local only. The vulnerable paths are triggered by code that runs inside the kernel as a result of local APIs interacting with the DRM device nodes (for example /dev/dri/*). That means a remote (network-only) attacker cannot trigger this condition unless they already have a way to run code locally or to cause local processes to exercise the vulnerable DRM paths.
  • Privilege required: low. Many legitimate user-space components (compositors, media players, GPU-accelerated workloads and containers with device access) open DRM device nodes without root. That makes exploitation realistic on shared systems where untrusted users or processes have DRM access.
  • Impact focus: availability. The defect is availability-first: it can cause driver oopses, GPU hangs, repeated kernel warnings and in the worst case a system that needs rebooting or module unloading to recover. There is no authoritative public evidence that the bug alone results in arbitrary code execution, privilege escalation, or confidentiality loss; however, kernel oops primitives are often useful building blocks in complex exploit chains, so the conservative posture is to prioritize remediation where exposure exists.
  • High-value targets: multi-user servers, CI runners, virtualized hosts with GPU passthrough, container platforms where untrusted jobs might be allowed DRM access, and cloud images that expose GPUs to guests. Single-user desktops where only trusted software runs are lower priority but not immune.

Detection: what to monitor for

Operators and SREs should add the following checks to their monitoring and log alerting:
  • Kernel logs (journalctl or dmesg) for warnings that include the dml2_map_dc_pipes function or references to dml2_dc_resource_mgmt.c and amdgpu. The original warning line frequently cited in advisories is a clear indicator that the code path was exercised.
  • Repeated amdgpu oops traces, pageflip failures, or sudden display hangs coincident with user activity that accesses DRM nodes.
  • Unusual probe/unprobe cycles of the amdgpu module or device nodes being repeatedly opened and closed.
  • For fleets, centralize kernel oops and driver WARN strings so patterns (repeated oops on multiple hosts after a similar workload) surface quickly.
If these patterns appear, treat the system as a high-priority remediation candidate even before a patch is applied; the occurrence of warning traces is a real operational issue rather than a mere verbosity problem.
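The first check in the list above can be automated with a small filter. This is a minimal sketch, assuming kernel log lines are supplied as text (for example, captured from journalctl -k or dmesg); the marker strings are the function and file names cited above, while the find_dml2_hits helper and its simple substring matching are illustrative choices rather than an established detection rule.

```python
from typing import Iterable, List

# Function and file names cited in the advisories; matching is by substring.
MARKERS = ("dml2_map_dc_pipes", "dml2_dc_resource_mgmt.c")


def find_dml2_hits(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention any DML2 marker string."""
    hits = []
    for line in lines:
        if any(marker in line for marker in MARKERS):
            hits.append(line.rstrip("\n"))
    return hits


def host_needs_attention(lines: Iterable[str]) -> bool:
    """True when at least one marker line is present in the log."""
    return len(find_dml2_hits(lines)) > 0
```

For fleet use, the same function can run over centrally collected kernel logs, with the per-host hit count forwarded to whatever alerting pipeline is already in place.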

Mitigation and remediation

The complete fix is to install a kernel that includes the upstream stable commit(s) that add the necessary stream-id / plane_state checks. The linux-cve-announce message points to three stable-tree commits that contain the correction; maintainers recommend updating to a patched kernel release or applying the stable cherry-picks via your distribution vendor packages rather than ad-hoc local changes.
If you cannot immediately update the kernel, these temporary mitigations reduce exposure — each carries trade-offs and must be weighed against operational impact:
  • Restrict access to DRM device nodes. Tighten /dev/dri/* permissions so only trusted accounts and services may open the device files; remove untrusted users from groups that grant DRM access. This reduces the attack surface by preventing unprivileged, untrusted processes from exercising the vulnerable code path.
  • Unload or blacklist the amdgpu module. This guarantees the driver cannot be exercised, but it removes GPU acceleration and may render displays unusable on systems that depend on amdgpu for the primary console or X/Wayland sessions. Use only when you can tolerate the loss of GPU functionality.
  • Lock down container/device passthrough settings. Ensure containers are not given direct access to host DRM devices unless strictly necessary; use mediated devices or isolate GPU access via vendor-specific tooling if possible.
  • Where possible, limit non-root users’ ability to run arbitrary workloads that access GPU device nodes (e.g., by applying IAM controls, container runtime policies, or VM isolation).
These mitigations are intended for emergency response only. The recommended long-term solution is to install vendor-provided patched kernels and perform the normal testing and rollouts.
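Before tightening permissions on DRM device nodes, it is useful to know what they currently are. The following is a rough audit sketch, assuming the conventional /dev/dri/* layout; the audit_drm_nodes and world_accessible names are illustrative, and the script only reports permissions without changing anything.

```python
import glob
import os
import stat
from typing import List, Tuple


def world_accessible(mode: int) -> bool:
    """True when 'other' users have read or write permission."""
    return bool(mode & (stat.S_IROTH | stat.S_IWOTH))


def audit_drm_nodes(pattern: str = "/dev/dri/*") -> List[Tuple[str, str, bool]]:
    """Return (path, octal permissions, world_accessible) per character device."""
    report = []
    for path in sorted(glob.glob(pattern)):
        st = os.stat(path)
        if not stat.S_ISCHR(st.st_mode):
            continue  # skip directories such as by-path/ and anything else
        perms = stat.S_IMODE(st.st_mode)
        report.append((path, oct(perms), world_accessible(perms)))
    return report


if __name__ == "__main__":
    for path, perms, exposed in audit_drm_nodes():
        flag = "world-accessible" if exposed else "restricted"
        print(f"{path} {perms} {flag}")
```

Nodes flagged as world-accessible are the first candidates for review, but group membership matters too: a node with mode 0o660 is still reachable by every account in its owning group.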

Patching — practical steps (a checklist)

  • Inventory: Enumerate hosts with amdgpu in use. On each host, check whether the amdgpu driver is loaded and note the kernel version. Use package management or fleet telemetry to map kernel versions to vendor advisories.
  • Consult vendor advisories: Check your distribution’s security advisory (Ubuntu USN, Debian tracker, Amazon Linux ALAS, SUSE advisories, etc.) for the mapping from CVE to package version. Use the advisory to find the correct kernel package for your release.
  • Test: In a staging environment, install the patched kernel and validate critical display and GPU workflows: compositors, virtual GPU passthrough, ROCm/OpenCL jobs, and any vendor GPU tooling your workloads use.
  • Deploy: Schedule the kernel rollout across your fleet with standard change control. Because kernel updates typically require reboots, coordinate downtime for hosts where necessary.
  • Verify: After patching and rebooting, confirm there are no new dmesg WARN traces referencing dml2_map_dc_pipes or related amdgpu warnings. Also validate that workload performance and stability meet expectations.
  • Monitor: Continue to ingest and alert on kernel WARN/OOPS logs and update your fleet baseline to the patched kernel versions.
  • Document: For audit and compliance, record deployed kernel versions and the CVE mapping used for the patch cycle.
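The inventory step at the top of the checklist can be sketched as a per-host probe. This assumes a Linux host where /proc/modules lists loaded modules with the module name as the first field on each line; the module_loaded and inventory_host helpers are illustrative names, and mapping the reported kernel release to a fixed package version still has to come from your vendor advisory.

```python
import platform
from typing import Dict


def module_loaded(proc_modules_text: str, name: str = "amdgpu") -> bool:
    """Check /proc/modules content: the module name is the first field."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def inventory_host(proc_modules_path: str = "/proc/modules") -> Dict[str, object]:
    """Collect the two facts the checklist asks for on one host."""
    try:
        with open(proc_modules_path, encoding="utf-8") as handle:
            text = handle.read()
    except OSError:
        text = ""  # non-Linux host or unreadable /proc
    return {
        "kernel_release": platform.release(),
        "amdgpu_loaded": module_loaded(text),
    }
```

Matching on the first field rather than on a substring avoids false positives from modules that merely list amdgpu as a dependent, such as the drm core.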

Critical analysis: strengths of the fix and residual risks

Strengths
  • Scope and intent: The upstream fix is small, targeted and defensive. Rather than rearchitecting the display stack, maintainers added existence checks to avoid NULL dereferences, which is the correct engineering posture for a reliability issue in a mature codebase. The small patch scope makes it amenable to stable-tree cherry-picks and vendor backports.
  • Vendor response: Multiple distributors and cloud vendors have already mapped the upstream commits into their advisories and shipped patched kernels. That indicates the kernel community judged the fix safe to backport and vendors prioritized it for release.
Residual risks and caveats
  • Backport completeness: Small fixes are sometimes backported inconsistently across vendor kernels (different stable branches, different patch stacks). Operators should verify that the patch applied in their vendor kernel covers all code paths that could be hit in their environment — vendor changelogs and stable commit lists are your friend. The linux-cve-announce message includes commit references specifically for that reason.
  • OOT (out-of-tree) and unsigned modules: Systems running third-party or out-of-tree GPU modules can still be vulnerable even if the mainline kernel is patched. Tainted kernels and unsigned modules can reintroduce unsafe code interacting with the DRM stack; validate your module inventory and prefer vendor-supplied drivers where possible.
  • Attack chaining: While this defect is availability-focused, any kernel oops primitive can be reused in more complex exploit chains if other memory-safety issues exist. Hardening and monitoring of kernel telemetry remain essential defenses even after patch deployment.
  • Operational impact of mitigations: Blacklisting amdgpu or restricting device nodes has real operational costs. Test mitigations carefully before applying them to production to avoid unintended outages.

Recommended prioritization

  • Immediate (highest): Shared hosts with untrusted local users, container hosts with device passthrough, CI runners that run unverified workloads, and cloud hosts offering GPU access to multiple tenants. Patch these systems first.
  • High: Workstations used by developers that run arbitrary builds or tests and any systems used to process untrusted media or user-provided workloads.
  • Normal: Desktop systems used by a single, trusted user. Patch these on the regular maintenance schedule unless your environment allows untrusted code to run.

Final takeaways

CVE-2025-38091 is an availability-first vulnerability in the AMD DRM display stack that arises from insufficient validation when the DML21 wrapper attempts to derive a plane_id. The defect is real, well-documented and fixed upstream in the stable kernel trees; vendor advisories and distribution updates have followed. The fix is a classic defensive kernel-patching story — add the missing NULL/stream checks, cherry-pick to stable branches, and ship the updated kernel packages. Operators should treat multi-tenant and GPU-exposed systems as high priority for patching, use temporary mitigations only as stopgaps, and verify that vendor-provided updates include the correct stable commits before rolling them out. Vigilant log monitoring for the characteristic dml2_map_dc_pipes warning and careful inventory of DRM device access are immediate practical steps you can take while scheduling the kernel updates outlined above.

Source: MSRC Security Update Guide - Microsoft Security Response Center
 
