CVE-2024-41011: Linux AMD KFD MMIO Mapping Bug and Patch Guide

A subtle mapping bug in the Linux kernel’s AMD Kernel Fusion Driver (KFD) — tracked as CVE-2024-41011 — can cause the driver to map more MMIO (memory‑mapped I/O) space than intended on systems that use page sizes larger than 4 KB, exposing a full PAGE_SIZE of device MMIO instead of the single 4 KB HDP register window the code expects. The result is unnecessary and potentially dangerous access to GPU register space; the issue was fixed upstream in July 2024 and has been backported by distributions, but unpatched systems with AMD GPUs and the KFD/AMDGPU driver present remain at risk.

Background / Overview

The bug lives in the drm/amdkfd code path — the part of the Linux DRM/AMDGPU subsystem that implements the AMD Kernel Fusion Driver (KFD) used by ROCm and other GPU compute stacks to manage compute contexts and MMU mappings for AMD GPUs. KFD provides userspace interfaces that allow compute workloads (and the drivers that orchestrate them) to interact with GPU hardware, including specific HDP flush registers that userspace can use to flush caches after modifying GPU-accessible VRAM.
On affected kernels the driver remaps those HDP flush registers into a small, unused 4 KB region of the GPU’s register BAR. The original intent is to expose exactly that 4 KB window so userspace (or the GPU itself) can access only the specific flush registers. On systems configured to use large page sizes (that is, page sizes larger than the canonical 4 KB), the mapping logic fails to use a 4 KB-aligned, 4 KB-sized mapping and instead ends up exposing the entire system PAGE_SIZE worth of MMIO. That mismatch means code or userspace that is permitted to access the remapped region can now read and write more MMIO registers than were intended — increasing the attack surface and the potential for information disclosure, hardware state tampering, or availability impacts.
Multiple vulnerability databases and vendor advisories classify the issue as a high‑impact, local vulnerability: NVD lists confidentiality, integrity and availability impacts as High and assigns a CVSS v3.1 vector showing a local attack vector and low attack complexity. Distributions and vendors published fixes and backports across stable kernel branches; the patch series was merged into multiple stable trees and distributed through vendor kernel updates.

Why this is dangerous (technical root cause)​

What is being mapped wrong​

  • The HDP flush registers are ordinarily remapped into a reserved 4 KB subwindow inside the GPU’s register Base Address Register (BAR) space.
  • The kernel can only map memory into a process in whole pages. On a kernel built with a base page size larger than 4 KB (for example, 64 KB pages on some arm64 or ppc64 configurations), the smallest possible mapping is therefore already larger than the intended 4 KB target.
  • Instead of restricting access to the 4 KB HDP window, the kernel ends up mapping an entire PAGE_SIZE region of MMIO into the process’s address space — exposing registers located outside the intended remapped slot.
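The size mismatch described in the list above is simple arithmetic. The following is a minimal sketch, assuming only what the text states (a 4 KB HDP window and a mapping that covers one full page); the function name `mmio_exposure` is hypothetical and does not come from the kernel source.

```python
HDP_WINDOW_BYTES = 4 * 1024  # intended size of the HDP remap window, per the text


def mmio_exposure(page_size: int, window: int = HDP_WINDOW_BYTES) -> dict:
    """Return how much MMIO a single page-granular mapping exposes.

    Illustrative model only: it assumes the mapping covers exactly one page.
    """
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page_size must be a positive power of two")
    mapped = max(page_size, window)  # a mapping cannot be smaller than one page
    return {
        "mapped_bytes": mapped,
        "unintended_bytes": mapped - window,
        "windows_exposed": mapped // window,
    }


if __name__ == "__main__":
    for size in (4096, 16384, 65536):
        print(size, mmio_exposure(size))
```

With 4 KB pages nothing extra is exposed; with 64 KB pages the same mapping covers sixteen 4 KB slots, fifteen of which were never meant to be reachable.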

Consequences of extra MMIO exposure​

MMIO registers control hardware behavior. If an attacker (or even unprivileged but compromised userspace code) can access registers they shouldn’t, the following are realistic risks:
  • Confidentiality leakage: Some MMIO registers or registers’ side effects can leak internal GPU state or data derived from memory the GPU handles.
  • Integrity tampering: Unintended writes to control registers could alter GPU or system behavior, corrupt compute workloads, or subvert firmware-managed operations.
  • Availability / Denial‑of‑Service: Misconfigured writes to hardware control paths can hang devices, force resets, or trigger kernel oops/panics — producing sustained or persistent loss of availability. NVD explicitly flags high impact to confidentiality, integrity, and availability for this CVE.
All of these outcomes are particularly acute for multi‑tenant or shared platforms (cloud hosts, container hosts, multi‑user workstations) where GPU access is granted to untrusted or semi‑trusted workloads.

Who is affected​

  • Systems running Linux kernels that include the upstream drm/amdkfd code and a supporting AMD GPU are the primary population at risk.
  • The bug requires the KFD code path to be present; systems without AMD GPUs, or with KFD disabled/blacklisted at boot, will not be vulnerable in practice.
  • Multiple kernel series were patched; vulnerability tracking lists show the issue introduced in various release windows and fixed in specific stable kernel releases (for example fixes landed in stable trees with backports for the 5.4, 5.10, 5.15, 6.1 and newer stable series). OSV and distribution trackers list the affected ranges and the fixed versions.
The practical upshot: if you run AMD GPUs and your kernel predates the vendor/distribution fix (or you run a vendor kernel that has not incorporated the backport), you are exposed. Cloud images, virtual machine hosts that expose GPUs, machine learning servers, and developer workstations with ROCm or GPU compute stacks are the most likely to load KFD and therefore the most at risk. Vendor advisories (Debian LTS, Amazon ALAS, SUSE, Ubuntu and others) have published errata and patched kernels; operators should consult their distribution notices and apply the updates. (lists.debian.org)
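A quick first-pass triage can test the two preconditions named above: whether KFD is present on the host, and whether the host uses pages larger than 4 KB. The sketch below is an assumption-laden illustration rather than a vulnerability scanner. It uses `/dev/kfd` (the device node KFD normally creates, which the original text does not name) as the presence signal, and the labels it returns are hypothetical. It cannot tell whether the running kernel already carries the fix.

```python
import mmap
import os


def kfd_exposure(page_size: int, kfd_present: bool) -> str:
    """Classify a host using only the two preconditions from the text."""
    if not kfd_present:
        return "not-applicable"    # no KFD device node, code path unreachable
    if page_size <= 4096:
        return "kfd-4k-pages"      # page size matches the 4 KB window
    return "check-kernel-fix"      # KFD present and pages larger than 4 KB


def probe_host(kfd_node: str = "/dev/kfd") -> str:
    """Probe the local host; kfd_node is an assumed default path."""
    return kfd_exposure(mmap.PAGESIZE, os.path.exists(kfd_node))
```

Calling `probe_host()` on each machine gives a coarse bucket to sort the inventory by; hosts in the `check-kernel-fix` bucket are the ones whose kernel package version should be compared against the vendor advisory first.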

Patch status and distribution guidance​

Upstream kernel maintainers merged one or more corrective commits into the stable trees; distributions produced updates for the major kernel branches and backported the fix into their long‑term support kernels. Public trackers and advisories list the fixed kernels and the vendor packages you should install. Representative vendor actions include:
  • Debian LTS published DLA-3912‑1 (linux 5.10.226‑1) which includes a fix that closes CVE‑2024‑41011 as part of a larger kernel security update. If you run Debian LTS kernels, apply the DLA update. (lists.debian.org)
  • Amazon Linux (ALAS) issued ALASKERNEL/ALAS advisories and published fixed kernel packages for both 5.4 and 5.10 kernel tracks.
  • SUSE and Ubuntu security trackers list the CVE and identify the patched package versions for their distributions.
  • The OSV database and CVE trackers list the exact kernel commits and fixed stable versions (for example, fixes were recorded as part of the 5.4.283, 5.10.225, 5.15.166 and 6.1.91 stable updates and equivalent backports). Systems using those fixed or later snapshots are no longer vulnerable in the kernel code path.
Because kernel packaging and backport timelines vary by vendor, your single best action is to apply the vendor-supplied kernel update for your distribution or cloud image. If you operate a custom kernel, merge the upstream stable patch series into your tree and rebuild, or upgrade to a stable version that already contains the fix.
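For kernels built directly from upstream stable releases, the fixed versions quoted above can be compared against the running release string. This is a rough sketch under two assumptions: the version numbers are copied from the text as-is and not independently verified, and the helper names are hypothetical. Distribution kernels often keep an old upstream version number while carrying backports, so a "predates-fix" result on a vendor kernel means "check the vendor advisory", not "vulnerable".

```python
import re

# Fixed stable releases as quoted in the text above (not independently verified).
FIXED_IN = {
    (5, 4): 283,
    (5, 10): 225,
    (5, 15): 166,
    (6, 1): 91,
}


def parse_release(release: str):
    """Extract (major, minor, patch) from a string such as '5.10.226-1-amd64'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def upstream_status(release: str) -> str:
    """Compare a release string against the quoted upstream fix versions."""
    major, minor, patch = parse_release(release)
    fixed_patch = FIXED_IN.get((major, minor))
    if fixed_patch is None:
        return "unknown-series"   # series not listed in the text
    return "fixed-upstream" if patch >= fixed_patch else "predates-fix"
```

On a live host the input would typically be `platform.release()`; series that are not in the table return `unknown-series` so the result is never silently optimistic.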

Practical mitigations and temporary workarounds​

Operators can take three steps, in order of priority; the second and third are stopgaps for hosts that cannot be patched quickly:
  • Patch as soon as possible. The only complete mitigation is to run a kernel that includes the upstream fix or a vendor backport. Distributions have issued fixes; apply them. (lists.debian.org)
  • Limit or remove access to the GPU. If patching is delayed, restrict who can open the GPU device nodes (/dev/kfd for KFD, /dev/dri for the DRM nodes):
      • Remove untrusted users from groups that control GPU devices (commonly the “video” and “render” groups).
      • Revoke device access in container runtimes (do not bind /dev/kfd or /dev/dri into untrusted containers).
      • Disable passthrough of the GPU to guest VMs that host untrusted workloads.
  • Disable or blacklist the KFD/amdgpu modules where feasible. This prevents the vulnerable code path from loading, at the cost of disabling GPU compute features (and in some cases display). On current kernels KFD is typically built into the amdgpu module rather than shipped as a separate amdkfd module, so in practice this usually means blacklisting amdgpu itself. For ephemeral test systems or hosts that do not require GPU acceleration, this may be an acceptable temporary mitigation. Note: blacklisting can be disruptive; know the operational consequences before applying it.
Additional controls that reduce attack surface:
  • Enforce strict host isolation for GPU‑enabled hosts (no untrusted user logins).
  • Avoid allowing non‑privileged users to mmap or otherwise control GPU MMIO mappings.
  • Monitor audit logs for suspicious access to GPU device nodes and abnormal kernel oops or reset events.
Caveat: some of these mitigations (module blacklisting, full device lockdown) may not be viable for production GPU‑accelerated workloads. Applying the vendor kernel update remains the recommended course.
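Before tightening access, it helps to see who can currently open the GPU device nodes. The following is a minimal sketch that reports the permission bits on each node. The default paths (`/dev/kfd`, `/dev/dri/renderD*`, `/dev/dri/card*`) are assumptions about a typical Linux layout, the function names are hypothetical, and the check ignores ACLs, container device cgroups and other access paths.

```python
import glob
import os
import stat


def audit_node(path: str) -> dict:
    """Summarise ownership and permission bits for one device node."""
    st = os.stat(path)
    mode = st.st_mode
    return {
        "path": path,
        "mode": oct(stat.S_IMODE(mode)),
        "gid": st.st_gid,
        "world_accessible": bool(mode & (stat.S_IROTH | stat.S_IWOTH)),
        "group_writable": bool(mode & stat.S_IWGRP),
    }


def audit_gpu_nodes(patterns=("/dev/kfd", "/dev/dri/renderD*", "/dev/dri/card*")):
    """Audit every node matching the (assumed) default GPU device patterns."""
    nodes = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            nodes.append(audit_node(path))
    return nodes
```

Any node reported as `world_accessible` deserves attention on an unpatched host; for group-writable nodes, the next step is to review the membership of the owning group identified by `gid`.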

Exploitability, scoring variance, and what the numbers mean​

Public trackers show some variance in how the vulnerability is scored:
  • NVD reports a high score and lists High impact across confidentiality, integrity and availability (CVSS v3.1 vector with local attack vector and low privileges required).
  • Some vendors and advisories present different numeric scores or contextual severities (for instance Oracle’s internal CVE listing showed a comparatively lower impact in their advisory text, while other vendor trackers show medium). These differences reflect divergent scoring assumptions — such as whether the attacker needs credentials to access the device node, whether a target system actually exposes KFD to unprivileged processes, and how likely an attacker could reliably exploit register reads/writes to cause meaningful disclosure or privilege elevation.
Important nuance: the vulnerability is a local hardware‑interaction bug. Successful exploitation requires local code execution that can reach the GPU register mapping path, or processes that are already permitted to interact with GPU device resources. That requirement keeps large classes of systems safe by default (systems without AMD GPUs, or systems where GPU device access is tightly controlled), but it leaves shared GPU hosts and developer machines with less isolation exposed. The practical implication: investigate who can bind GPU reservations, open GPU device nodes, or run container images with GPU access on your infrastructure, and prioritize patching those hosts.
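To see how scoring assumptions move the number, the CVSS v3.1 base score can be recomputed by hand. The sketch below implements the scope-unchanged base equations and metric weights from the FIRST CVSS v3.1 specification. The metrics taken from the text are local attack vector, low attack complexity, low privileges required and High confidentiality, integrity and availability impact; "no user interaction" and "scope unchanged" are assumptions added here to complete the vector.

```python
import math

# CVSS v3.1 metric weights for the scope-unchanged case.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR = {"N": 0.85, "L": 0.62, "H": 0.27}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(value: float) -> float:
    """Round up to one decimal place using the integer method from the spec."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(av, ac, pr, ui, c, i, a) -> float:
    """CVSS v3.1 base score, scope unchanged only."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


if __name__ == "__main__":
    print(base_score("L", "L", "L", "N", "H", "H", "H"))  # low privileges required
    print(base_score("L", "L", "H", "N", "H", "H", "H"))  # high privileges required
```

Under these assumptions the vector evaluates to 7.8; changing only "privileges required" from Low to High drops it to 6.7, which shows how a different view of who can reach the device node is enough to shift a tracker's severity band.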

Detection and forensics​

There is no widely‑published public proof‑of‑concept exploiting this exact mapping error, and public CVE and vendor notes did not report in‑the‑wild exploitation at the time they were published. That said, administrators should treat suspicious behavior around GPU device nodes and GPU resets as higher priority.
Look for these indicators:
  • Unexplained kernel oops, hung GPU driver messages, or device resets in kernel logs (dmesg/syslog) that coincide with processes that access /dev/dri or KFD interfaces.
  • Unexpected mmap() operations performed on GPU device character devices by userland processes (audit or system call tracing can reveal these).
  • Processes that create or manipulate GPU compute queues (or compute runtimes) without a clear business need.
If you detect anomalous access, isolate the host, preserve ring buffers and kernel logs, and capture a memory snapshot for offline analysis. Because this defect exposes MMIO ranges, suspicious register values and spurious writes in logs may be present around the time of an incident.
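The log indicators above can be turned into a first-pass filter over saved kernel logs. This is a sketch only: the regular expressions are illustrative guesses at common wording, real amdgpu and KFD messages differ between kernel versions, and the function name is hypothetical. Treat matches as prompts for manual review, not as evidence of exploitation.

```python
import re

# Illustrative patterns only; adjust to the messages your kernels actually emit.
INDICATORS = [
    re.compile(r"amdgpu.*gpu reset", re.IGNORECASE),
    re.compile(r"amdgpu.*\b(timeout|hang)\b", re.IGNORECASE),
    re.compile(r"\b(Oops|BUG|Kernel panic)\b"),
    re.compile(r"\bkfd\b.*(error|fail)", re.IGNORECASE),
]


def flag_lines(lines):
    """Return (line_number, text) pairs for lines matching any indicator."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in INDICATORS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Feeding it the lines of a preserved `dmesg` capture (for example `flag_lines(open("dmesg.txt"))`, where the file name is a placeholder) yields the line numbers to correlate with process and audit records from the same time window.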

Why cloud and vendor attestations matter here​

Large cloud providers and enterprise vendors map upstream CVEs to their product inventories in different ways. Microsoft’s Security Response Center (MSRC) uses product‑scoped attestations such as “Azure Linux includes this open‑source library and is therefore potentially affected” to indicate their inventory findings for a given product image; this is useful as an operational attestation but doesn’t automatically translate to all Microsoft products or images running the same upstream code. The same caution applies for other vendors’ attestations: treat product‑level statements as scope‑limited and perform host‑level checks in your environment. (lists.debian.org)
Cloud operators should be particularly attentive: GPU‑enabled cloud instances and VM hosts that allow customer payloads to access GPU devices are higher‑risk. Ensure your cloud images receive the vendor kernel updates or supply patched kernels in your VM templates. Many cloud Linux images have already been updated in vendor channels — check the cloud provider’s security bulletins and the distribution CVE trackers for the exact image/kernel dates.

Step‑by‑step remediation checklist (recommended order)​

  • Inventory: identify all hosts with AMD GPUs or kernels that include the amdkfd module.
      • Use lsmod and /proc/devices to confirm KFD/amdgpu loading.
  • Patch: apply vendor/kernel updates that include the CVE fix (use your distribution’s kernel packages or merge upstream stable commits into custom kernels). Confirm package versions against the vendor advisory.
  • Restrict: until patched, remove untrusted user access to GPU device nodes and containers; blacklist KFD/amdkfd if you can afford to disable GPU compute.
  • Monitor: enable enhanced logging around GPU device nodes, watch for kernel resets and unexpected mmap activity.
  • Harden: for shared hosts, enforce stricter multi‑tenant isolation for GPUs (prevent untrusted tenant access to /dev/dri, avoid direct GPU passthrough where possible).
  • Review: after applying patches, validate that the amdkfd mapping behavior no longer exposes PAGE_SIZE beyond the intended 4 KB region (regression tests in staging are recommended for high‑risk hosts). (lists.debian.org)

Strengths of the upstream fix and remaining risks​

  • The upstream remediation is focused and surgical: the stable tree patches prevent large‑page mappings or otherwise ensure the mapping logic computes the correct offset and size, so the HDP flush register window is confined to the intended 4 KB range.
  • Distribution and vendor backports are available for major LTS and stable kernel branches; many mainstream distributions have already released updates.
Remaining caveats and risks:
  • Distribution lag and custom kernels: operators who run vendor‑modified kernels, heavily patched custom kernels, or uncommon LTS branches must verify the patch is present; a missing backport leaves the host vulnerable.
  • Operational impact of mitigations: blacklisting modules or disabling GPU access is disruptive for compute workloads; operators must balance security vs. availability and schedule patch windows accordingly.
  • Attack surface in shared GPU environments: cloud providers and multitenant GPU hosts must prioritize remediation because the downside of MMIO exposure in a multi‑tenant environment is higher than on single‑user developer machines.

Final assessment and actionable guidance​

CVE‑2024‑41011 is a local bug with confidentiality, integrity and availability impact that results from incorrect MMIO mapping granularity in drm/amdkfd when systems use page sizes larger than 4 KB. The upstream fix is small and available; major distributions and cloud vendors published patched kernels and advisories. Exploitation requires access to GPU device paths or userland code capable of mapping GPU MMIO, but that local requirement does not make the issue minor in practice: on shared GPU hosts the local vector is both realistic and operationally dangerous.
Action items for administrators (short list):
  • Inventory your fleet for AMD GPUs and kernels that include amdkfd.
  • Apply vendor kernel updates that contain the CVE fix as a priority.
  • Until patched, restrict GPU device access and remove untrusted users from groups that control GPU nodes; consider temporary module blacklisting if you can accept losing GPU compute.
  • Monitor GPU logs and audit for suspicious mapping or reset activity.
This vulnerability is a reminder that hardware register mappings are a sensitive boundary between kernel, userspace and device firmware — small miscalculations in mapping granularity can produce outsized security consequences. Treat GPU device access as a privileged resource, apply the fixes, and verify your environment’s attack surface for GPU‑exposed hosts. (lists.debian.org)
Conclusion: CVE‑2024‑41011 is fixed upstream and in distribution trees, but unpatched AMD GPU hosts remain at practical risk; patching and sensible device‑access restrictions are the reliable, immediate defenses.

Source: MSRC Security Update Guide - Microsoft Security Response Center
 
