A subtle mapping bug in the Linux kernel’s AMD Kernel Fusion Driver (KFD) — tracked as CVE-2024-41011 — can cause the driver to map more MMIO (memory‑mapped I/O) space than intended on systems that use page sizes larger than 4 KB, exposing a full PAGE_SIZE of device MMIO instead of the single 4 KB HDP register window the code expects. The result is unnecessary and potentially dangerous access to GPU register space; the issue was fixed upstream in July 2024 and has been backported by distributions, but unpatched systems with AMD GPUs and the KFD/AMDGPU driver present remain at risk.
Background / Overview
The bug lives in the drm/amdkfd code path, the part of the Linux DRM/AMDGPU subsystem that implements the AMD Kernel Fusion Driver (KFD) used by ROCm and other GPU compute stacks to manage compute contexts and MMU mappings for AMD GPUs. KFD provides userspace interfaces that allow compute workloads (and the drivers that orchestrate them) to interact with GPU hardware, including specific HDP flush registers that userspace can use to flush caches after modifying GPU-accessible VRAM.
On affected kernels the driver remaps those HDP flush registers into a small, unused 4 KB region of the GPU’s register BAR. The original intent is to expose exactly that 4 KB window so userspace (or the GPU itself) can access only the specific flush registers. On systems configured to use large page sizes (that is, page sizes larger than the canonical 4 KB), the mapping logic fails to use a 4 KB-aligned, 4 KB-sized mapping and instead ends up exposing the entire system PAGE_SIZE worth of MMIO. That mismatch means code or userspace that is permitted to access the remapped region can now read and write more MMIO registers than were intended, which increases the attack surface and the potential for information disclosure, hardware state tampering, or availability impacts.
Multiple vulnerability databases and vendor advisories classify the issue as a high‑impact, local vulnerability: NVD lists confidentiality, integrity and availability impacts as High and assigns a CVSS v3.1 vector showing a local attack vector and low attack complexity. Distributions and vendors published fixes and backports across stable kernel branches; the patch series was merged into multiple stable trees and distributed through vendor kernel updates.
Why this is dangerous (technical root cause)
What is being mapped wrong
- The HDP flush registers are ordinarily remapped into a reserved 4 KB subwindow inside the GPU’s register Base Address Register (BAR) space.
- When the kernel runs with a base page size larger than 4 KB (for example, 64 KB pages on some arm64 or ppc64 configurations), the smallest unit it can map is a full page, so the mapping granularity no longer matches the intended 4 KB target.
- Instead of restricting access to the 4 KB HDP window, the kernel ends up mapping an entire PAGE_SIZE region of MMIO into the process’s address space — exposing registers located outside the intended remapped slot.
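The granularity mismatch described in the bullets above comes down to simple address arithmetic. The sketch below is a simplified model and not the driver's code: it assumes a one-page mapping begins at the page-aligned address that contains the remapped window, and the function names and the example address are hypothetical.

```python
from typing import Tuple

HDP_WINDOW_SIZE = 4096  # the 4 KB window the driver intends to expose


def exposed_mmio_range(remap_addr: int, page_size: int) -> Tuple[int, int]:
    """Return (start, end) of the MMIO span covered by a one-page mapping.

    Assumption: the mapping starts at the page frame containing remap_addr,
    so the start address is remap_addr rounded down to a page boundary.
    """
    start = remap_addr & ~(page_size - 1)
    return start, start + page_size


def extra_bytes_exposed(remap_addr: int, page_size: int) -> int:
    """Bytes of MMIO mapped beyond the intended 4 KB window."""
    start, end = exposed_mmio_range(remap_addr, page_size)
    return (end - start) - HDP_WINDOW_SIZE
```

With a made-up remap address of `0xFCE7F000`, `extra_bytes_exposed(0xFCE7F000, 4096)` is 0, while `extra_bytes_exposed(0xFCE7F000, 65536)` is 61440: a 64 KB page pulls in 60 KB of neighbouring register space.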
Consequences of extra MMIO exposure
MMIO registers control hardware behavior. If an attacker (or even unprivileged but compromised userspace code) can access registers they shouldn’t, the following are realistic risks:
- Confidentiality leakage: Some MMIO registers or registers’ side effects can leak internal GPU state or data derived from memory the GPU handles.
- Integrity tampering: Unintended writes to control registers could alter GPU or system behavior, corrupt compute workloads, or subvert firmware-managed operations.
- Availability / Denial‑of‑Service: Misconfigured writes to hardware control paths can hang devices, force resets, or trigger kernel oops/panics — producing sustained or persistent loss of availability. NVD explicitly flags high impact to confidentiality, integrity, and availability for this CVE.
Who is affected
- Systems running Linux kernels that include the upstream drm/amdkfd code and a supporting AMD GPU are the primary population at risk.
- The bug requires the KFD code path to be present; systems without AMD GPUs, or with KFD disabled/blacklisted at boot, will not be vulnerable in practice.
- Multiple kernel series were patched; vulnerability tracking lists show the issue introduced in various release windows and fixed in specific stable kernel releases (for example fixes landed in stable trees with backports for the 5.4, 5.10, 5.15, 6.1 and newer stable series). OSV and distribution trackers list the affected ranges and the fixed versions.
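A quick way to judge whether a host falls into the affected population is to check the two preconditions named above: a page size larger than 4 KB and a present KFD device. The following is a minimal sketch under those assumptions; `/dev/kfd` is assumed to be the KFD device node, the helper names are hypothetical, and the result says nothing about whether the running kernel already carries the fix.

```python
import os


def page_size_bytes() -> int:
    """System page size as reported by the C library."""
    return os.sysconf("SC_PAGE_SIZE")


def kfd_node_present(path: str = "/dev/kfd") -> bool:
    """True if the (assumed) KFD device node exists on this host."""
    return os.path.exists(path)


def preconditions_met(page_size: int, kfd_present: bool) -> bool:
    """Both conditions for the oversized mapping: KFD present, pages > 4 KB."""
    return kfd_present and page_size > 4096
```

Calling `preconditions_met(page_size_bytes(), kfd_node_present())` on a host returns `False` for the common 4 KB page configuration, which is why the exposure is limited to large-page systems.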
Patch status and distribution guidance
Upstream kernel maintainers merged one or more corrective commits into the stable trees; distributions produced updates for the major kernel branches and backported the fix into their long‑term support kernels. Public trackers and advisories list the fixed kernels and the vendor packages you should install. Representative vendor actions include:
- Debian LTS published DLA-3912‑1 (linux 5.10.226‑1) which includes a fix that closes CVE‑2024‑41011 as part of a larger kernel security update. If you run Debian LTS kernels, apply the DLA update. (lists.debian.org)
- Amazon Linux (ALAS) issued ALASKERNEL/ALAS advisories and published fixed kernel packages for both 5.4 and 5.10 kernel tracks.
- SUSE and Ubuntu security trackers list the CVE and identify the patched package versions for their distributions.
- The OSV database and CVE trackers list the exact kernel commits and fixed stable versions (for example, fixes were recorded as part of the 5.4.283, 5.10.225, 5.15.166 and 6.1.91 stable updates and equivalent backports). Systems using those fixed or later snapshots are no longer vulnerable in the kernel code path.
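For kernels built directly from upstream stable releases, the versions listed above can be compared against the running release string. This is only a rough sketch: the table copies the example versions from this article, the function names are hypothetical, and distribution kernels often backport fixes without changing the upstream version number (or report an ABI string instead), so the vendor advisory remains the authoritative check.

```python
import re
from typing import Optional, Tuple

# Example fixed versions as quoted in this article, keyed by stable series.
FIXED_IN = {
    (5, 4): (5, 4, 283),
    (5, 10): (5, 10, 225),
    (5, 15): (5, 15, 166),
    (6, 1): (6, 1, 91),
}


def parse_release(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '5.10.226-1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2)), int(match.group(3) or 0)


def at_or_after_listed_fix(release: str) -> Optional[bool]:
    """True/False for a listed series, None if the series is not in the table."""
    version = parse_release(release)
    fixed = FIXED_IN.get(version[:2])
    if fixed is None:
        return None
    return version >= fixed
```

A `None` result means the series is not covered by the table, not that the kernel is safe; consult the distribution tracker in that case.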
Practical mitigations and temporary workarounds
Operators can take three practical steps, in order of preference; the second and third are stopgaps for hosts that cannot receive a vendor kernel update quickly:
- Patch as soon as possible. The only complete mitigation is to run a kernel version that includes the upstream fix or a vendor backport in your kernel package. Distributions have issued fixes; apply them. (lists.debian.org)
- Limit or remove access to the GPU. If patching is delayed, restrict who can access /dev/dri and GPU devices:
- Remove untrusted users from groups that control GPU devices (commonly the “video” group).
- Revoke device access in container runtimes (do not bind /dev/dri into untrusted containers).
- Disable passthrough of the GPU to guest VMs that host untrusted workloads.
- Disable or blacklist KFD/amdgpu modules where feasible. Blacklisting the amdkfd or related modules will prevent the vulnerable code path from loading, at the cost of disabling GPU compute features (and in some cases display). For ephemeral test systems or hosts that do not require GPU acceleration, this may be an acceptable temporary mitigation. Note: blacklisting can be disruptive — know the operational consequences before applying.
Additional hardening while a host remains unpatched:
- Enforce strict host isolation for GPU‑enabled hosts (no untrusted user logins).
- Avoid allowing non‑privileged users to mmap or otherwise control GPU MMIO mappings.
- Monitor audit logs for suspicious access to GPU device nodes and abnormal kernel oops or reset events.
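To support the access restrictions above, it helps to see which GPU device nodes are reachable by arbitrary local users. The sketch below assumes the nodes live under `/dev/dri` and at `/dev/kfd`; the helper names are hypothetical, and group membership still has to be reviewed separately.

```python
import glob
import os
import stat


def world_accessible(mode: int) -> bool:
    """True if 'other' users have read or write permission."""
    return bool(mode & (stat.S_IROTH | stat.S_IWOTH))


def audit_gpu_nodes(paths=None):
    """Return (path, permission string, gid, world_accessible) per device node."""
    if paths is None:
        # Assumed locations of the DRM and KFD device nodes.
        paths = sorted(glob.glob("/dev/dri/*")) + ["/dev/kfd"]
    findings = []
    for path in paths:
        try:
            info = os.stat(path)
        except OSError:
            continue  # node absent or not readable by this user
        if not stat.S_ISCHR(info.st_mode):
            continue  # skip directories such as /dev/dri/by-path
        findings.append(
            (path, stat.filemode(info.st_mode), info.st_gid,
             world_accessible(info.st_mode))
        )
    return findings
```

Running `audit_gpu_nodes()` on a GPU host lists each character device with its mode and owning group; entries flagged as world-accessible are the first candidates for tightening.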
Exploitability, scoring variance, and what the numbers mean
Public trackers show some variance in how the vulnerability is scored:
- NVD reports a high score and lists High impact across confidentiality, integrity and availability (CVSS v3.1 vector with local attack vector and low privileges required).
- Some vendors and advisories present different numeric scores or contextual severities (for instance Oracle’s internal CVE listing showed a comparatively lower impact in their advisory text, while other vendor trackers show medium). These differences reflect divergent scoring assumptions — such as whether the attacker needs credentials to access the device node, whether a target system actually exposes KFD to unprivileged processes, and how likely an attacker could reliably exploit register reads/writes to cause meaningful disclosure or privilege elevation.
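To see how such assumptions move the number, the CVSS v3.1 base score can be recomputed by hand. The sketch below implements the published base-score equations for the scope-unchanged case only. The metrics stated in this article are a local attack vector, low attack complexity, low privileges required, and High confidentiality, integrity and availability impact; user interaction of None and unchanged scope are assumptions made here for illustration.

```python
import math

# CVSS v3.1 metric weights (privileges-required values are for unchanged scope).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR = {"N": 0.85, "L": 0.62, "H": 0.27}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(value: float) -> float:
    """CVSS v3.1 Roundup: smallest one-decimal number >= value."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged(av, ac, pr, ui, c, i, a) -> float:
    """Base score for Scope: Unchanged vectors."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))
```

Under these assumptions, `base_score_scope_unchanged("L", "L", "L", "N", "H", "H", "H")` yields 7.8, while raising privileges required to High yields 6.7. That illustrates how a single scoring assumption can shift a rating from High to Medium; it is not a reconstruction of any particular vendor's score.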
Detection and forensics
There is no widely‑published public proof‑of‑concept exploiting this exact mapping error; public CVE and vendor notes do not report in‑the‑wild exploitation as of their advisories. That said, administrators should treat suspicious behavior around GPU device nodes and GPU resets as higher priority.
Look for these indicators:
- Unexplained kernel oops, hung GPU driver messages, or device resets in kernel logs (dmesg/syslog) that coincide with processes that access /dev/dri or KFD interfaces.
- Unexpected mmap() operations performed on GPU device character devices by userland processes (audit or system call tracing can reveal these).
- Processes that create or manipulate GPU compute queues (for example, via compute runtimes) without a clear business need.
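The first indicator in the list above can be triaged with a simple log filter. This is a sketch only: the regular expressions are illustrative guesses at what reset, timeout, and oops messages look like, not an authoritative list of amdgpu or KFD log strings, and they should be tuned against real logs from your fleet.

```python
import re

# Illustrative patterns; adjust to the messages your kernels actually emit.
SUSPICIOUS_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"amdgpu.*gpu reset",
        r"amdgpu.*timeout",
        r"kfd.*(error|fail)",
        r"\boops\b",
    )
]


def flag_gpu_log_lines(lines):
    """Return the log lines that match any of the suspicious patterns."""
    flagged = []
    for line in lines:
        text = line.rstrip("\n")
        if any(pattern.search(text) for pattern in SUSPICIOUS_PATTERNS):
            flagged.append(text)
    return flagged
```

Feeding it the output of `dmesg` or a syslog file gives a short list of lines to correlate with the processes that had GPU device nodes open at the time.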
Why cloud and vendor attestations matter here
Large cloud providers and enterprise vendors map upstream CVEs to their product inventories in different ways. Microsoft’s Security Response Center (MSRC) uses product‑scoped attestations such as “Azure Linux includes this open‑source library and is therefore potentially affected” to indicate their inventory findings for a given product image; this is useful as an operational attestation but doesn’t automatically translate to all Microsoft products or images running the same upstream code. The same caution applies for other vendors’ attestations: treat product‑level statements as scope‑limited and perform host‑level checks in your environment. (lists.debian.org)
Cloud operators should be particularly attentive: GPU‑enabled cloud instances and VM hosts that allow customer payloads to access GPU devices are higher‑risk. Ensure your cloud images receive the vendor kernel updates or supply patched kernels in your VM templates. Many cloud Linux images have already been updated in vendor channels; check the cloud provider’s security bulletins and the distribution CVE trackers for the exact image/kernel dates.
Step‑by‑step remediation checklist (recommended order)
- Inventory: identify all hosts with AMD GPUs or kernels that include the amdkfd module.
- Use lsmod and /proc/devices to confirm KFD/amdgpu loading.
- Patch: apply vendor/kernel updates that include the CVE fix (use your distribution’s kernel packages or merge upstream stable commits into custom kernels). Confirm package versions against the vendor advisory.
- Restrict: until patched, remove untrusted user access to GPU device nodes and containers; blacklist KFD/amdkfd if you can afford to disable GPU compute.
- Monitor: enable enhanced logging around GPU device nodes, watch for kernel resets and unexpected mmap activity.
- Harden: for shared hosts, enforce stricter multi‑tenant isolation for GPUs (prevent untrusted tenant access to /dev/dri, avoid direct GPU passthrough where possible).
- Review: after applying patches, validate that the amdkfd mapping behavior no longer exposes PAGE_SIZE beyond the intended 4 KB region (regression tests in staging are recommended for high‑risk hosts). (lists.debian.org)
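The inventory step in the checklist above can be scripted by parsing the same data that lsmod and /proc/devices present. The sketch below assumes that `amdgpu` appears as a module name in `/proc/modules` and that KFD registers a character device named `kfd`; both names, and the helper functions, should be verified against your own kernels.

```python
def loaded_modules(proc_modules_text: str) -> set:
    """Module names from the contents of /proc/modules (first field per line)."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def char_devices(proc_devices_text: str) -> set:
    """Device names from the 'Character devices:' section of /proc/devices."""
    names = set()
    in_char_section = False
    for line in proc_devices_text.splitlines():
        stripped = line.strip()
        if stripped == "Character devices:":
            in_char_section = True
        elif stripped == "Block devices:":
            in_char_section = False
        elif in_char_section and stripped:
            parts = stripped.split(None, 1)
            if len(parts) == 2:
                names.add(parts[1])
    return names


def kfd_indicators(proc_modules_text: str, proc_devices_text: str) -> dict:
    """Summarise whether the assumed amdgpu module and kfd chardev are present."""
    return {
        "amdgpu_loaded": "amdgpu" in loaded_modules(proc_modules_text),
        "kfd_chardev": "kfd" in char_devices(proc_devices_text),
    }
```

On a live host, pass in the text read from `/proc/modules` and `/proc/devices`; collecting the resulting dictionaries across the fleet gives the host list for the patch and restrict steps.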
Strengths of the upstream fix and remaining risks
- The upstream remediation is focused and surgical: the stable tree patches prevent large‑page mappings or otherwise ensure the mapping logic computes the correct offset and size, so the HDP flush register window is confined to the intended 4 KB range.
- Distribution and vendor backports are available for major LTS and stable kernel branches; many mainstream distributions have already released updates.
- Distribution lag and custom kernels: operators who run vendor‑modified kernels, heavily patched custom kernels, or uncommon LTS branches must verify the patch is present; a missing backport leaves the host vulnerable.
- Operational impact of mitigations: blacklisting modules or disabling GPU access is disruptive for compute workloads; operators must balance security vs. availability and schedule patch windows accordingly.
- Attack surface in shared GPU environments: cloud providers and multitenant GPU hosts must prioritize remediation because the downside of MMIO exposure in a multi‑tenant environment is higher than on single‑user developer machines.
Final assessment and actionable guidance
CVE‑2024‑41011 results from incorrect MMIO mapping granularity in drm/amdkfd on systems that use page sizes larger than 4 KB, with potential confidentiality, integrity, and availability impact. The upstream fix is small and available; major distributions and cloud vendors published patched kernels and advisories. The vulnerability is local in nature: exploitation requires access to GPU device paths or userland code capable of mapping GPU MMIO. That local requirement does not make the issue minor in practice, because on shared GPU hosts the local vector is both realistic and operationally dangerous.
Action items for administrators (short list):
- Inventory your fleet for AMD GPUs and kernels that include amdkfd.
- Apply vendor kernel updates that contain the CVE fix as a priority.
- Until patched, restrict GPU device access and remove untrusted users from groups that control GPU nodes; consider temporary module blacklisting if you can accept losing GPU compute.
- Monitor GPU logs and audit for suspicious mapping or reset activity.
Conclusion: CVE‑2024‑41011 is fixed upstream and in distribution trees, but unpatched AMD GPU hosts remain at practical risk; patching and sensible device‑access restrictions are the reliable, immediate defenses.
Source: MSRC Security Update Guide - Microsoft Security Response Center