CVE-2025-37854: Azure Linux amdkfd GPU Kernel Race and Patch Insight

Microsoft’s brief MSRC attestation that “Azure Linux includes this open‑source library and is therefore potentially affected” is an important, actionable inventory statement for Azure Linux customers, but it is not a categorical guarantee that no other Microsoft product contains the same vulnerable Linux code.

Overview

On May 9, 2025 the vulnerability tracked as CVE‑2025‑37854 was published: a use‑after‑free race in the Linux kernel’s AMD Kernel Fusion Driver (KFD), the drm/amdkfd component, tied to the mode1 reset recovery path. The vulnerability can lead to driver crashes and corrupted kernel data structures when KFD attempts to recover a GPU while user‑space queues and the KFD cleanup worker race over the same system memory. The NVD entry summarizes the defect and the intended fix.
The upstream fix was accepted into the stable kernel trees and distributed by maintainers: the patch series and the key commit that “Fix mode1 reset crash issue” are documented in the AMD/DRM mailing list and the Linux stable patch autosend (the fixes pull); the patch notes explain the exact race and the remedy (terminate user queues, flush reset_domain workqueues to wait for ongoing resets to complete, then free outstanding buffer objects).
Why this matters: KFD (amdkfd) is the kernel driver that manages AMD GPU compute contexts (HSA/ROCm workloads). It is built into and distributed with many Linux kernels used by distributions, cloud images, and vendor kernels. Because the kernel source for drm/amdkfd is shared upstream, any downstream kernel build that includes the same upstream code and configuration can inherit this bug until the fix is applied.

Technical deep dive: what the bug actually does

The sequence that leads to the race​

  • A hardware scheduler (HW scheduler) hang occurs on an AMD GPU that uses the KFD compute stack.
  • Operators or the driver trigger a mode1 reset to recover the GPU. Mode1 reset is a specific GPU reset path used during recovery in some AMD driver code paths.
  • KFD signals user space to abort the affected GPU processes.
  • After the user processes abort, user queues (jobs submitted by user space) can still touch system memory mapped to GPU buffers before the hardware reset fully completes.
  • Meanwhile, the KFD cleanup worker frees system memory and VRAM (buffer objects, or BOs) that it believes are no longer in use.
  • The race: freed memory can be re‑allocated by KFD for new objects while user queues are still able to write to the original addresses — producing a use‑after‑free and potential corruption of kernel data structures, leading to driver crashes or kernel instability.
This scenario and the upstream fix are described explicitly in the commit messages and the fixes pull request that was merged into the stable trees in early April 2025; the fix changes the cleanup order and forces a flush of reset_domain workqueues so outstanding resets complete before freeing BOs.

Practical implications of the bug​

  • Attack surface: local, low‑privilege operations that can submit GPU workloads or force GPU resets. This is principally an availability/stability issue rather than a direct, unauthenticated RCE vector.
  • High‑risk environments: multi‑tenant GPU hosts, GPU‑accelerated cloud nodes, AI/ML clusters, virtualized GPU passthrough and SR‑IOV deployments — places where GPUs are actively shared or reset frequently.
  • Exploitability: no widespread public exploit code was reported at publication; patches were accepted upstream and distributed by vendors/distributions. Several vendor trackers and OS security advisories list the CVE and provide patched package versions.

Microsoft’s attestation: what the words actually mean​

When Microsoft’s Security Response Center (MSRC) writes “Azure Linux includes this open‑source library and is therefore potentially affected,” that statement should be parsed precisely:
  • Authoritative for Azure Linux: The sentence is an inventory attestation. It means Microsoft inspected its Azure Linux distribution/kernel images and found the upstream component in those builds; therefore Azure Linux images are in scope for remediation. Treat that assertion as the authoritative signal for Azure Linux customers.
  • Not exclusivity: The wording is not an assertion that Azure Linux is the only Microsoft product that could possibly contain the vulnerable code. Microsoft has repeatedly used this product‑scoped phrasing for multiple kernel CVEs and has said it will update CVE/VEX entries if more Microsoft artifacts are found to be carriers. That operational practice and the public commitment to publish CSAF/VEX attestations (Microsoft started publishing machine‑readable VEX/CSAF artifacts in October 2025) underpin the short MSRC wording.
In short: Microsoft’s attestation is an important inventory result for Azure Linux — but absence of an attestation for another Microsoft product is not proof of absence of the vulnerable code in that product.

Is Azure Linux the only Microsoft product that could be affected?​

Short answer: No — but with a crucial operational nuance.
  • Microsoft has publicly attested that Azure Linux (the distro/kernel lineage Microsoft maintains for Azure) includes the vulnerable drm/amdkfd component for CVE‑2025‑37854 and is therefore potentially affected. That statement is actionable for any tenant running Azure Linux images.
  • However, the same upstream kernel source (drivers/gpu/drm/amd/amdkfd/*) is used widely. Any Microsoft product or artifact that ships a kernel build compiled with the amdkfd component (either built‑in or as a module) could, in principle, carry the vulnerable code until Microsoft inventories these artifacts and publishes an updated VEX/CVE mapping. Microsoft has explicitly committed to update CVE records if additional products are identified as impacted.
  • Practically speaking, additional Microsoft artifacts that might carry the code include, but are not limited to:
      • Microsoft‑maintained WSL2 kernel builds (if those builds include the amdkfd driver),
      • Azure Marketplace VM images and custom marketplace kernels,
      • Managed node images or operating-system images used by Azure Kubernetes Service (AKS) or other managed hosting services,
      • Specialized Azure appliance images (for example, any Microsoft images for GPU‑accelerated hosts, bare‑metal/VM marketplace images, or edge appliances that include an upstream kernel with amdkfd compiled in).
None of the items above should be asserted as affected without an MSRC attestation or an independent inventory check; the point is operational possibility — these are the types of artifacts to check. The repeated advisory pattern Microsoft uses makes this distinction plain: they name the product they have checked and say they will update the record when they find more.

How to verify whether your Microsoft-provided images are affected​

If you run Microsoft images or manage Azure tenants, follow these steps to confirm whether a particular image includes the vulnerable amdkfd code and therefore needs patching.

1. Check MSRC VEX/CSAF mapping for the CVE​

  • Microsoft publishes machine‑readable VEX/CSAF attestations for many CVEs; when an attestation exists it identifies the specific product SKUs Microsoft has verified as carriers. If you see Azure Linux listed, that image is in scope; if other Microsoft SKUs are not listed, treat that as “not yet verified” rather than “not affected.” Microsoft has stated it will update the mapping if additional products are identified.
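
As a rough illustration of that check, the sketch below prints the product IDs listed under each status for one CVE in a CSAF 2.0 VEX document. It assumes jq is installed and that the document has already been saved locally; the function name and the file name in the usage note are placeholders, not Microsoft tooling.

```shell
# Sketch: list product IDs per status for one CVE in a CSAF 2.0 VEX file.
# Assumes jq is available; the VEX file must already exist locally.
vex_status() {
  local file="$1" cve="$2"
  jq -r --arg cve "$cve" '
    .vulnerabilities[]?
    | select(.cve == $cve)
    | (.product_status // {})
    | to_entries[]
    | "\(.key): \(.value | join(", "))"
  ' "$file"
}
```

A call might look like vex_status ./msrc-vex.json CVE-2025-37854, where the file name is hypothetical. The output contains product IDs rather than names; the IDs are resolved through the product_tree section of the same document. A product that appears under no status at all is unverified, not cleared.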

2. Inspect the runtime kernel on the target host​

Run these commands on any deployed Linux host (including Azure VMs, WSL2, AKS nodes, etc.) to check for the presence of the KFD code and the relevant kernel config flags. On current mainline kernels the amdkfd code is compiled into the amdgpu module and is enabled by CONFIG_HSA_AMD, so look for amdgpu as well as any standalone kfd module:
  • Check modules in the running kernel and on disk:
      • lsmod | grep -E 'amdgpu|kfd'
      • find /lib/modules/$(uname -r) -type f -path '*/drivers/gpu/drm/amd/*' -print
  • Check the kernel config:
      • zgrep -E 'CONFIG_(HSA_AMD|DRM_AMDGPU)' /proc/config.gz || grep -E 'CONFIG_(HSA_AMD|DRM_AMDGPU)' /boot/config-$(uname -r)
  • Check the running kernel messages for GPU/KFD traces:
      • dmesg | grep -i kfd
      • journalctl -k | grep -i kfd
If the module exists under /lib/modules, or the kernel config has CONFIG_HSA_AMD=y, the running system can load and exercise that code path. Its presence indicates potential exposure until the kernel package is patched to include the upstream fix. (The commands are deliberately generic; adapt them for distribution specifics.)
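
The checks above can be wrapped into small helpers for fleet scripts. This is a minimal sketch that assumes KFD is controlled by CONFIG_HSA_AMD and shipped inside the amdgpu module, as on current mainline kernels; vendor kernels may differ, and both function names are illustrative.

```shell
# kfd_config_state FILE: print enabled, disabled, or unknown for a kernel
# config file (plain text or .gz). Assumes the KFD symbol is CONFIG_HSA_AMD.
kfd_config_state() {
  local cfg="$1" reader="cat" line
  if [ ! -r "$cfg" ]; then
    echo "unknown"
    return 0
  fi
  case "$cfg" in
    *.gz) reader="gzip -cd" ;;
  esac
  # $reader is intentionally unquoted so "gzip -cd" splits into two words.
  line=$($reader "$cfg" | grep -E '^CONFIG_HSA_AMD=' | head -n 1) || true
  case "$line" in
    CONFIG_HSA_AMD=y) echo "enabled" ;;
    *) echo "disabled" ;;
  esac
}

# amdgpu_module_path [MODDIR]: print the first amdgpu module file found
# under MODDIR (default: the running kernel's module directory).
amdgpu_module_path() {
  local moddir="${1:-/lib/modules/$(uname -r)}"
  find "$moddir" -type f -name 'amdgpu.ko*' 2>/dev/null | head -n 1
}
```

Typical calls would be kfd_config_state /proc/config.gz or kfd_config_state /boot/config-$(uname -r). A result of "enabled" only shows that the code is present; whether the build already contains the fix still has to be settled from the package version.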

3. Compare installed kernel package to vendor advisories​

  • Cross‑check the kernel package version against vendor advisories (Ubuntu, Debian, SUSE, Red Hat, and Microsoft where available).
  • OSV, NVD and distribution trackers list fixed stable commits and package versions for CVE‑2025‑37854 and related amdkfd fixes. Use the fixed package versions from your distro vendor as the remediation target.
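
For a quick scripted comparison, a version check along these lines may help. It is a sketch that relies on GNU sort -V ordering, which handles plain dotted versions but not distribution epochs or tilde suffixes; for Debian-style package versions, dpkg --compare-versions is the safer tool. The function name is illustrative and the fixed version must come from your vendor's advisory.

```shell
# version_ge A B: succeed when version A is greater than or equal to B,
# using GNU sort -V ordering (plain dotted versions only).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

Usage would look like version_ge "$(uname -r | cut -d- -f1)" "$FIXED" && echo patched, where FIXED is a placeholder for the fixed upstream version listed by your distribution. Vendor kernels often backport fixes without changing the upstream version number, so the vendor's package-level advisory remains the deciding source.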

4. For WSL2 and Azure-specific images​

  • WSL2 uses a Microsoft‑maintained Linux kernel. If you run WSL2 and suspect you use GPU features (e.g., WSL GPU compute, CUDA/ROCm integrations), verify the WSL kernel build shipped with your Windows version or the custom WSL kernel you installed. If Microsoft has not attested WSL in the VEX/CSAF mapping, it remains an artifact you must inventory locally (see steps above) rather than assume it is unaffected.
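
One way to recognize a stock WSL2 kernel from inside the guest is its release string. The sketch below assumes the "microsoft-standard-WSL2" suffix that Microsoft-built WSL2 kernels carry; a custom WSL kernel may use a different suffix, so a negative result is not conclusive.

```shell
# is_wsl2_kernel RELEASE: succeed when the kernel release string looks like
# a Microsoft-built WSL2 kernel. Custom kernels may not match this pattern.
is_wsl2_kernel() {
  case "$1" in
    *microsoft-standard-WSL2*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Inside a WSL2 distribution, is_wsl2_kernel "$(uname -r)" followed by a grep for CONFIG_HSA_AMD in /proc/config.gz (if that file is present) shows whether the KFD code was compiled into the kernel at all.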

Recommended remediation and mitigation steps​

  • Patch Azure Linux images immediately if you run GPU workloads on them. Microsoft’s attestation for Azure Linux makes it authoritative for that product; apply the updated kernel packages or image updates that include the amdkfd fix.
  • Inventory all Microsoft‑published images and kernel artifacts in your environment:
      • Marketplace images, custom VM images, AKS node images, managed host images, WSL kernels on managed endpoints.
      • For each artifact, check for the presence of amdkfd per the detection commands above.
  • If you run multi‑tenant GPU infrastructure (including GPU VMs or shared compute clouds), treat this CVE as higher priority: schedule reboots where necessary after applying patched kernels and monitor for GPU reset‑related messages that could indicate lingering regressions.
  • If you cannot patch immediately, reduce exposure:
      • Limit untrusted user access to local GPU job submission.
      • Unload or blacklist the driver where feasible on hosts that do not need AMD GPU features. On current mainline kernels KFD is built into the amdgpu module, so this means modprobe -r amdgpu and blacklisting amdgpu until patched, which also disables AMD graphics on that host. Removing a module is not possible if it is built into the kernel image.
      • Use monitoring to detect repeated GPU resets or KFD cleanup errors (dmesg/journalctl alerts).
  • Request VEX/CSAF attestations from Microsoft for any Microsoft image types you cannot fully inventory or which are managed for you. Microsoft has said it will update CVE mappings when additional products are identified; proactively requesting them shortens the time to authoritative status for those images.
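
The monitoring item in the list above can be approximated with a simple counter over kernel log text. This is a sketch only: the pattern is a loose heuristic chosen for illustration, not a list of exact driver messages, and the function name is hypothetical.

```shell
# count_kfd_events: read kernel log text on stdin and print how many lines
# mention KFD or a GPU reset (case-insensitive heuristic match).
count_kfd_events() {
  # grep -c prints 0 and exits non-zero when nothing matches; ignore that.
  grep -Eic 'kfd|gpu reset' || true
}
```

For example, journalctl -k -b --no-pager | count_kfd_events or dmesg | count_kfd_events gives a number that can be compared against a baseline for the host. A rising count after a GPU hang is a cue to look at the log lines themselves.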

Cross-checking the facts: upstream, vendor advisories, and Microsoft practice​

  • Upstream patch and commit: the fixes pull and the individual commit that addresses the mode1 reset crash are documented in the AMD/DRM mailing lists and stable fixes releases (pulls applied in April 2025). The patch text describes the exact lock/order/flush change used to close the race. This provides the primary technical reference for the remediation.
  • Vulnerability cataloging: multiple vulnerability databases and vendor trackers have cataloged CVE‑2025‑37854 and list remediation packages or kernel versions; OSV (osv.dev), NVD, and distribution trackers (Ubuntu, SUSE, Red Hat, Amazon Linux) show the patch and list affected/fixed package details. These independent records corroborate the technical description and the remediation timeline.
  • Microsoft’s public inventory practice: Microsoft’s MSRC product‑level wording — “Azure Linux includes this open‑source library and is therefore potentially affected” — has been the consistent phrasing used across many kernel‑level CVEs and is a product‑scoped attestation, not a proof of exclusivity. Microsoft also committed to publish CSAF/VEX attestations starting October 2025 and to update CVE entries if more products are found affected. The industry (and Microsoft itself) has repeatedly clarified that attested equals authoritative for the named product; lack of attestation equals unverified for others until they are separately checked.

Risk assessment: what administrators should prioritize​

  • Priority 1 (Azure Linux GPU hosts): Microsoft’s attestation makes Azure Linux an explicit remediation target; apply patched kernels or updated images for those nodes.
  • Priority 2 (shared GPU infrastructure and multi‑tenant clusters): patch quickly and schedule reboots; the operational cost of downtime is lower than the risk of intermittent but reproducible crashes or memory corruption on shared hosts.
  • Priority 3 (other Microsoft artifacts such as WSL2, marketplace images, AKS node images): inventory and verify. Do not assume these artifacts are unaffected just because they are not yet listed by Microsoft; treat them as unverified and check using the techniques above. Ask Microsoft for VEX/CSAF attestations where necessary.
  • Priority 4 (end‑user desktops or single‑tenant VMs without AMD GPUs): low priority; the vulnerability requires GPU compute paths and reset activity to trigger. If no AMD GPU or KFD is present, practical risk is minimal.
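
To sort hosts into the lowest-priority bucket, it helps to confirm that no AMD GPU is visible at all. The sketch below scans sysfs for PCI devices with AMD's vendor ID (0x1002) and a display-controller (0x03) or processing-accelerator (0x12) class code; the function name and the optional directory argument are illustrative.

```shell
# has_amd_gpu [SYSFS_PCI_DIR]: succeed if any PCI device has vendor 0x1002
# (AMD) and a display (0x03xxxx) or accelerator (0x12xxxx) class code.
has_amd_gpu() {
  local root="${1:-/sys/bus/pci/devices}" dev vendor class
  for dev in "$root"/*; do
    if [ ! -r "$dev/vendor" ] || [ ! -r "$dev/class" ]; then
      continue
    fi
    read -r vendor < "$dev/vendor"
    read -r class < "$dev/class"
    case "$vendor:$class" in
      0x1002:0x03*|0x1002:0x12*) return 0 ;;
    esac
  done
  return 1
}
```

Running has_amd_gpu || echo "no AMD GPU visible" on a host gives a quick signal. A GPU passed through to a guest later would change the answer, so the check belongs in recurring inventory rather than a one-off audit.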

Strengths and limitations of Microsoft’s attestation approach​

Strengths​

  • Actionable clarity for Azure Linux customers: naming the product gives administrators a clear, immediate remediation target, and publishing machine‑readable VEX/CSAF files increases automation possibilities for large fleets.
  • Commitment to expand mappings: Microsoft’s published commitment to update CVE records when additional products are found to ship the same upstream code is positive and increases transparency over time.

Limitations / Risks​

  • Attestation is product‑scoped, not exhaustive: Microsoft’s phrasing can be misread. Some operators assume “only Azure Linux” when the correct interpretation is “Azure Linux is one Microsoft product we have inventory‑checked.” Until all Microsoft artifacts are inventoried and attested, risk remains for unverified items.
  • Timing gap: there can be a lag between upstream fix acceptance and downstream image rollouts. Many distributions and vendor kernels integrated the patch swiftly; still, the window between publication and full fleet remediation requires active vendor and operator coordination.

Practical checklist for security teams (recommended)

  • Check the MSRC VEX/CSAF mapping for CVE‑2025‑37854 and note which Microsoft SKUs are listed.
  • For every Microsoft image in use, run the kernel inspection checks (lsmod, find in /lib/modules, check config) to detect presence of amdkfd.
  • Patch Azure Linux hosts immediately with kernels that include the amdkfd fix (upstream commit applied in early April 2025; CVE published May 9, 2025).
  • Where patching cannot be immediate, consider unloading or blacklisting the module that provides KFD (amdgpu on current mainline kernels) if AMD GPU features are not required, and restrict user access to GPU submission paths.
  • Request VEX/CSAF attestations from Microsoft for any managed images which are not explicitly mapped. Microsoft has said it will update CVE mappings if impact to additional products is identified.
  • Monitor kernel logs for KFD/AMD GPU reset-related warnings and test GPU reset behavior in a staging environment after you apply updates.
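
For the blacklisting item in the checklist, a modprobe.d fragment is the usual mechanism. The sketch below assumes KFD lives inside the amdgpu module (as on current mainline kernels); the function name and the file name it writes are arbitrary illustrative choices.

```shell
# write_blacklist [DIR]: write a modprobe.d fragment that blacklists amdgpu.
# Assumption: KFD is part of amdgpu; the file name is only an example.
write_blacklist() {
  local dir="${1:-/etc/modprobe.d}"
  printf '%s\n' \
    '# Temporary mitigation for CVE-2025-37854; remove after patching.' \
    'blacklist amdgpu' > "$dir/cve-2025-37854-amdgpu.conf"
}
```

Run as root, write_blacklist writes to /etc/modprobe.d by default. A blacklist entry only stops automatic loading by alias: it does not unload a module that is already loaded, has no effect on a driver built into the kernel image, and may require regenerating the initramfs if the module is included there. It also disables AMD graphics on the host, so it suits only machines that do not need the GPU.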

Conclusion​

CVE‑2025‑37854 is a real upstream Linux kernel defect in the drm/amdkfd (KFD) code that was fixed upstream in April 2025 and published as CVE‑2025‑37854 on May 9, 2025. The technical root cause is a use‑after‑free race during mode1 GPU reset recovery; the upstream patch forces safer ordering and waits for reset completion before freeing outstanding BOs.
Microsoft’s public wording that “Azure Linux includes this open‑source library and is therefore potentially affected” is accurate and authoritative for Azure Linux builds, but it is a product‑scoped inventory statement rather than a technical guarantee of exclusivity. Other Microsoft products or artifacts that ship a Linux kernel build compiled with the amdkfd code could, in principle, carry the same vulnerable code until Microsoft or the operator inventories and attests those artifacts as Not Affected or releases fixes for them. Security teams should act accordingly: patch Azure Linux instances now, inventory other Microsoft images in their environment, request VEX/CSAF attestations where necessary, and use the kernel inspection steps in this article to verify exposure in un‑attested artifacts.
If you manage GPU‑enabled infrastructure, treat this CVE as operationally relevant: it is an availability/stability risk that can have outsized impact in GPU‑dense, multi‑tenant, or long‑running compute environments. The upstream patch is available and distribution vendors have started shipping fixes — the immediate tasks are inventory, patching, and verification.


Source: MSRC Security Update Guide - Microsoft Security Response Center
 
