The Linux kernel security record for CVE-2022-50303 closes a small but consequential race-and-error path in the AMD GPU stack: a double release of a compute PASID (process address space identifier) in the drm/amdkfd code that can produce deterministic kernel oopses and sustained denial-of-service on affected systems. The upstream remedy is narrowly scoped — a helper to ensure the VM’s PASID is set correctly at the last step of VM initialization — and distribution advisories and vulnerability trackers confirm the fix; operators should treat this as an availability-focused kernel robustness issue and apply vendor kernel updates or backports promptly.
Background / Overview
Linux’s AMD GPU driver stack is split across multiple components: the amdgpu DRM driver, the Kernel Fusion Driver (KFD) used by compute stacks, and the DRM core. On systems that convert a graphics VM into a compute VM, the driver sometimes hands PASIDs over to compute code paths that manage those identifiers independently. A PASID identifies a process to the GPU for context and fault handling; mismanaging PASIDs risks freeing the same identifier twice or dereferencing driver state that has already been torn down.
CVE-2022-50303 describes exactly this scenario: when kfd_process_device_init_vm fails after a VM has been converted to compute mode and vm->pasid has been set to the compute PASID, KFD does not take the expected reference on pdd->drm_file. Without that reference, the DRM close file handler can run and release the compute PASID before the KFD process-destroy worker releases the same PASID again. The result is a double release that produces a WARNING backtrace and, in some cases, a NULL pointer dereference and kernel oops. The public advisories summarize the stack trace path and show the kernel frames that can appear during the fault (ida_free / amdgpu_pasid_free_delayed / amdgpu_driver_postclose_kms / drm_close_helper / drm_release / __fput → task_work_run / do_exit).
The upstream corrective step was to add a small helper (amdgpu_amdkfd_gpuvm_set_vm_pasid) and call it at the final step of kfd_process_device_init_vm, so that vm->pasid holds the original PASID if acquiring the VM failed and the compute PASID only if it succeeded. That change prevents a redundant free of the same PASID by keeping driver ownership and reference-counting semantics consistent across error and teardown paths. Multiple vendor trackers and vulnerability mirrors documented the same technical narrative and recommend installing kernel updates that include the upstream commit.
Technical anatomy — what actually went wrong
The PASID ownership contract
- PASIDs are allocated and tracked by the amdgpu driver through a kernel ID allocator (the ida_free frame in the advisory trace points to the IDA API).
- When a VM becomes a compute VM, the driver may set vm->pasid to a compute-managed value and adjust ownership and reference semantics.
- The amdgpu and KFD code paths must coordinate so only one subsystem is responsible for releasing that PASID and the associated bookkeeping.
Reproducer symptoms and kernel traces
Advisories include an example trace that operators will recognize:
- Kernel log lines such as: “amdgpu: Failed to create process VM object” and “ida_free called for id=32770 which is not allocated.”
- Oops/stack frames rooted in ida_free, followed by amdgpu_pasid_free_delayed, amdgpu_driver_postclose_kms, drm_file_free, drm_close_helper, drm_release and core file descriptor teardown frames (fput/__fput/task_work_run/do_exit).
- The practical result: a deterministic driver crash and a host-level kernel oops in some configurations.
Who is affected
- Systems running Linux kernels whose upstream/stable trees predate the amdgpu/KFD fix and that have amdgpu / KFD enabled and loaded are in scope.
- Desktop and workstation machines that expose DRM devices (e.g., /dev/dri/*) to user processes or compositor helpers may be able to reach the ioctl and VM-init code paths from unprivileged contexts in some distributions.
- Multi-tenant servers, CI runners, or container platforms that purposely expose GPUs to less-trusted workloads (device passthrough, --device mounts) are higher-risk because a local unprivileged or containerized process could invoke the DRM/KFD paths that trigger the bug.
- The highest long-tail risk is vendor-supplied or embedded kernels (Android/SoC OEM trees, appliances) where backporting is slow or absent; those devices may remain vulnerable long after desktop distributions publish fixes.
Verification and cross-checks
Multiple independent, trusted sources document the issue and the corrective intent:
- The NVD description matches the technical summary and provides the same stack trace indicators and error narrative.
- Distribution advisories (Ubuntu, SUSE) and package trackers list CVE-2022-50303 and describe the same fix: adding the helper and setting vm->pasid at the last step of the init path.
- OSV / Debian and other vulnerability mirrors reference stable kernel commits and link into the kernel stable repository for the actual diffs; maintainers and distribution packagers are using those commits when producing vendor kernel updates. The upstream commit references and stable-tree merges are the authoritative patch artifacts.
Impact and exploitability
- Attack vector: local. The code paths are exercised during VM creation and teardown, which the kernel performs in response to user- or compositor-driven workloads. On many consumer systems, unprivileged processes can open DRM devices and reach relevant ioctl paths; on hardened servers, access may be restricted.
- Complexity: low-to-medium. The condition is deterministic in the specific error scenario described: a VM initialization that fails after compute conversion, without the reference on pdd->drm_file being taken. A reproducible DoS is feasible wherever the ioctl or VM flow is reachable.
- Impact: high for availability on affected hosts. Kernel oopses, driver crashes, and host instability are the concrete outcomes: displays fail, compositors and user sessions drop, and in worst-case scenarios hosts require reboots. That is especially serious for multi-tenant cloud hosts or continuous-integration runners where uptime matters.
Detection and hunting
Monitor kernel logs and crash telemetry for the canonical signatures of the double-release path:
- Search dmesg/journalctl for "ida_free called for id=" and for call traces that include amdgpu_pasid_free_delayed, amdgpu_driver_postclose_kms, drm_file_free, drm_close_helper, drm_release and __fput/task_work_run/do_exit. Those traces are the canonical indicator that a PASID was freed after it had already been released.
- Create SIEM/aggregator rules that flag:
  - Kernel oops traces containing “amdgpu_pasid_free_delayed” or “ida_free”.
  - Repeated DRM close or compositor crashes (Xwayland/Wayland) that correlate with GPU workloads.
  - Sudden watchdog timeouts or host reboots that coincide with GPU-intensive workloads.
- Triage steps after detection:
  - Preserve dmesg and serial-console logs immediately.
  - Identify the process that invoked the ioctl (call traces often include the userland command, e.g., Xwayland or a compute runtime).
  - Reproduce in an isolated test environment with the same kernel build if feasible to confirm whether the host is unpatched.
  - If confirmed, prioritize rollback of untrusted device access and schedule a patch/reboot window.
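As a starting point for such a rule, the signature search can be sketched in Python. The function name scan_kernel_log and the shape of its result are assumptions made for illustration; the message text and frame names are the ones quoted in the advisories. The matching is intentionally loose and should be tuned to the local log format.

```python
import re

# Message and frame names as quoted in the public advisories.
IDA_FREE_RE = re.compile(r"ida_free called for id=(\d+) which is not allocated")
TRACE_FRAMES = (
    "amdgpu_pasid_free_delayed",
    "amdgpu_driver_postclose_kms",
    "drm_file_free",
    "drm_close_helper",
    "drm_release",
)


def scan_kernel_log(text):
    """Report ida_free warnings and which of the known trace frames appear in text."""
    ids = [int(m.group(1)) for m in IDA_FREE_RE.finditer(text)]
    frames = [frame for frame in TRACE_FRAMES if frame in text]
    return {
        "pasid_ids": ids,
        "frames": frames,
        # Flag only when the allocator warning and the amdgpu free path co-occur.
        "suspect": bool(ids) and "amdgpu_pasid_free_delayed" in frames,
    }
```

The function takes plain text, so it can be fed the saved output of dmesg or journalctl -k, or called from a log-shipping hook. Requiring both the allocator warning and the amdgpu frame keeps unrelated ida_free warnings from other subsystems out of the alert stream.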
Remediation and mitigations
The definitive remedy is to install vendor-supplied kernel packages or upstream stable-tree patches that include the amdgpu/KFD correction and then reboot into the updated kernel.
Immediate steps (ordered):
- Inventory and exposure assessment:
  - Find hosts that load AMD GPU modules: lsmod | grep amdgpu.
  - List DRM device nodes and permissions: ls -l /dev/dri/*.
  - Identify containers/VMs that mount /dev/dri or run with privileged GPU passthrough.
- Check vendor package trackers:
  - Use your distribution’s security tracker or package changelog to confirm a kernel update addressing CVE-2022-50303 is available for your release. Ubuntu, SUSE, Debian/OSV entries list package states and advisories.
- Apply the kernel update and reboot:
  - Install the fixed kernel package from your vendor’s repositories or build a custom kernel that cherry-picks the stable commit(s) referenced by the CVE.
  - Reboot the host to activate the patched kernel and validate with representative GPU workloads.
- Compensating controls if patching is delayed:
  - Restrict access to DRM device nodes (udev rules to bind to a trusted group).
  - Remove /dev/dri from untrusted containers; avoid --device=/dev/dri entries.
  - Harden container capabilities (drop CAP_SYS_ADMIN and other elevated capabilities) and block loading of kernel modules by untrusted users.
  - Monitor kernel logs aggressively for the signature traces listed above.
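The inventory step can be scripted for fleet-wide collection. The sketch below is one possible approach, assuming a Linux host where /proc/modules and /dev/dri are readable; the function names are hypothetical. It checks only classic permission bits, so ACLs and container device rules still need a separate review.

```python
import stat
from pathlib import Path


def amdgpu_loaded(modules_text):
    """True if a /proc/modules-style listing contains a module named amdgpu."""
    return any(line.split()[0] == "amdgpu"
               for line in modules_text.splitlines() if line.strip())


def world_accessible(mode):
    """True if 'other' users have read or write permission in this mode."""
    return bool(mode & (stat.S_IROTH | stat.S_IWOTH))


def dri_nodes(dri_dir="/dev/dri"):
    """Return (path, octal mode, gid, world_accessible) for each character device."""
    root = Path(dri_dir)
    if not root.is_dir():
        return []
    rows = []
    for node in sorted(root.iterdir()):
        st = node.stat()
        if stat.S_ISCHR(st.st_mode):
            perms = stat.S_IMODE(st.st_mode)
            rows.append((str(node), oct(perms), st.st_gid, world_accessible(perms)))
    return rows


if __name__ == "__main__":
    try:
        modules = Path("/proc/modules").read_text()
    except OSError:
        modules = ""  # not a Linux host, or /proc is unavailable
    print("amdgpu loaded:", amdgpu_loaded(modules))
    for row in dri_nodes():
        print(row)
```

Hosts where amdgpu is loaded and any node is reported as world-accessible are the natural first candidates for the compensating controls listed above.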
Post-patch verification:
- Confirm the kernel package changelog includes the CVE or the upstream commit ID.
- Run the previously failing workload in a staging environment for 48–72 hours to ensure no repeat oopses occur.
- If a vendor-supplied detection script exists, prefer that rather than attempting to reproduce the oops with exploit code.
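The changelog check in the first verification item is easy to automate once the changelog text has been exported (for example from the package manager's changelog query). The helper below is a minimal sketch: changelog_mentions_fix is a hypothetical name, and the upstream commit IDs are not given here, so they must be supplied by the operator from the vendor advisory. It also normalizes typographic hyphens, because CVE identifiers copied from web pages sometimes contain non-breaking hyphens that defeat a plain string search.

```python
import re

# Typographic hyphen variants that sometimes replace "-" in copied CVE identifiers.
_HYPHENS = re.compile("[\u2010\u2011\u2012\u2013\u2212]")


def changelog_mentions_fix(changelog, cve_id="CVE-2022-50303", commit_ids=()):
    """True if the changelog text names the CVE or one of the given commit IDs."""
    text = _HYPHENS.sub("-", changelog).lower()
    if cve_id.lower() in text:
        return True
    # Ignore very short prefixes, which could match unrelated hex strings.
    return any(c.lower() in text for c in commit_ids if len(c) >= 12)
```

A False result does not prove the kernel is vulnerable, since some vendors fold fixes into rebases without naming each CVE; treat it as a prompt to check the vendor's security tracker.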
Critical analysis — strengths of the fix and residual risks
Strengths
- The upstream change is intentionally surgical: introducing a small helper and invoking it at the final step of VM initialization is a conservative, minimal-risk patch that preserves behavior on the normal (successful) path while closing the error path.
- That surgical footprint makes the change easy for distributions and vendors to backport into stable kernels without extensive refactoring or regression risk.
- Multiple independent trackers and vendor advisories corroborate the same diagnosis and fix narrative, which increases confidence that the remediation is correct and widely deployable.
Residual and systemic risks
- Vendor lag remains the primary operational blind spot. Embedded devices, Android kernels, and OEM images frequently lag upstream and can remain vulnerable long after desktop distributions publish fixes. Those long-tail devices are the highest practical exposure.
- Misconfiguration exposure: systems that intentionally expose DRM device nodes to untrusted users, or run privileged containers with device passthrough, are much more exploitable than hardened servers. Attackers often seek such misconfigurations.
- Detection blind spots: kernel-level availability failures tend to be local and operational; organizations that do not capture kernel ring logs, serial console output, or retain oops traces may miss incidents.
- Attack chaining: while CVE-2022-50303 is a DoS primitive, availability failures can be used opportunistically in multi-stage attacks (e.g., to disrupt monitoring, force failovers, or mask other activity). Treat DoS-capable flaws with urgency if your environment is high-value or multi-tenant.
Practical checklist for administrators
- Inventory: locate all hosts with amdgpu modules loaded and note whether /dev/dri nodes are accessible to untrusted users or containers. (lsmod | grep amdgpu; ls -l /dev/dri).
- Patch: install your distribution’s kernel security update that references CVE-2022-50303 or the upstream commit and reboot.
- Restrict: add udev rules to bind /dev/dri to a trusted administrative group; remove device mounts from untrusted containers.
- Monitor: add SIEM rules for kernel traces with keywords amdgpu_pasid_free_delayed, ida_free, and drm_release.
- Vendor follow-up: for vendor-supplied devices, open support tickets and request backports where auto-updates are not available.
- Test: validate the fix in a pilot ring with GPU workloads for 48–72 hours before wide deployment.
- Document: capture pre- and post-patch kernel logs and changelog evidence showing the presence of the stable commit or CVE marker in kernel package metadata.
Final assessment and conclusion
CVE-2022-50303 is a classic kernel robustness fix: it eliminates a deterministic kernel crash stemming from a double free / double release of a PASID by ensuring ownership rules are applied consistently even when VM initialization fails. The upstream remedy is small and defensively coded, the kind of micro-patch kernel maintainers prefer because it minimizes regression risk and is straightforward to backport. Multiple independent vendors and vulnerability trackers document the same technical narrative and recommend upgrading to patched kernel packages. Operationally, treat this as a high-priority availability risk for multi-tenant infrastructure, CI runners, and systems that expose DRM device nodes to moderately trusted or untrusted workloads. For most desktop and workstation fleets, the fix is low-risk and should be rolled into routine kernel updates; for embedded and vendor-kernel fleets, escalate and demand a backport or updated image.
The core takeaways for defenders are straightforward: inventory your AMD GPU exposure, install patched kernels (or rebuild with the stable commit), restrict /dev/dri exposure to trusted groups, and ensure kernel oops telemetry is captured and monitored. The fix itself is solid; the larger problem is the ecosystem gap: long-tail embedded images and misconfigured container/device passthrough policies remain the practical sources of continued exposure.
Conclusion: apply kernel updates that include the amdgpu/KFD commit for CVE-2022-50303 as soon as practical, prioritize multi-tenant and GPU-exposed hosts, and harden device exposure for untrusted processes while you complete rollout and verification.
Source: MSRC Security Update Guide - Microsoft Security Response Center