Linux Kernel Patch Fixes AMD MPC Array Bounds CVE-2024-26914

The Linux kernel received a targeted fix for a subtle but potentially disruptive array‑bounds error in the AMD DRM display code: an incorrect constant was used to size the mpc_combine array, allowing an out‑of‑bounds write when a GPU had more planes than the per‑stream limit implied. The issue, tracked as CVE‑2024‑26914, was corrected by replacing MAX_SURFACES with the hardware‑wide MAX_PLANES for the mpc_combine allocation and has been included in stable kernel trees and distribution security advisories.

Background

What the MPC block does

The MPC block inside AMD display hardware (the kernel's display glossary expands the name as "Multiple pipes and plane combine") blends multiple pixel planes (surfaces) into a composited output, applying per-plane alpha, blending modes, and post-blend color corrections. Kernel documentation and driver code treat MPC combine as a per-ASIC capability tied to the number of hardware planes the silicon supports. The AMDGPU display code and its headers define the related concepts: MPCC (an MPC combine instance together with its blending configuration), MAX_SURFACES, and MAX_PLANES.

Terminology: MAX_SURFACES vs MAX_PLANES

  • MAX_SURFACES is intended as an upper bound for the number of surfaces that can be piped to a single CRTC/stream — a per‑stream figure.
  • MAX_PLANES is an upper bound that describes how many hardware planes the ASIC supports overall — a per‑ASIC figure.
The distinction is crucial: a single CRTC/stream may be limited to only a subset of the total planes an ASIC exposes. Misusing the per‑stream constant when allocating storage intended to cover all planes across the ASIC creates an out‑of‑bounds risk when a real board / topology exposes more planes than the per‑stream limit.

What went wrong: the technical root cause

The vulnerable code sized the array named mpc_combine with MAX_SURFACES. In practice, mpc_combine records all planes per ASIC when the driver programs the MPC combine hardware. Because MAX_SURFACES is defined per stream and is the smaller value, it is the wrong bound: when a device or topology exposes more planes than the per-stream limit (the upstream commit message cites configurations with more than 3 planes), the driver can write past the end of mpc_combine, a classic array overflow. The fix is simple and surgical: use MAX_PLANES as the array size so it matches the hardware-wide plane count.
Why this is more than a pedantic code nit:
  • Out-of-bounds writes in driver code run in privileged kernel context and can cause memory corruption, kernel oopses, crashes, or worse, depending on the memory layout and surrounding code.
  • Hardware topologies for modern GPUs are not uniform, so vendor code must treat per-ASIC constants differently from per-stream limits. A mismatch often only surfaces on particular boards or configurations, making the bug rare but real in the field.
  • Small arithmetic or indexing mistakes like this are easy to miss in code review, yet they can expose a local denial-of-service vector: an attacker with local access (or an untrusted tenant) may be able to provoke crashes.
The upstream correction replaces the incorrect bound with the hardware‑wide constant; the patch was merged into the stable kernel trees. OSV and distribution trackers list the stable commits and affected git ranges.
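The sizing mismatch can be modeled outside the kernel. The following is a minimal sketch, not driver code: the constant values, the FixedArray class, and record_planes are all hypothetical and chosen only to show how a per-stream bound fails once the plane count exceeds it.

```python
# Illustrative values only: the real constants live in the amdgpu display
# headers and may differ between kernel versions.
MAX_SURFACES = 3   # assumed per-stream limit
MAX_PLANES = 6     # assumed per-ASIC limit


class FixedArray:
    """Fixed-size buffer that refuses out-of-range writes.

    C performs no such check, which is why an undersized array in a
    driver can be overrun without any immediate error.
    """

    def __init__(self, size):
        self.slots = [False] * size

    def write(self, index, value):
        if not 0 <= index < len(self.slots):
            raise IndexError(f"slot {index} exceeds size {len(self.slots)}")
        self.slots[index] = value


def record_planes(array_size, plane_count):
    """Record one entry per plane; return the first overflowing index, or None."""
    mpc_combine = FixedArray(array_size)
    for plane in range(plane_count):
        try:
            mpc_combine.write(plane, True)
        except IndexError:
            return plane
    return None
```

With four planes, record_planes(MAX_SURFACES, 4) reports index 3 as the first write that falls outside the array, while record_planes(MAX_PLANES, 4) completes and returns None.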

Affected versions and patch status

Multiple vulnerability trackers and major distributions have recorded the issue, its fix and the kernel commits that remediate it.
  • NVD / OSV list CVE‑2024‑26914 with a short technical description and reference to the upstream stable commits that implement the fix. OSV shows the commit hashes associated with the remediation and the git events that introduced and fixed the problem.
  • Ubuntu and SUSE both published CVE entries and mapping information for affected kernels and the distribution package statuses; Ubuntu’s advisory lists the CVSS 3.1 score used by that tracker. Amazon’s ALAS and other vendor feeds also indexed the CVE and published their assessment of affected package sets. These distribution advisories indicate the fix has been incorporated into vendor packages where appropriate.
Observed vendor CVSS variants:
  • Some trackers (Ubuntu, Enginsight/NVD-derived entries) report a higher CVSSv3 base — Ubuntu lists 7.8 (High) — but other vendor feeds report lower numeric values (for example, Amazon’s listing uses 5.5). Differences reflect distinct scoring assumptions (impact vectors, whether confidentiality/integrity loss is plausible), and operators should prioritize by exposure and context rather than a single numeric score.
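One way to see how two feeds can arrive at 7.8 and 5.5 is to run the CVSS v3.1 base formula on two vectors that differ only in the confidentiality and integrity impacts. The vectors used here (AV:L/AC:L/PR:L/UI:N/S:U with C/I/A of H/H/H versus N/N/H) are an assumption made for illustration and are not quoted from either advisory; the weights and rounding rule come from the CVSS v3.1 specification.

```python
import math

# Metric weights from the CVSS v3.1 specification (scope unchanged).
AV_LOCAL, AC_LOW, PR_LOW, UI_NONE = 0.55, 0.77, 0.62, 0.85
IMPACT = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(value):
    """Round up to one decimal place, following the v3.1 Roundup definition."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(c, i, a):
    """Base score for AV:L/AC:L/PR:L/UI:N/S:U with the given C/I/A levels."""
    iss = 1 - (1 - IMPACT[c]) * (1 - IMPACT[i]) * (1 - IMPACT[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV_LOCAL * AC_LOW * PR_LOW * UI_NONE
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))
```

Under these assumed vectors, base_score("H", "H", "H") evaluates to 7.8 and base_score("N", "N", "H") to 5.5, so the gap is consistent with one scorer assuming full confidentiality and integrity loss and the other assuming availability loss only.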

Exploitability and practical impact

Attack vector and prerequisites

  • Attack vector: Local. The problematic path is in the DRM AMD display driver; exploiting it requires code that exercises the display programming path in a way that triggers the out‑of‑bounds write. In many setups, such operations are accessible to userland graphics stacks (compositors, GPU test utilities) or to unprivileged processes that have access to DRM device nodes.
  • Privileges required: Typically low — in some desktop or embedded configurations, unprivileged processes can trigger the relevant ioctls indirectly (for example via compositors or sandboxed GPU runtimes). For hardened servers, access to /dev/dri is commonly restricted, raising the bar.
  • Complexity: Low to moderate. The error is a deterministic indexing/overflow bug when the driver handles more planes than the smaller constant expects, given the right hardware topology.

Realistic impact

  • Primary impact: Availability (DoS). Kernel memory corruption or oops caused by an out‑of‑bounds write can crash the driver or the kernel and force reboots or long‑running instability.
  • Secondary impact: Information disclosure or privilege elevation is theoretically possible only in complex exploitation chains; the immediate, practical result is instability or a crash. No confirmed reliable remote code‑execution exploits tied to CVE‑2024‑26914 were published in public advisories at the time distributions listed the fix, and vendors treat the issue primarily as an availability risk. Operators should treat claims of RCE as unverified unless a reliable proof‑of‑concept is available.

Who should prioritize patching

  • Desktop and workstation fleets that expose GPU device nodes to untrusted users or run unprivileged workloads which interact with DRM.
  • Multi‑tenant hosts or CI/CD runners where untrusted guests or containers might gain access to GPU devices.
  • Embedded platforms and vendor images (IoT, kiosks, appliances) that embed AMD GPUs and often have longer update cycles; these environments can remain vulnerable long after desktop distributions ship fixes. Distribution and vendor advisories repeatedly note the long‑tail risk for embedded kernels.

Detection, indicators, and triage

Logs and observable signals

  • Kernel oops or panic messages referencing the AMD DRM driver (amdgpu), MPCC/MPC configuration, or stack traces in display programming paths are the canonical indicators that this class of error was hit.
  • Repeated compositor crashes, pageflip timeouts, black screens or graphical corruption consistent with driver instability point to a display driver problem that warrants immediate log capture. User complaints of frequent X/Wayland crashes, or watchdog timeouts during GPU‑heavy workloads, should be triaged against kernel logs.

Steps to triage a suspected incident

  • Preserve kernel logs: save dmesg and serial console output immediately.
  • Identify the process that triggered the ioctl (call traces in an oops usually include the userland process, e.g., Xwayland or a compositor).
  • Confirm the running kernel version and whether the distribution kernel package includes the upstream fix (check package changelogs or the vendor’s CVE tracker).
  • Reproduce in an isolated, instrumented lab with the same kernel if possible; avoid running PoC exploit code in production.
Guidance compiled by incident response teams and published operational notes recommend restricting untrusted access to /dev/dri, avoiding device passthrough to untrusted containers, and collecting full oops traces for correlation with upstream commits. These are practical, low‑risk compensations while patches are obtained.
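Once kernel logs are preserved, a first pass over them can be automated. The sketch below flags fault lines that have a mention of the AMD DRM driver nearby. The regular expressions and the find_suspect_windows helper are heuristics assumed for illustration, not an official signature for this CVE, and they will need tuning for a given log format.

```python
import re

# Heuristic patterns, chosen for illustration only.
FAULT = re.compile(r"BUG:|Oops|general protection fault|Kernel panic", re.IGNORECASE)
DRIVER = re.compile(r"amdgpu|drm", re.IGNORECASE)


def find_suspect_windows(lines, context=20):
    """Return (line_number, text) for fault lines with a driver mention nearby."""
    hits = []
    for idx, line in enumerate(lines):
        if not FAULT.search(line):
            continue
        # Look a fixed number of lines before and after the fault line.
        window = lines[max(0, idx - context): idx + context + 1]
        if any(DRIVER.search(entry) for entry in window):
            hits.append((idx + 1, line.rstrip()))
    return hits
```

It can be fed the lines of a saved dmesg capture, for example the result of reading the file and calling splitlines(). A hit is only a pointer to where a human should read the full trace and is not proof that this specific bug was triggered.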

Remediation and mitigations

Recommended immediate actions

  • Identify affected hosts: run uname -r and inspect whether the AMDGPU/DRM modules are loaded (lsmod | grep amdgpu). Check whether /dev/dri devices are accessible to untrusted users or bound into containers.
  • Apply vendor/distribution kernel security updates that list CVE‑2024‑26914 or that include the upstream stable commits. Rebooting into the patched kernel is required to activate the fix.
  • If vendor packages are not yet available for particular embedded devices, contact the vendor for an ETA on a firmware/kernel image that includes the commit, or rebuild the kernel with the upstream stable commit cherry‑picked into the device kernel tree and validate on representative hardware.
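For fleets where running commands by hand does not scale, the identification step can be scripted. This is a minimal sketch that gathers the same facts as uname -r, lsmod | grep amdgpu, and a listing of /dev/dri; the function names and the returned dictionary layout are hypothetical. It relies on /proc/modules, so a driver built into the kernel image rather than loaded as a module would not be detected this way.

```python
import platform
from pathlib import Path


def module_loaded(proc_modules_text, name="amdgpu"):
    """True if `name` is the first field of any line of /proc/modules content."""
    return any(
        line.split()[:1] == [name] for line in proc_modules_text.splitlines()
    )


def host_summary(proc_modules=Path("/proc/modules"), dri_dir=Path("/dev/dri")):
    """Collect kernel release, amdgpu module state, and DRM node names."""
    text = proc_modules.read_text() if proc_modules.exists() else ""
    nodes = sorted(p.name for p in dri_dir.iterdir()) if dri_dir.is_dir() else []
    return {
        "kernel_release": platform.release(),  # same value as `uname -r` on Linux
        "amdgpu_loaded": module_loaded(text),
        "dri_nodes": nodes,
    }
```

Calling host_summary() on each host and collecting the results gives a quick list of machines that both load amdgpu and expose DRM nodes, which is the population to compare against the vendor's fixed kernel versions.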

Temporary compensations if immediate patching is impossible

  • Restrict access to DRM devices: adjust udev rules to limit /dev/dri/* to a trusted group, remove device nodes from untrusted containers, and block untrusted processes from loading or interacting with DRM drivers.
  • Isolate high‑value hosts from untrusted users and avoid allowing non‑admin users to run GPU stress or driver test tools.
  • Maintain heightened monitoring of kernel logs and implement SIEM rules to alert on amdgpu / drm oops patterns. Operational guidance developed for similar kernel display vulnerabilities recommends these compensations while updates are staged.
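Before tightening udev rules, it helps to know which DRM nodes are currently open to every user. The following sketch checks only the classic mode bits; the helper names are hypothetical, and access granted through ACLs or group membership is not covered, so treat an empty result as a starting point and not as proof of isolation.

```python
import stat
from pathlib import Path


def world_accessible(mode):
    """True if the 'other' permission class can read or write the node."""
    return bool(mode & (stat.S_IROTH | stat.S_IWOTH))


def audit_dri(dri_dir="/dev/dri"):
    """Return names of character devices under dri_dir that any user can open."""
    base = Path(dri_dir)
    if not base.is_dir():
        return []
    flagged = []
    for node in sorted(base.iterdir()):
        info = node.stat()
        # Skip subdirectories and anything that is not a character device.
        if stat.S_ISCHR(info.st_mode) and world_accessible(info.st_mode):
            flagged.append(node.name)
    return flagged
```

Any name returned by audit_dri() is a candidate for restriction to a trusted group, subject to what the local graphics stack needs in order to keep working.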

Backporting advice for vendors and integrators

  • The upstream fix is intentionally small; kernel maintainers favored a surgical change to the allocation size rather than invasive restructuring. That makes the patch suitable for safe backports into stable branches. Vendors building device images should cherry‑pick the stable commit and perform hardware smoke tests across their supported board topologies. Distribution changelogs should be consulted to verify the presence of the exact stable commit ID before declaring systems remediated.
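Checking a changelog for the fix can be reduced to a text search once the changelog has been exported, for instance with rpm -q --changelog or apt changelog on the kernel package. The helper below is a hypothetical sketch: it takes the changelog text and a list of identifiers (the CVE ID, plus whatever stable commit ID the vendor tracker lists) and reports which ones appear. It also normalizes non-breaking hyphens, since identifiers pasted from web pages often contain them.

```python
def changelog_mentions(changelog_text, identifiers):
    """Map each identifier (CVE ID or commit hash prefix) to whether it appears."""

    def normalize(text):
        # Treat U+2011 (non-breaking hyphen) as a plain hyphen, ignore case.
        return text.replace("\u2011", "-").lower()

    haystack = normalize(changelog_text)
    return {ident: normalize(ident) in haystack for ident in identifiers}
```

A result such as {"CVE-2024-26914": True} is a useful signal, but a missing match does not prove the fix is absent: some vendors reference only the commit subject or an internal bug number, so the vendor's CVE tracker remains the authority.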

Critical analysis: strengths of the fix and residual risks

Strengths

  • The upstream fix is surgical and minimal: changing the array dimension constant to the correct per‑ASIC bound is low risk and straightforward to backport, which reduces regression potential and eases distribution adoption.
  • Multiple independent trackers and distribution advisories have indexed the CVE and mapped it to the upstream stable commits, so operators have clear artifacts (commit IDs, package changelogs) to verify remediations.

Residual risks

  • Vendor lag and the embedded long tail: OEM/embedded kernels often lag upstream; devices in the field (appliances, thin clients, set‑top boxes) might not receive timely updates and remain vulnerable. This is the dominant operational blind spot for kernel driver issues.
  • Exposure by configuration: Systems that deliberately expose DRM devices to untrusted containers or users (CI runners, multi‑tenant GPU hosts) remain high risk unless mitigations are enforced. Administrators should reassess device passthrough policies accordingly.
  • Detection blind spots: Kernel oopses are local, operational artifacts; environments that don’t preserve kernel ring buffers, serial logs, or crash dumps may fail to detect and triage repeated exploitation attempts.

On severity scores and prioritization

Public feeds show a split in numeric severity values: published CVSS 3.x scores range from 5.5 to 7.8. This divergence underlines a critical operational point: numeric CVSS is only one triage input. Prioritize by exposure (is /dev/dri accessible to untrusted actors?), service criticality (multi-tenant hosts or production VDI), and device lifecycle (embedded fleet updateability), rather than by a single absolute CVSS number.
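The exposure-first ordering can be expressed as a simple sort. This is a sketch under stated assumptions: the Host fields and the relative weight of the three factors are hypothetical choices made for illustration, and a real fleet would likely need finer-grained inputs.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    dri_exposed_to_untrusted: bool  # can untrusted users or containers reach /dev/dri?
    multi_tenant: bool              # shared or production-critical service?
    slow_update_cycle: bool         # embedded or vendor image with long update lag?


def triage_order(hosts):
    """Sort hosts so the most exposed come first; ties keep their input order."""

    def key(host):
        return (
            host.dri_exposed_to_untrusted,
            host.multi_tenant,
            host.slow_update_cycle,
        )

    return sorted(hosts, key=key, reverse=True)
```

Exposure is compared first, so a host with an open /dev/dri always ranks above one without it, regardless of the other two factors.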

Practical checklist for administrators (actionable)

  • Inventory: Find machines with AMD GPU drivers loaded: uname -r; lsmod | grep amdgpu; list /dev/dri permissions.
  • Confirm: Check vendor/distribution security tracker or kernel package changelog for CVE‑2024‑26914 or the upstream commit IDs before upgrading.
  • Patch: Install vendor kernel updates that include the fix; reboot to the new kernel.
  • Validate: Run representative display workloads and compositor tests for 24–72 hours to ensure no regressions and no repeat oops.
  • Compensate: Until patched, restrict /dev/dri access to trusted groups and avoid device passthrough into untrusted containers.

Conclusion

CVE‑2024‑26914 is a compact example of how a single incorrect constant — using a per‑stream limit where a per‑ASIC bound was required — can produce a real kernel memory corruption vulnerability in a widely used graphics driver. The upstream patch corrects the allocation by using MAX_PLANES for the mpc_combine array, closing the overflow window. Distribution and vendor advisories have indexed the fix and released updates; operators should treat this primarily as an availability risk and prioritize remediation according to exposure, device lifecycle, and multi‑tenant risk. The fix is low‑risk and straightforward to backport, but the persistent operational concern remains the long tail of vendor kernels and embedded devices that may not receive timely updates — those fleets must be inventoried and remediated with higher urgency.
Source: MSRC Security Update Guide - Microsoft Security Response Center