A subtle race in the Linux kernel’s VFIO PCI interrupt handling was assigned CVE-2024-27437 after maintainers discovered that legacy INTx interrupts could be left permanently disabled for affected devices, causing a persistent availability failure for passthrough devices. The fix inverts the request/enable logic using IRQF_NO_AUTOEN so that exclusive INTx lines are never auto-enabled at registration.
Source: MSRC Security Update Guide - Microsoft Security Response Center
Background / Overview
VFIO (Virtual Function I/O) is the kernel framework that enables safe userspace drivers — most commonly used by QEMU/KVM for PCI device passthrough — to take direct control of physical hardware. VFIO exposes device regions and interrupt delivery to userspace while enforcing IOMMU isolation and other protections. For modern PCI devices, MSI and MSI‑X are the preferred interrupt mechanisms; legacy devices that still use INTx (the old shared hardware IRQ lines) require different handling because the interrupt masking semantics are performed at the irqchip level rather than by the device itself.

The CVE describes a logic and timing problem in VFIO’s INTx setup path for devices without DisINTx support (that is, devices that cannot themselves atomically disable legacy INTx delivery). Historically VFIO would register an IRQ handler with the kernel using the standard request_irq() path; request_irq() by default enables the IRQ during registration. VFIO then checked whether the device should be masked and, if so, disabled the IRQ. That small window — between request_irq()’s implicit enable and VFIO’s subsequent decision to mask — could allow a hardware interrupt to fire. If an interrupt ran at that moment, the kernel’s internal IRQ disable/enable depth counter for that IRQ could be incremented an extra time by the handler, and VFIO’s later mask/unmask logic could leave the line in a state where user requests to re‑enable it become ineffective. The result is a durable denial of service for that device until the interrupt state (or the host) is reset.
The vulnerability was triaged with a medium severity score by many vendors (CVSSv3 ~4.4) because the impact is availability rather than code execution or data compromise; however, the loss of device availability in virtualization and passthrough scenarios is operationally significant for hosts running GPU, NIC, or storage passthrough. Multiple vendor advisories and kernel stable backports applied the fix across distributions.
What went wrong — a technical breakdown
INTx versus MSI/MSI‑X and DisINTx
- INTx are legacy PCI interrupt lines that are often shared among multiple devices and are masked/unmasked by the irqchip controller.
- MSI/MSI‑X are modern per‑device message‑signaled interrupts with clearer per‑vector control and a far lower risk of this particular class of race.
- DisINTx is the Interrupt Disable bit in the PCI command register (introduced with PCI 2.3) that allows a device to mask its own INTx assertion; devices that implement DisINTx relieve the irqchip of some responsibilities and are not subject to the same VFIO masking race.
The race and the disable depth
Linux’s IRQ core tracks a per‑IRQ disable depth (a nesting counter): each call to disable_irq()/disable_irq_nosync() increments the depth; each enable_irq() decrements it. When the depth transitions from 0→1, the irqchip disable path is invoked; when it transitions from 1→0, the line is re‑enabled.

The vulnerable sequence looked like this in conceptual terms:
- VFIO calls request_irq() to register its handler for the device’s INTx. By default request_irq() auto‑enables the IRQ line.
- VFIO then evaluates whether the VFIO context requires the IRQ to be masked (ctx->masked). If masked is true, VFIO calls disable_irq() to align the irqchip state with its masked flag.
- If an interrupt fires between (1) and (2), the handler itself masks the line (for exclusive INTx, VFIO’s handler masks at the irqchip with disable_irq_nosync()) — increasing the disable depth once.
- When VFIO subsequently calls disable_irq() to reflect the masked state, the disable depth increments again.
- The resulting disable depth is now greater than expected (e.g., 2) and VFIO’s masked flag may prevent nested re‑enable attempts from user space, leaving the device’s IRQ line effectively stuck disabled even after userspace requests to unmask. The device loses the ability to receive interrupts, and passthrough I/O can fail indefinitely.
The fix: invert and control auto‑enable
To remove the window entirely, maintainers inverted the logic: request the IRQ without allowing request_irq() to auto‑enable it, then explicitly enable the line only when VFIO has determined that the device should be unmasked. The kernel provides the IRQF_NO_AUTOEN flag for request_irq()/devm_request_threaded_irq() to request registration without automatic enabling; this flag was designed precisely for drivers with teardown or race sensitivity, where calling disable immediately after request is racy.

The patch adds IRQF_NO_AUTOEN for exclusive INTx registrations in the VFIO INTx setup path and then calls enable_irq() only when VFIO’s internal masked state indicates the line should be active. That eliminates the race: the IRQ cannot fire before VFIO finishes its internal setup and masking decisions. The upstream patch and stable backports carry the commit message and diffs.
Scope and impact — who should care
- Primary affected group: Hosts that use VFIO PCI passthrough with devices that rely on legacy INTx interrupts and that do not implement DisINTx. This includes some older NICs, legacy GPUs, and certain embedded or vendor‑specific devices.
- Operational impact: A local operation (or a guest with direct device access) can trigger a persistent device unavailability condition. For production virtualization hosts that rely on passthrough devices for high‑availability workloads, this can be disruptive and may require a host reboot or device rebind to recover.
- Exploit prerequisites: The vulnerability is local (not remotely exploitable in the general sense) and requires the attacker to be in a position to control or trigger device interrupts — for example, a privileged user on the host, or a guest that has been granted direct VFIO access to a device. Vendor CVE metadata indicates high privileges are typically required for exploitation. Administrators should therefore treat it as a local availability risk in multi‑tenant or shared environments.
Mitigating factors:
- Many modern devices and configurations prefer MSI/MSI‑X rather than legacy INTx, which makes them unaffected.
- DisINTx‑capable devices do not require the same irqchip masking path and therefore generally avoid this race.
Detection, indicators, and operational symptoms
This class of failure typically shows up as device I/O failures, hung guests, or kernel log messages indicating the kernel has disabled an IRQ. Common operational indicators:
- Guests show lost interrupt‑driven I/O (blank video on GPU passthrough, NICs stop receiving packets).
- dmesg or kernel logs contain messages like “Disabling IRQ #N” or other irq‑related warnings; Linux syslogs referencing IRQ disable/enable activity are a clue.
- Attempts to reconfigure or unmask the device from userspace (for example, via the VFIO_DEVICE_SET_IRQS ioctl) appear to succeed, but interrupts do not resume.
- In some cases, reloading the vfio_pci module or reattaching the device may temporarily clear the condition; full recovery sometimes requires a host reboot or re‑bind sequence because the irq_desc disable depth state lives in kernel core structures.
Mitigation and remediation
The definitive remediation is to run a kernel that includes the VFIO INTx patch (the commit titled “vfio/pci: Disable auto‑enable of exclusive INTx IRQ”) or an equivalent distribution backport. Vendors and distributors issued errata and kernel updates that include this fix.
- Apply vendor kernel updates or distribution packages that include the patch. Major distributors published advisories and backports (for example, Amazon Linux, Oracle Linux, SUSE and others) and listed the kernel updates that remediate the issue. Prioritize hosts that run VFIO passthrough workloads.
- Prefer MSI/MSI‑X where possible. If the device and driver support it, using MSI/MSI‑X removes the reliance on legacy INTx and thereby removes exposure to this race.
- Avoid assigning INTx‑only devices to guests where untrusted user code could trigger interrupts. Where possible, postpone passthrough of legacy devices or require elevated assurance before enabling passthrough for tenants.
- On a host suffering from a stuck INTx line, an immediate but blunt recovery is to rebind the device (unbind from vfio_pci and rebind) or, if that fails, reboot the host. Note that these actions can themselves be disruptive to running workloads.
- Monitor kernel logs for IRQ disable messages and correlate with guest incidents. Logging and alerting on suspicious interrupt disable events will speed detection.
The fix in the tree — what maintainers changed
The patch, authored and signed off in upstream kernel discussions, makes a small but crucial change:
- For devices that require exclusive INTx masking at the irqchip (devices lacking DisINTx support), VFIO now calls request_irq() with IRQF_NO_AUTOEN so that request_irq() does not auto‑enable the IRQ.
- VFIO then performs its locking and masked/unmasked decision. Only when VFIO determines the IRQ should be active does it call enable_irq() explicitly.
- This inverted logic eliminates the race window where an interrupt might fire before VFIO completes setup.
Practical recommendations — checklist for operators
- Inventory: Identify hosts running VFIO passthrough and enumerate devices that are using legacy INTx rather than MSI/MSI‑X. Prioritize systems with GPU, NIC or storage passthrough roles.
- Patch: Schedule application of distribution kernel updates that include the CVE fix. Use vendor advisory IDs (from your distro) to find the correct kernel packages.
- Test: Validate the kernel update in a staging environment that mirrors your production passthrough workloads. Confirm that devices attach, detach, and that guest performance is acceptable.
- Hardening: Where possible, migrate devices to MSI/MSI‑X or choose devices that implement DisINTx capabilities.
- Monitoring: Add alerts for kernel IRQ‑related messages and unexpected device unavailability events. Correlate with guest logs and device rebind sequences to detect incidents early.
- Recovery plan: Prepare a documented procedure for reattaching/rebinding devices or scheduling a host reboot in case an affected device becomes stuck in a disabled state.
Assessment — strengths, risks, and broader lessons
The patch itself is small and surgical: it uses an existing IRQ flag and inverts a short sequence of operations. That kind of minimal, well‑reasoned change is a strength; it avoids invasive redesign and fixes the root cause (the enable/mask race).

Notable strengths:
- The fix uses a kernel‑sanctioned mechanism (IRQF_NO_AUTOEN) that is already used by other drivers to avoid similar races, which reduces the chance of regressions.
- The vulnerability does not permit remote code execution or data exfiltration; the impact is availability only, making the remediation focus relatively straightforward (patch kernels, reboot hosts if necessary).
- The change was reviewed and backported to stable trees quickly, and major distributors published updates, enabling operators to remediate in place.
Risks and broader lessons:
- The attack surface is local and requires device control, but in multi‑tenant clouds or shared virtualization environments a guest with passthrough access can become an effective local attacker. Operators must therefore treat passthrough assignment as a trust boundary.
- Legacy devices and vendor‑specific embedded devices remain a recurring source of subtle kernel complexity; the prevalence of INTx devices in certain verticals (industrial, embedded) means the issue isn’t purely theoretical.
- Rebinding or rebooting to recover from an incident is disruptive; environments with strict uptime SLAs must weigh the operational cost and plan accordingly.
Final words
CVE‑2024‑27437 is a textbook example of a correctness‑and‑timing vulnerability that manifests as a high‑impact, low‑complexity denial of service for affected devices. The patch is concise and appropriate, and vendors have provided backports and advisories; the practical remediation for operators is straightforward: identify affected hosts and devices, apply vendor kernel updates, and prefer MSI/MSI‑X or devices with DisINTx where practical. Because the vulnerability affects device availability — a critical operational property in virtualized data centers and single‑tenant high‑performance systems — administrators should treat this as a priority patch for any host that uses VFIO passthrough.