Linux Kernel CVE-2025-68745: qla2xxx DMA Unmap Race Reverted and Fixed

The Linux kernel received a targeted fix for CVE-2025-68745, a qla2xxx driver regression that left SCSI target commands stuck after a chip reset and introduced dangerous race conditions around DMA unmapping. Maintainers responded by reverting the offending changes and applying a surgical correction to the abort path.

Background

The qla2xxx driver family implements host bus adapter (HBA) support for QLogic Fibre Channel adapters in the mainline Linux kernel. Recent maintenance and feature work touched target-mode handling and DMA cleanup paths in that driver. Two upstream commits, one meant to improve offline-port and host-reset handling and one meant to fix missed DMA unmaps, inadvertently introduced two regressions: commands that had been sent to firmware (FW) could be left “stuck” after a chip reset because FW would no longer respond, and a race in the DMA unmap path could trigger kernel BUGs when two CPUs tried to unmap the same scatter-gather list. The regressions are tracked together as CVE-2025-68745. Upstream reverted the commit that changed DMA unmap behavior and partially reverted the host-reset/offline-port commit in the abort-all-commands path (__qla2x00_abort_all_cmds), while adding logic to explicitly clear commands after a chip reset so that they cannot leak while waiting for a response that will never come.

Technical summary: what went wrong

The two problematic commits

Two commits are central to the failure mode:
  • An upstream commit identified by the patch message scsi: qla2xxx: target: Fix offline port handling and host reset handling altered how the driver handled host resets and port offline conditions. That change caused command entries that were submitted to firmware around a chip reset to become orphaned and never freed, because the firmware would not reply after reset and the driver’s cleanup path no longer guaranteed reclamation.
  • A subsequent commit described as scsi: qla2xxx: Fix missed DMA unmap for aborted commands tried to ensure aborted commands always had their DMA mappings unmapped. But the attempted fix created a new race: two CPUs could concurrently call the unmap helper (qlt_unmap_sg), tripping BUG_ON(!valid_dma_direction(dir)) inside dma_unmap_sg_attrs under certain timings. That condition can trigger oopses and kernel crashes.
Together these interactions produced two concrete problems: memory and resource leaks that left qla2xxx commands indefinitely queued after a reset (an availability hazard), and unsafe DMA unmap races that could cause BUG_ON assertions and panic conditions (a system stability and safety hazard).

The fix applied upstream

Rather than attempting more invasive rework of command lifecycle logic, the maintainers reverted the second commit that introduced the aggressive unmap behavior and partially reverted the first problematic commit in the very specific abort path (__qla2x00_abort_all_cmds). The upstream patch thus restores a safer cleanup sequence that:
  • Ensures commands that will never get replies after a chip reset are explicitly cleared and freed instead of left waiting for firmware that will not respond; and
  • Removes or tones down the concurrent unmap sequence that could race between CPUs, eliminating the dma_unmap_sg_attrs BUG_ON under the conditions reported.
The patchset author and reviewers described the change as a partial revert plus a focused insertion of cleanup logic; reviewers emphasized that this approach minimizes regression risk while addressing the two root causes simultaneously.

Operational impact and risk analysis

Who is affected

  • Systems that load the qla2xxx module and operate in SCSI target or initiator roles using QLogic Fibre Channel HBAs are the relevant population. That includes storage servers, SAN gateways, some hypervisor hosts, and appliances that expose FC-based block storage to guests. Vendors and distributions that ship kernels with qla2xxx compiled in or as a module must map the upstream change into their packages.
  • Windows-centric infrastructure teams should pay attention if they run Linux guests, containers, or specialized storage appliances alongside Windows infrastructure — a Linux kernel panic or stuck I/O in a guest or appliance can cascade into cross-platform service degradation (backup failures, replication stalls, or SAN-controlling host instability). This is a practical concern for mixed estates and hybrid data centers.

Severity and exploitability

  • The public reporting and tracker entries indicate the impact is primarily availability and stability: leaked commands can lead to hung I/O and degraded throughput, and the DMA unmap race can cause kernel assertions and panics. Public sources classify the issue in the medium-to-high severity band based on CVSS v3-like vectors (one public tracker lists a v3 score of 7.1 with a local attack vector requiring privileges or low-complexity local conditions).
  • There is no reliable public proof-of-concept that elevates this to a remote code execution or straightforward privilege-escalation vulnerability at the time of disclosure; the exploitability requires local or privileged access plus the ability to drive FC I/O or provoke the chip reset/abort sequences under specific timing windows. Treat remote RCE assertions as unverified until a reproducible exploit chain is published.

Why this matters operationally

  • A stuck command pool or repeated kernel BUG_ON conditions in a storage driver can force host reboots or leave services blocked; for storage backends this can cascade to application outages affecting Windows services that rely on those Linux-hosted storage components. For virtualization hosts, a kernel panic in a host that provides virtualized storage can affect many tenants. Practical operational impact is real even if the vulnerability is “only” local in vector.

Detection and triage: what to look for

Quick detection is straightforward if you know the signatures to hunt for:
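  • Kernel log patterns: search dmesg or journalctl -k for kernel BUG at, dma_unmap_sg_attrs, or backtraces mentioning qlt_free_cmd / qlt_unmap_sg. The failing assertion is BUG_ON(!valid_dma_direction(dir)); in the log it normally shows up as a “kernel BUG at” line followed by a backtrace rather than as the literal text BUG_ON. These indicate the unmap race or assertion triggers.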
  • Hung I/O symptoms: long-running SCSI commands, iSCSI or FC target timeouts, or persistent waits in SCSI target work queues after an HBA reset. Look for kernel messages around chip reset events where commands are flushed or not completed.
  • Correlate with HBAs: lspci to identify QLogic devices; lsmod | grep qla2xxx and modinfo qla2xxx to verify module presence and version. Match running kernel package to vendor advisories or kernel changelogs to confirm whether the affected upstream commit made it into your distribution’s build.
Detection commands (examples):
  • journalctl -k | grep -Ei 'qlt_free_cmd|qlt_unmap_sg|dma_unmap_sg_attrs|BUG_ON|kernel BUG at'
  • dmesg | tail -n 200 | grep -Ei 'qla|BUG_ON|kernel BUG at|BUG:'
  • uname -r; modinfo qla2xxx
If you observe kernel oopses or explicit BUG_ON traces referencing DMA unmap helpers, treat the host as high priority for remediation.
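The hunt above can be wrapped in a small reusable filter. What follows is a minimal sketch, not a vetted detection rule: scan_qla_log is a hypothetical helper name, and its pattern list is an assumption that combines the function names mentioned above with the “kernel BUG at” banner the kernel usually prints when a BUG_ON fires. It reads log text from standard input, so it can be fed from journalctl -k, dmesg, or a saved log file.

```shell
#!/bin/sh
# Hypothetical helper: print kernel log lines that match the signatures
# discussed above. Reads log text on stdin and exits 0 only if at least
# one line matched.
scan_qla_log() {
    grep -Ei 'qlt_free_cmd|qlt_unmap_sg|dma_unmap_sg_attrs|kernel BUG at|BUG:'
}

# Example use on a live host (needs permission to read the kernel log):
#   journalctl -k --no-pager | scan_qla_log
#   dmesg | scan_qla_log
```

Because the helper only filters text, a match is a prompt for closer inspection of the surrounding backtrace, not proof that this specific CVE was hit.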

Remediation and mitigation

Definitive fix

  • Install a kernel package from your distribution that contains the upstream reversion/patch addressing CVE-2025-68745. Upstream maintainers reverted the problematic unmap commit and partially reverted the host-reset commit in the abort path; distributions will map those stable commits into their kernel packages and security updates. Check vendor advisories or distribution trackers (Debian, Ubuntu, RHEL-family, SUSE, etc.) for the exact package name and version that includes the fix.
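One way to cross-check an installed kernel package is to search its changelog for the CVE identifier. The sketch below is illustrative only: changelog_mentions_cve is a hypothetical helper, and the package names and changelog paths in the usage comments are assumptions that vary by distribution. Some vendors do not list CVE IDs in changelogs at all, so a miss does not prove the host is unpatched; the vendor advisory remains authoritative.

```shell
#!/bin/sh
# Hypothetical helper: succeed if changelog text on stdin mentions the CVE.
# The CVE ID defaults to CVE-2025-68745 and can be overridden as $1.
changelog_mentions_cve() {
    grep -q -F -- "${1:-CVE-2025-68745}"
}

# Example use (pick the line that matches the package manager; the package
# name and the changelog path are assumptions and differ between distros):
#   rpm -q --changelog "kernel-$(uname -r)" | changelog_mentions_cve
#   zcat "/usr/share/doc/linux-image-$(uname -r)/changelog.Debian.gz" | changelog_mentions_cve
```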

Short-term mitigations (if patching is not immediately possible)

  • If the qla2xxx driver is not required on a host, unload or blacklist the module to eliminate exposure: rmmod qla2xxx (only when safe and when no device is actively using the driver) and add a blacklist file in /etc/modprobe.d/ to prevent future auto-loading. Validate that no production services rely on that driver before doing so.
  • Reduce the attack surface: restrict who can run local jobs that can trigger heavy Fibre Channel I/O (cron jobs or tenants that initiate SAN operations). In multi-tenant environments, consider isolating I/O-heavy guests onto patched hosts until you complete a full rollout.
  • Maintain heightened logging and monitoring around expected maintenance windows when chip resets or HBA resets occur; collect dmesg and vmcore (kdump) if a crash occurs for forensic analysis and vendor support.
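The blacklist step can be sketched as follows. This is a minimal example under stated assumptions: write_qla2xxx_blacklist and the file name blacklist-qla2xxx.conf are hypothetical choices, while the blacklist and install directives are standard modprobe.d syntax. The target directory is a parameter so the file can be generated and reviewed somewhere harmless before it is placed in /etc/modprobe.d.

```shell
#!/bin/sh
# Hypothetical helper: write a modprobe.d fragment that keeps qla2xxx from
# loading. $1 is the target directory (defaults to /etc/modprobe.d).
write_qla2xxx_blacklist() {
    dir="${1:-/etc/modprobe.d}"
    {
        echo '# Temporary mitigation for CVE-2025-68745; remove after patching.'
        echo 'blacklist qla2xxx'             # stop alias-based auto-loading
        echo 'install qla2xxx /bin/false'    # also refuse explicit modprobe
    } > "$dir/blacklist-qla2xxx.conf"
}

# Example use as root, only on hosts that do not need the HBA:
#   write_qla2xxx_blacklist
#   modprobe -r qla2xxx    # fails if the module is still in use
```

If the driver is also packed into the initramfs, the image may need to be regenerated with the distribution's own tool before the blacklist takes effect at boot.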

Recommended patching playbook (prioritized steps)

  • Inventory: identify all hosts that load qla2xxx (use lsmod, modinfo, lspci, and CMDB queries).
  • Cross-check: map each host’s kernel package to distribution security advisories for CVE-2025-68745 and confirm the upstream stable commit IDs are included.
  • Canary: apply the vendor-provided kernel update to a small canary group that mirrors production workloads and SAN topology. Run SAN validation, I/O soak tests, and simulate HBA resets where possible.
  • Rollout: update remaining hosts in staged windows, monitoring kernel logs and SAN controller telemetry for anomalous behavior.
  • Post-rollout validation: run targeted SCSI/FC tests and keep watching the logs for several days to confirm that no hangs or BUG traces appear.
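The inventory step lends itself to a small per-host check. The following is a sketch rather than a finished inventory tool: qla2xxx_loaded is a hypothetical helper that parses lsmod-style text from standard input, and the report format in the usage comment is an assumption. 1077 is the PCI vendor ID assigned to QLogic, which lspci can filter on.

```shell
#!/bin/sh
# Hypothetical helper: succeed if lsmod-style text on stdin lists qla2xxx
# as a loaded module (the module name is the first field of each line).
qla2xxx_loaded() {
    awk '$1 == "qla2xxx" { found = 1 } END { exit !found }'
}

# Example per-host report:
#   if lsmod | qla2xxx_loaded; then
#       echo "$(hostname) $(uname -r) qla2xxx loaded"
#       lspci -nn -d 1077:    # list QLogic PCI devices
#   fi
```

Matching on the first field avoids false positives from modules such as scsi_transport_fc that merely list qla2xxx as a user. A driver compiled into the kernel does not appear in lsmod output, so on such hosts the presence of /sys/module/qla2xxx is usually the better indicator.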

Why the upstream approach is pragmatic, and what to watch for

The maintainers chose a minimal-change path: revert the risky concurrency change and partially revert the host-reset logic only where cleanup occurs, while adding explicit command-clearing logic after chip reset. This is pragmatic for several reasons:
  • Small, surgical changes are easier to reason about, test, and backport across stable kernel branches; they also reduce regression risk relative to a broad rework of command lifecycle.
  • The root causes were lifecycle/race-condition problems that are notoriously subtle; the partial reversion limits the impact area while restoring correct invariants in the abort path and DMA cleanup.
However, operators should remain cautious:
  • Vendor/backport lag: enterprise distributions and OEM kernels apply backports on different schedules. A patched upstream does not guarantee the vendor kernel your host runs is immediately fixed. Confirm vendor advisories for package-level backporting and timelines.
  • Residual risks: race conditions tend to be contextual. The applied fix addresses the documented paths; other, unrelated timing windows could still be present and will require ongoing monitoring and testing in complex SAN topologies.

Practical checklist for WindowsForum readers (concise)

  • Inventory: find hosts using qla2xxx and classify them by criticality (storage controllers, hypervisors, SAN gateways).
  • Patch mapping: identify the patched kernel package from your distribution vendor’s security advisory and plan rollout.
  • Canary: test updates in a small set of hosts mirroring production SAN fabric and workload patterns.
  • Monitor: centralize kernel logs, look for BUG_ON and DMA unmap traces, and collect crash dumps for vendor triage if needed.
  • Mitigate if needed: unload/blacklist qla2xxx on non-essential hosts or isolate untrusted tenants from FC resources pending updates.

Final assessment and recommended timeline

CVE-2025-68745 is a stability-focused kernel vulnerability that arose from recent qla2xxx changes intended to handle offline ports and to tighten DMA unmap coverage. The upstream response (reverting the risky unmap change, partially reverting the offending host-reset semantics, and explicitly clearing commands after chip reset) is appropriate for the failure class and minimizes regression surface. Distributions and vendors have mapped, or will soon map, the upstream commits into security updates; applying these vendor updates is the sensible remediation path. For Windows-centric infrastructure teams that operate mixed estates, the operational priority is to identify Linux guests, virtual appliances, or storage nodes that depend on QLogic HBAs and to schedule patch windows accordingly. Where livepatching or vendor images with integrated backports are available, pursue those mechanisms; otherwise, stage full kernel updates through your standard maintenance and test cycles.
Be cautious about any claim that this issue yields immediate remote code execution; current public material attributes the impact to availability and kernel assertions, not an easy remote privilege-escalation vector. Continue to monitor vendor advisories for any escalation in CVSS, EPSS, or published exploit proof-of-concept artifacts; if such artifacts appear, elevate remediation urgency accordingly.
CVE-2025-68745 is a reminder that low-level driver changes that touch teardown, DMA unmap, and reset semantics can have outsized operational consequences. The best defense for mixed Windows/Linux estates is an inventory-driven patch program, conservative canary testing, and active kernel- and SAN-level monitoring so that rollouts are both safe and rapid.

Source: MSRC Security Update Guide - Microsoft Security Response Center
 
