Linux Kernel CVE-2025-21907 Fix: Unmap Poisoned Folio TTU Flag Update

The Linux kernel fix for CVE-2025-21907 closes a subtle but real correctness window in memory‑failure handling: the kernel now updates the TTU (try_to_unmap) flag inside unmap_poisoned_folio to ensure poisoned folios are consistently marked during unmap/migration operations, preventing spurious WARN/OOPS traces and potential denial‑of‑service in systems that exercise the affected reclaim/migration paths.

Background / Overview​

This vulnerability is a kernel memory‑management correctness bug, described upstream as "mm: memory‑failure: update ttu flag inside unmap_poisoned_folio". It was assigned CVE‑2025‑21907 and documented across major vulnerability databases and vendor advisories. The root cause is a timing/order inconsistency in how the kernel handles hwpoisoned folios (folios marked after a hardware memory error) while performing page migration or unmap operations; the fix moves the TTU (try_to_unmap) flag policy into unmap_poisoned_folio so that folio state and signal semantics remain correct during those sequences.

Why this matters operationally: the bug is availability‑centric. When the window is hit, the kernel emits a WARN or BUG_ON that often ends in an OOPS or panic, depending on configuration and runtime conditions. That behavior can disrupt services and, in multi‑tenant or cloud contexts, produce outsized impact because a single host crash affects many consumers. Several vendor trackers flagged this as an availability issue with differing CVSS scores, but in practice the risk is highest for hosts that run untrusted workloads or exercise advanced memory reclaim/migration flows.

Technical anatomy: what goes wrong​

The actors: folios, hwpoison and TTU​

  • A folio is a unit of page cache / memory management in modern kernels (a group of pages treated as one unit).
  • hwpoison marks folios that hardware or kernel error handling has quarantined because of an ECC or memory error.
  • TTU flags (try‑to‑unmap related flags) and helpers manage unmap semantics during migration, reclaim or user‑mapping removal sequences.
The bug appears in a specific migration/unmap code path: when a folio becomes hwpoisoned during reclaim or migration, the kernel must unmap user mappings and set appropriate TTU semantics so that later accesses produce SIGBUS (or otherwise behave as designed for error pages). The affected code path failed to consistently set a TTU indicator during the unmap_*/migration flows for anonymous folios and pagecache folios, leaving a window where mappings, page‑table state, and VMA flags were inconsistent. That inconsistency can trigger kernel diagnostic checks (WARN/BUG) that lead to OOPS/panic.

Reproducer signals seen in logs​

When the condition is hit, typical kernel logs (dmesg / journalctl -k) contain a stack trace showing attempts to unmap or migrate pages and a top‑level diagnostic like:
  • WARNING: CPU: ... at mm/rmap.c:try_to_unmap_one+...
  • Backtrace frames that include unmap_poisoned_folio, do_migrate_range, offline_pages, and reclaim/shrinker call stacks.
These traces were included verbatim in public advisories and are useful hunting indicators. Vendors and the NVD included the sample WARN trace as evidence of the failure mode.

What the upstream patch changes​

The upstream patch series (“mm: memory_failure: unmap poisoned folio during migrate properly”, v3) takes a surgical approach:
  • Build on TTU_HWPOISON, which the earlier commit 6da6b1d4a7df introduced to replace the older TTU_IGNORE_HWPOISON semantics, so that the decision about how to treat poison‑marked folios is centralized in the unmap/migration logic.
  • Move the policy that sets the TTU behavior from hwpoison_user_mappings into unmap_poisoned_folio so that both anonymous folios and pagecache folios receive consistent TTU updates during unmap and migration.
  • Remove a shadowed local variable and tidy call sites so the unmap flow consistently applies the flag and avoids emitting the WARN/OOPS trace.
The change is intentionally small and local to the mm/hwpoison and rmap/unmap code paths; it preserves higher‑level semantics while closing the window that produced inconsistent state. Advisories commonly cite commit 6da6b1d4a7df, which is the earlier commit that converted the TTU handling and whose behavior this fix corrects, not the fix itself. Vendors note the patch was included in stable kernel trees and vendor backports.

Impact and exploitability​

Practical impact​

  • Primary effect: availability loss (kernel WARN/OOPS/panic) — operations that hit the timing window can cause an OOPS that may crash or hang the host.
  • Secondary: the theoretical risk of memory‑corruption chains leading to privilege escalation is not supported by authoritative public evidence for this CVE at disclosure; vendors emphasize an availability‑first profile. Nevertheless, any kernel bug that mishandles page flags or mapping state should be treated seriously, because complex exploit chains sometimes exist, though none were publicly demonstrated for this issue.

Attack model and who should worry most​

  • Local or host‑adjacent only. An attacker must be able to run code on the host or inside a privileged guest that can exercise the affected reclaim/migration flows (for example by triggering offline/memory block actions, or by inducing hwpoison handling).
  • High‑priority environments: multi‑tenant cloud hosts, VM hypervisors, CI runners, container hosts and any infrastructure processing untrusted images or workloads that can manipulate memory semantics.
  • Lower‑risk: single‑user desktops rarely exercise the same complex reclaim/migration sequences required to reproduce the issue, unless specialized tools (userfaultfd, MINOR modes, KSM) are in use.
Vendors and trackers recorded differing CVSS scores (NVD’s vector emphasized an availability impact; some distributors recorded a lower numerical score reflecting local attack preconditions). Prioritize remediation by operational exposure, not by numeric CVSS alone.

Detection, telemetry and incident response​

Hunting signals (what to search for)​

  • Kernel logs that contain the exact WARN/OOPS frames: unmap_poisoned_folio, try_to_unmap_one, do_migrate_range, offline_pages, VM_BUG_ON_FOLIO text or similar rmap/migration traces.
  • Repeated OOPS/panic events correlated with processes that perform memory‑offlining, device_offline sysfs writes, or manual memory block offlines.
  • SIEM alerts for kernel OOPS frequency spikes on hosts that perform heavy page migration or that accept untrusted workloads.
Add these frame names as match patterns in your log collection rules (dmesg / journalctl -k) and create high‑severity alerts for hosts in multi‑tenant or production roles when such traces appear. Preserve crash dumps (kdump) and dmesg output before rebooting, because the stack trace is often the only forensic artifact.
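As a minimal sketch of such a rule, the function below filters kernel-log text for the frame names listed above. The helper name scan_kernel_log is hypothetical, and the pattern list is only what this article mentions; adjust it to match the traces your vendor's advisory actually shows.

```shell
#!/usr/bin/env bash
# Sketch: flag kernel-log lines that mention the frames named in this article.
# scan_kernel_log is a hypothetical helper; it reads log text on stdin.

# Frame names as described in the advisories summarized above.
PATTERN='unmap_poisoned_folio|try_to_unmap_one|do_migrate_range|offline_pages|VM_BUG_ON_FOLIO'

scan_kernel_log() {
    # -E: extended regex; -n: prefix each hit with its line number in the input.
    # Exit status is 0 when at least one line matches, 1 when none do.
    grep -nE "$PATTERN"
}

# Typical use on a live host (reading the kernel log may need root):
#   journalctl -k --no-pager | scan_kernel_log
#   dmesg | scan_kernel_log
```

Because the function only reads stdin, the same filter can be pointed at archived logs or at a log shipper's output. A match is a hunting lead, not proof of this specific bug: try_to_unmap_one and offline_pages also appear in unrelated traces.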

Example tactical detection steps​

  1. Run: sudo journalctl -k | grep -E 'unmap_poisoned_folio|try_to_unmap_one|do_migrate_range'
  2. If matched, collect full dmesg and kdump artifacts immediately:
    • sudo dmesg > /tmp/dmesg-cve-2025-21907.log
    • Save /var/crash or kdump images for post‑mortem.
  3. Isolate the host workload and schedule a reboot into a patched kernel once available.
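The collection step can be wrapped in a small function so that it runs the same way on every host. This is a sketch only: collect_evidence and its output layout are hypothetical, and /var/crash is assumed as the kdump target, which is a common default but configurable.

```shell
#!/usr/bin/env bash
# Sketch: preserve kernel-log evidence before a reboot.
# collect_evidence and the file names below are hypothetical, not a vendor tool.

collect_evidence() {
    local outdir="${1:-/tmp/cve-2025-21907-evidence}"
    mkdir -p "$outdir" || return 1

    # Record which kernel produced the traces.
    uname -r > "$outdir/kernel-release.txt"

    # Kernel ring buffer; may need root when kernel.dmesg_restrict=1.
    dmesg > "$outdir/dmesg.log" 2>&1 \
        || echo "dmesg unavailable or not permitted" >> "$outdir/notes.txt"

    # Kernel messages for the current boot, if the systemd journal is in use.
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -k -b --no-pager > "$outdir/journal-kernel.log" 2>&1 \
            || echo "journalctl failed" >> "$outdir/notes.txt"
    fi

    # List crash dumps without copying them (they can be very large).
    # Assumption: kdump writes to /var/crash on this host.
    if [ -d /var/crash ]; then
        ls -l /var/crash > "$outdir/var-crash-listing.txt" 2>&1 || true
    fi

    # Print the directory so callers can archive it.
    echo "$outdir"
}

# Example: sudo bash -c '. ./collect.sh; collect_evidence /root/evidence'
```

The function deliberately tolerates missing tools and permission errors so that a partial capture still succeeds; check notes.txt afterwards to see what could not be collected.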

Remediation and mitigation guidance​

The authoritative remediation is to install a kernel package that contains the upstream fix and reboot into the patched kernel. Kernel fixes require a reboot to activate. Vendors and distributions published advisories that map upstream commits to package versions; check your distro’s security tracker for the exact fixed package for your kernel line.

Immediate prioritized runbook (practical)​

  1. Inventory: enumerate kernel versions and identify at‑risk hosts.
    • uname -r
    • zgrep CONFIG_KSM /boot/config-$(uname -r) (if KSM is relevant)
    • Identify hosts that accept untrusted images or run untrusted tenant workloads.
  2. Consult your vendor advisory pages (Debian, Ubuntu, RHEL, SUSE, Amazon Linux, etc.) and map the upstream commit/patch to a kernel package version.
    • Confirm the kernel changelog or package metadata references CVE‑2025‑21907 or the upstream commit.
  3. Stage and deploy: apply vendor kernel updates in a test/pilot ring that mimics production memory pressure and migration activity.
  4. Reboot hosts into the patched kernel in a rolling or scheduled maintenance window.
  5. Post‑deployment: monitor kernel logs for residual OOPS or WARN traces.
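The inventory step above can be sketched as a small report function. The choice of CONFIG_MEMORY_FAILURE and CONFIG_MEMORY_HOTREMOVE is an assumption on my part: they are the build options that normally gate the hwpoison and memory-offline code, but the article's sources do not name them. The helper names config_has and inventory_report are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report the running kernel and whether the build options that
# (by assumption) gate the affected paths are enabled.
# config_has and inventory_report are hypothetical helper names.

config_has() {
    # $1: option name, $2: kernel config file (plain text or .gz)
    local option="$1" cfg="${2:-/boot/config-$(uname -r)}"
    [ -r "$cfg" ] || return 2
    case "$cfg" in
        *.gz) gzip -dc "$cfg" | grep -qx "${option}=y" ;;
        *)    grep -qx "${option}=y" "$cfg" ;;
    esac
}

inventory_report() {
    local cfg="${1:-/boot/config-$(uname -r)}" opt
    echo "kernel: $(uname -r)"
    # Assumed relevant options; not taken from the advisories.
    for opt in CONFIG_MEMORY_FAILURE CONFIG_MEMORY_HOTREMOVE; do
        if config_has "$opt" "$cfg"; then
            echo "$opt: enabled"
        else
            echo "$opt: not set or config unreadable"
        fi
    done
}

# Example: inventory_report
# Example with an explicit config file: inventory_report /proc/config.gz
```

Treat the output as triage input only. Whether a host is actually fixed depends on the vendor package mapping in step 2, not on these options.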
If you cannot patch immediately:
  • Restrict who can run untrusted workloads or mount untrusted images on critical hosts.
  • Move untrusted jobs off hypervisors or shared runners until patched.
  • Increase logging/alerting for the diagnostic traces described above and be prepared to remediate or reboot hosts that show evidence of the failure.

Distribution/backport status snapshot​

Multiple OS trackers and vendor advisories included the CVE entry and mapped fixes into stable kernel releases and vendor backports. Distribution pages (Ubuntu, Debian) and AWS advisory pages list the CVE and the remediation status relevant to each release; Amazon’s ALAS listing and other trackers show tabulated statuses per platform. Operators must confirm whether their vendor’s packaged kernel includes the upstream stable commit(s) before marking a host as patched.

Critical analysis: strengths of the response and residual risks​

Strengths​

  • The upstream fix is surgical and low‑risk: it corrects a consistency invariant rather than reworking large subsystems. That makes vendor backports simpler and lowers regression probability.
  • The public disclosure included concrete WARN/OOPS traces and commit references, enabling administrators to hunt and validate patches.
  • Multiple independent trackers (NVD, Debian, Ubuntu, AWS ALAS, Red Hat trackers) recorded and mirrored the same technical summary, improving confidence in the diagnosis and the fix path.

Residual caveats and operational risks​

  • Vendor/backport lag: not all vendors backport fixes at the same cadence. Some long‑lifecycle kernels may remain unpatched, or vendors may choose different backport approaches. Do not assume a kernel is fixed until the distribution/changelog explicitly lists the commit or CVE.
  • Local vector: while attack complexity is non‑trivial (requires local code execution or guest privileges), in multi‑tenant and CI environments that allow untrusted code the exposure is operationally significant.
  • Over-reliance on CVSS: numerical scores vary between trackers; prioritize remediation by exposure (multi‑tenant clouds > single desktops) rather than a single number.

Recommended checklist for sysadmins and cloud operators​

  • Inventory and triage:
      1. Run uname -r across the estate and map to vendor advisories that mention CVE‑2025‑21907.
      2. Flag multi‑tenant hosts, hypervisors, CI runners and any nodes that mount or process untrusted images.
  • Patch and reboot:
      1. Obtain vendor kernel packages that list the CVE or upstream commit and schedule reboots.
      2. Stage updates in a pilot ring that exercises heavy reclaim/migration flows (simulate memory pressure).
  • Short‑term compensations when patching is delayed:
      1. Prevent untrusted users/processes from mounting images or creating workloads on at‑risk hosts.
      2. Isolate image processing to dedicated, non‑critical hosts that can be patched or reprovisioned easily.
  • Post‑patch verification:
      1. Confirm the running kernel version and package changelog reference the fix.
      2. Monitor kernel logs for OOPS/WARN traces post deployment.
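For the changelog check in the post‑patch verification step, a minimal sketch is a filter that looks for the CVE identifier or the upstream patch subject in changelog text. The helper name changelog_mentions_fix is hypothetical, and the package names and changelog paths in the usage comments are assumptions that vary by distribution and release.

```shell
#!/usr/bin/env bash
# Sketch: check changelog text (on stdin) for a reference to the fix.
# changelog_mentions_fix is a hypothetical helper name.

changelog_mentions_fix() {
    # Match the CVE identifier or the upstream patch subject, case-insensitively.
    grep -qiE 'CVE-2025-21907|update ttu flag inside unmap_poisoned_folio'
}

# Usage (package names and paths are assumptions; adjust for your distribution):
#   rpm -q --changelog "kernel-core-$(uname -r)" | changelog_mentions_fix && echo "fix listed"
#   zcat "/usr/share/doc/linux-image-$(uname -r)/changelog.Debian.gz" \
#       | changelog_mentions_fix && echo "fix listed"
```

A miss is not proof that the kernel is unpatched: some vendors trim changelogs or record fixes only in their security trackers, so fall back to the vendor advisory when the filter finds nothing.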

Broader context and what defenders often get wrong​

  • Treat kernel availability bugs seriously even if exploitability requires local actions: in cloud contexts local equals dangerous because tenants share physical hosts.
  • Do not assume the absence of public exploit code means the bug is harmless: the WARN traces published with the advisories show the condition is reachable and can cause host outages.
  • When in doubt about whether a distribution package contains the upstream fix, inspect the package changelog or contact your vendor; do not rely solely on automated scanner output without verifying the package mapping.

Conclusion​

CVE‑2025‑21907 is a focused but important Linux kernel correctness fix: moving the TTU policy into unmap_poisoned_folio eliminates a timing/invariant failure that caused WARN/OOPS traces and could produce host availability loss in environments that exercise page migration and hwpoison handling. The upstream approach is appropriately surgical, making vendor backports practical; however, operators must treat the issue as a high priority for multi‑tenant hosts and systems that run untrusted workloads. The recommended course is straightforward and decisive: identify affected hosts, apply vendor kernel updates that include the upstream commits, and reboot into the patched kernel while increasing detection and isolation protections until the patch is deployed.
Source: MSRC Security Update Guide - Microsoft Security Response Center
 
