Linux Kernel KSM Scan Fix: Range-Walk Patch Cuts DoS CPU Spikes

The Linux kernel received a targeted correction that removes a surprising—and in some workloads, catastrophic—inefficiency in KSM scanning: scan_get_next_rmap_item now uses a range-walk helper to skip large unmapped holes instead of walking every address, fixing a case where ksmd could burn CPU for hours scanning terabytes of empty virtual address space.

Background

Kernel Samepage Merging (KSM) is a longstanding Linux memory-management feature that searches for identical anonymous pages across processes and consolidates them into a single copy-on-write page to save physical memory. KSM's scanner, the ksmd kernel thread, periodically walks memory mappings (VMAs) to find candidate pages for deduplication. The routine at the center of this change, scan_get_next_rmap_item, was responsible for walking page addresses in a VMA to locate mergeable pages.

That per-address walk behaved acceptably in the common case of densely populated VMAs, but it becomes catastrophically inefficient when a VMA is extremely sparse: large address ranges with very few pages actually mapped. The problem was highlighted by a simple, deterministic reproduction: a process creates a huge mapping (for example, 32 TiB via mmap with MAP_NORESERVE) and touches a single page. When KSM is asked to scan that mapping (madvise with MADV_MERGEABLE), the legacy scan_get_next_rmap_item would iterate over every page address in the 32 TiB range rather than jumping over the unmapped regions. The result: ksmd consumed 100% CPU for prolonged periods while deduplicating effectively nothing.

The kernel community responded with a surgical patch that replaces the per-address lookups with a range walker driven by walk_page_range, allowing ksmd to skip large unmapped holes.

What changed: technical summary

The root cause, in plain terms

The old scan implementation used a per-address approach—a loop that advanced one page address at a time and checked whether that address had a mapped page suitable for merging. That model is simple but oblivious to VMA sparsity: it performs the expensive rmap/page lookup for every page-sized address within the VMA’s bounds, even when most addresses are unmapped. In scenarios with very large (but mostly unpopulated) mappings, this produces an enormous number of wasted lookups.

The fix

The recent patch swaps the per-address iteration for a range-walking approach using walk_page_range (and the associated range-walker callbacks). Instead of checking page by page, the range walker traverses the process page tables and can skip entire unmapped regions in one step. The new path inspects contiguous mapped ranges and only visits addresses that actually have a folio/pmd/pte entry to inspect, drastically reducing unnecessary work. This preserves KSM's semantics (identifying mergeable pages) while making scan time proportional to the amount of mapped memory rather than to the virtual address span.
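Conceptually, the new path hands a set of callbacks to the kernel's generic page-table walker. The fragment below is a simplified, kernel-style sketch (pseudocode; it will not compile outside a kernel tree) modeled on the upstream walk_page_range/mm_walk_ops interface from include/linux/pagewalk.h. The ksm_* names and the callback body are illustrative, not the actual patch:

```c
/* Illustrative sketch only; not the upstream code. */
static int ksm_pte_entry(pte_t *pte, unsigned long addr,
                         unsigned long next, struct mm_walk *walk)
{
    /* Invoked only where a page-table entry actually exists:
     * examine the page and feed it to KSM's rmap_item machinery. */
    return 0;
}

static const struct mm_walk_ops ksm_walk_ops = {
    .pte_entry = ksm_pte_entry,
    /* No callback is registered for holes: ranges with no page
     * tables are skipped by the walker at pgd/p4d/pud/pmd
     * granularity, which is what makes sparse VMAs cheap to scan. */
};

static void ksm_scan_vma_sketch(struct vm_area_struct *vma, void *priv)
{
    /* Caller holds mmap_lock, as walk_page_range requires. */
    walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
                    &ksm_walk_ops, priv);
}
```

The design point is that hole-skipping falls out of the walker itself: a 32 TiB untouched region has no page tables allocated, so the walk steps over it at the highest table level instead of issuing one lookup per page.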

Patch lineage and acceptance

The change was submitted to the kernel memory-management mailing lists as a patch series and iterated across multiple versions (v2, v3, v4) to handle corner cases such as transparent hugepages (THPs) while keeping the changes minimally invasive. The patch landed in upstream stable trees and has been imported into vulnerability trackers and distro advisories as the canonical upstream remediation. Kernel reviewers categorized the patch as low-risk and suitable for stable backporting because it is an optimization and correctness improvement that does not alter KSM's observable behavior except for performance.

Reproducing the behavior: the 32 TiB test

The public reproduction used in the patch discussion is intentionally simple and revealing:
  • Create a 32 TiB non-reserved anonymous mapping via mmap with MAP_NORESERVE.
  • Touch a single byte within that mapping to create one anon_vma/mapped page.
  • Enable MADV_MERGEABLE on that mapping and start ksmd.
On affected kernels (those without the range-walk patch), ksmd will scan the whole 32 TiB address range page-by-page, driving CPU to 100% for protracted periods while merging nothing of material value. With the patched behavior, ksmd quickly finds and processes the single mapped page and skips the 32 TiB of unmapped holes, resulting in a scan that completes in seconds and consumes very little CPU. This clear, experimental reproduction formed much of the impetus for the change.

Impact and risk analysis

Practical impact: availability and resource waste

This issue is fundamentally a correctness/performance bug rather than a memory-corruption or privilege-escalation vulnerability. The immediate, observable impact is:
  • Excessive CPU consumption on systems that enable KSM and scan large, sparse VMAs.
  • Potential service degradation and higher latencies for workloads co-located on a host where ksmd is monopolizing CPU.
  • In multi-tenant or cloud contexts, a malicious or buggy tenant could intentionally create sparse huge mappings to cause host resource exhaustion, producing a reliable local denial-of-service (DoS) vector.

Exploitability and attack surface

  • Attack vector: local process or guest with the ability to create large sparse VMAs and call madvise(MADV_MERGEABLE), thereby causing ksmd to scan the range.
  • Remote exploitation: not applicable by itself—there’s no remote network vector that triggers ksmd scanning without local code execution or guest/tenant ability to influence the host’s memory operations.
  • Privilege escalation: none documented. The defect does not permit memory corruption or code execution; it is a performance/availability issue.

Who should care most

  • Cloud hypervisor hosts and multi-tenant VPS providers where untrusted tenant code can create large sparse mappings and request MADV_MERGEABLE.
  • CI/CD runners, shared build servers, and container hosts that accept untrusted images or code.
  • Desktop systems that enable KSM are lower risk unless they run workloads that intentionally create huge sparse mappings.

How vendors have cataloged the issue

This change has been imported into public vulnerability databases and vendor trackers as CVE-2025-68211 with a description that focuses on the KSM scanning inefficiency and the switch to walk_page_range. OSV and NVD show the same summary and link the issue back to the upstream commits; several distro trackers (for example SUSE) have recorded the issue and mark it as resolved once the upstream patch is present in the kernel package or vendor backport tree. Because the remediation is a kernel source change, distribution package maintainers will either ship kernels containing the upstream commit or provide a stable backport for existing kernel-release branches. Administrators must consult their distribution security advisories for exact package mappings.

Remediation and operational guidance

Immediate actions (priority order)

  • Patch the kernel: install a kernel that contains the upstream walk_page_range patch for KSM (or a vendor backport). This is the definitive fix—ksmd behavior will be corrected once the running kernel is updated and the host is rebooted into the new image.
  • Monitor ksmd CPU: while scheduling updates, configure host monitoring and alerts for ksmd CPU usage spikes. Use standard host telemetry (top/htop, systemd-cgtop, ps with CPU counters, Prometheus node exporter) to identify anomalous ksmd consumption. High, persistent ksmd CPU usage correlated with large MADV_MERGEABLE activity is a strong indicator that this problem is being hit.
  • Limit MADV_MERGEABLE exposure: where possible on multi-tenant hosts, restrict the ability of untrusted workloads to call madvise or create extremely large mappings. This may require container runtime adjustments, RBAC policies, or workload placement changes.
  • Short-term mitigations: if immediate patching is impossible, consider disabling KSM on critical hosts (echo 0 > /sys/kernel/mm/ksm/run) or isolating untrusted workloads to dedicated nodes until a kernel update can be deployed. Disabling KSM removes the deduplication benefits but also removes the attack surface for this local DoS vector.

How to verify a kernel is patched

  • Confirm kernel package changelog or vendor advisory mentions CVE-2025-68211 or references the upstream commit(s).
  • Test on a non-production host by running the 32 TiB reproduction (or a scaled-down version) and observing ksmd CPU: patched kernels should detect only the mapped page and skip over the huge unmapped regions quickly, while unpatched kernels will show sustained high CPU. Use controlled, careful testing—don’t perform destructive experiments on production machines.

Why the patch is safe and why it landed in stable trees

The change is algorithmic and performance-oriented: it replaces a brute-force per-address check with a walker that leverages existing page-range traversal primitives. Maintainers judged the edit to be low-risk because it preserves logical behavior (the same pages are considered for merging) while dramatically improving efficiency on pathological inputs. Because of that minimalism and the clear operational benefits, the patch was accepted for stable backports so that distributions can ship fixes without refactoring larger subsystems. Multiple iterations of the patch addressed THP handling and minimized code churn, further reducing regression risk.

Detection and telemetry: what to hunt for

  • Kernel-level signals: unexpected ksmd behavior is visible via CPU profiles and kernel logs (dmesg/journalctl). There are no specific exploit signatures beyond anomalous ksmd resource consumption.
  • Host telemetry: watch for sustained ksmd CPU usage, unexplained load spikes, and correlated high user-mode and system-mode CPU on hosts with KSM enabled.
  • Behavioral detection: repeated occurrences linked to processes that create giant MAP_NORESERVE mappings and call madvise(MADV_MERGEABLE) are the primary operational indicator.
Implementing these detection measures complements the remediation checklist and helps prioritize updates for hosts that are genuinely impacted rather than universally treating all machines as equal risk.

Cross-checks and verification

Key claims in this analysis were verified against multiple independent sources:
  • The NVD entry and OSV show the CVE text and the reproduction scenario (32 TiB test), validating the technical description and the remediation approach.
  • The kernel mailing-list patch discussion and v4 patch series contain the code-level description and the rationale for switching to walk_page_range, including handling for THPs and minimal change strategy. These confirm the implementation detail we describe.
  • Vendor trackers and CVE aggregators echo the same characterization and recommended remediation path—install a kernel that contains the upstream patch. This cross-verification confirms the operational guidance above.
Where vendor package mapping or exact KB/package names are needed, those are intentionally left to distribution advisories: backport status varies by vendor and kernel branch, so an operator should rely on the distro’s security tracker to identify the exact package version to deploy.

Practical checklist for administrators

  • Inventory: identify hosts with KSM enabled.
    • Command examples: zgrep CONFIG_KSM /boot/config-$(uname -r) (or /proc/config.gz), and inspect /sys/kernel/mm/ksm/run for the runtime state.
  • Prioritize: schedule updates for hosts that:
    • run multi-tenant workloads, CI runners, or untrusted code;
    • show high ksmd CPU usage or host load correlated with memory-management activity.
  • Patch: obtain vendor kernel updates or upstream stable backports containing the walk_page_range patch and deploy them in rolling fashion.
  • Validate: after reboot, re-run selective tests to ensure ksmd no longer consumes excessive CPU on sparse huge mappings.
  • Mitigate: if patching is delayed, temporarily disable KSM on sensitive hosts and isolate untrusted tenants until remediation is complete.

Caveats and unverifiable claims

  • Distribution backports: the presence and timing of stable backports depend on individual vendor policies and release engineering. Administrators must verify with their distribution’s security advisory. The upstream patch exists; the vendor mapping is outside the kernel tree and therefore not universally predictable.
  • Exploit narratives: some write-ups might dramatize performance bugs as “exploitable.” This change fixes a local DoS/resource-abuse vector; there is no publicly documented privilege-escalation or remote-execution proof-of-concept associated with this CVE at the time of publication. Treat any stronger claims as unverified until reproducible exploit code is published by reputable sources.

Conclusion

CVE-2025-68211 is a focused, practical improvement to KSM scanning that corrects an inefficiency with real-world consequences: unpatched kernels could allow ksmd to waste massive CPU resources by walking huge, mostly unmapped VMAs page-by-page. The kernel community’s fix—replacing the per-address logic with a range-walker using walk_page_range—is small, low-risk, and effective. Operators on multi-tenant, cloud, or CI infrastructure should treat this as a priority patch for availability reasons, verify vendor package mappings for their distro, monitor ksmd CPU usage, and apply mitigations where immediate patching is not possible.
Source: MSRC Security Update Guide - Microsoft Security Response Center
 
