A subtle race in the Linux kernel’s POSIX CPU timer handling — tracked as CVE-2025-38352 — was fixed upstream in July 2025, when maintainers accepted a small, surgical change that prevents an exiting task from being reaped while POSIX CPU timer expiry handling is in flight. The flaw could lead to incorrect timer-deletion behavior and to hangs or crashes in pathological workloads, and it was rated a high-impact local vulnerability by multiple vendors; operators should treat it as an availability and stability risk and prioritize applying vendor patches or backports.
Source: MSRC Security Update Guide - Microsoft Security Response Center
Background / Overview
POSIX CPU timers are the kernel infrastructure that implement per-process and per-thread CPU-time timers (the clocks used by timer_create(2) with CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID, and related interfaces). The kernel several years ago evolved optimistic fast paths and lock patterns to make timer expiry handling scale on SMP systems. Those performance-oriented changes create narrow windows where the timer expiry path (which can be invoked in interrupt context) races with timer deletion and process exit code paths, because those operations touch the same timer and task bookkeeping under different locks. CVE-2025-38352 is one of these edge-case races: when a non-autoreaping task has passed exit_notify() and an IRQ context runs handle_posix_cpu_timers(), the task may be reaped by its parent or a debugger right after unlock_task_sighand(), and a simultaneous posix_cpu_timer_del() call may then fail to detect that a timer is firing — producing incorrect behavior that can lead to crashes, hangs, or data-structure corruption.

In plain language: a task that is in the process of exiting can be observed in an inconsistent state by the CPU-timer expiry code; if a concurrent timer deletion races with that, the kernel could miss the firing flag on a timer or fail to take required locks, which opens windows for use-after-free or logic errors in timer bookkeeping. Upstream’s fix adds an explicit check of the task’s exit state in run_posix_cpu_timers() so that expiry handling will bail out when the target task is already on the path to being released; this avoids the narrow TOCTOU (time‑of‑check, time‑of‑use) window.
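For orientation, the user-facing side of this machinery can be sketched in a few lines of C. This is a minimal illustration, not a reproducer; the function name create_and_delete_cputime_timer is an invented label for the example:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <time.h>

/* Illustrative sketch: create and immediately delete a per-thread CPU-time
 * timer. SIGEV_NONE keeps the example silent; no signal is ever delivered. */
int create_and_delete_cputime_timer(void)
{
    timer_t tid;
    struct sigevent sev = { .sigev_notify = SIGEV_NONE };

    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &tid) != 0)
        return -1;              /* clock unsupported or subsystem unavailable */
    return timer_delete(tid);   /* deletion path that reaches posix_cpu_timer_del() */
}
```

On glibc before 2.34 the POSIX timer functions live in librt, so the example needs `-lrt` at link time.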
What the bug is — technical anatomy
Where the race lives
The problematic interaction is between two kernel code paths:
- handle_posix_cpu_timers() / run_posix_cpu_timers(): the code that iterates timers and fires them; it may be invoked from interrupt context (timer expiry) or from a soft/task-work context to run per-task timer callbacks.
- posix_cpu_timer_del(): the code path that deletes a CPU timer (for example when a user calls timer_delete(), when a process exits, or when timers are reaped during exec/exit flows).
The short exploitability story
This is a local race condition rather than a remote, network-facing bug. It requires a combination of:
- The POSIX CPU timers subsystem being used (i.e., the code paths for these timers compiled in),
- A workload where an IRQ or timer expiry triggers handle_posix_cpu_timers() against a task that is in the process of exiting, and
- A concurrent posix_cpu_timer_del() invocation racing with the expiry path at the narrow window after unlock_task_sighand().
The upstream fix — what changed and why it matters
Upstream maintainers implemented a minimal change: add an explicit check for tsk->exit_state in run_posix_cpu_timers() and return early if the task is on the exit path. The patch inserts a small guard that ensures expiry handling does not proceed for a task that can be released at any moment, eliminating the TOCTOU race that allowed posix_cpu_timer_del() to miss a firing timer or to fail locking operations. The change is compact (a handful of lines) and was accepted into stable trees and vendor backports as a low-risk, targeted fix.

Two implementation details to note:
- The fix is defensive: even on configurations where CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y (where exit_task_work() runs before exit_notify()), the check still makes sense because task_work_add(&tsk->posix_cputimers_work.work) could fail under some conditions. That is, it covers both set-ups and remains harmless where the configuration already prevents the race.
- The patch was cherry-picked into distribution stable update series (for example Canonical’s SRU patchset) and referenced by vendor advisories — a sign that maintainers considered it safe and urgently backportable.
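The shape of the guard can be illustrated with a simplified user-space model. This is emphatically not the kernel code: struct task, model_run_posix_cpu_timers, and the field semantics are stand-ins invented for this sketch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the task bookkeeping the real code consults. */
struct task {
    int exit_state;     /* nonzero once the task has passed exit_notify() */
    int timers_fired;   /* stand-in for timer-expiry side effects */
};

/* Model of the fixed run_posix_cpu_timers(): bail out before touching any
 * timer state when the task may be reaped at any moment. */
static bool model_run_posix_cpu_timers(struct task *tsk)
{
    if (tsk->exit_state)
        return false;        /* the added guard: skip expiry handling */
    tsk->timers_fired++;     /* stand-in for walking and firing expired timers */
    return true;
}
```

The point of the model is the ordering: the exit-state test happens before any timer bookkeeping is touched, so a task on the exit path never enters the window that posix_cpu_timer_del() could race with.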
Impact assessment: confidentiality, integrity, availability
- Availability: the primary real-world impact is availability. Race conditions in kernel timer and task lifetime code commonly manifest as hung CPUs, RCU stalls, kernel WARN/OOPS messages, and in severe cases hard lockups and panics. Vendors mapped the vulnerability to availability and to potential integrity/confidentiality issues in worst-case exploitation scenarios. The Amazon Linux advisory lists the severity as “Important” and gives a CVSSv3 base score consistent with significant local impact.
- Integrity / Confidentiality: while the bug is not a straight-line remote privilege escalation, kernel memory corruption and TOCTOU races can sometimes be weaponized by skilled attackers with local access to achieve privilege escalation or arbitrary kernel-memory manipulation. Vendor write-ups treat this as a high-severity kernel bug because of the class it belongs to (task/timer lifetime races). Rapid7 and Red Hat trackers assigned high-impact classifications reflecting the broad possible consequences.
- Attack vector / likelihood: the vector is local (AV:L) and timing-sensitive. That reduces widespread opportunistic exposure compared to network-facing bugs, but the public CVE listings and vendor advisories make the fix high-priority for multi-tenant systems, CI runners, build servers, and environments that allow untrusted local workloads. Several vulnerability databases track the CVE as a Known Exploited Vulnerability (KEV); operators should treat it as a prioritized remediation if relevant vendor advisories apply to their environment.
Who is affected
- Most general-purpose Linux distributions that include the POSIX CPU timers code in their kernels are potentially affected. Vendors including Red Hat, Debian, Canonical/Ubuntu, SUSE, and the Amazon Linux team have recorded the CVE and issued advisories or fixes for relevant kernel packages. Affected kernel package versions vary by distro and branch; operators must consult their distribution’s advisory to locate the exact fixed package for their kernel branch.
- Embedded devices, vendor appliances, and long‑lifecycle systems are at disproportionate risk. Those devices often run vendor-pinned kernels that are slow to receive upstream fixes or whose vendors may not provide a backport. The small size of the upstream fix makes it backportable, but the long tail remains an operational reality for many organizations.
- Config-dependent environments: systems built with CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y are less exposed in practice because exit_task_work() runs earlier and already addresses the ordering. Still, the upstream patch is harmless in those builds and was applied universally in stable trees.
Detection and indicators
There is no simple, universal telemetry that says “this CVE was triggered.” Instead, operators should watch for the operational signatures that commonly accompany timer/lifetime races:
- Kernel WARN/OOPS reports or stack traces referencing posix-cpu-timers, run_posix_cpu_timers(), handle_posix_cpu_timers(), posix_cpu_timer_del(), or release_task() and unlock_task_sighand().
- RCU stall messages and hard lockup messages that mention CPU stalls during timer expiry or task exit flows.
- Reproducible crashes or hangs when running workloads that create and delete timers rapidly while forking/exiting threads.
- Increased rate of kernel oops events or kdump captures correlated with workload churn on CI runners, multi-tenant platforms, or build machines.
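A small triage helper along these lines can flag candidate log lines. The function name and the exact indicator list are assumptions drawn from the symbols mentioned above, not part of any tooling:

```c
#include <stdbool.h>
#include <string.h>

/* Symbols implicated in this race, taken from the indicator list above. */
static const char *indicators[] = {
    "run_posix_cpu_timers", "handle_posix_cpu_timers",
    "posix_cpu_timer_del", "release_task", "unlock_task_sighand",
};

/* Return true if a kernel log line mentions any implicated symbol. */
static bool line_is_suspicious(const char *line)
{
    for (size_t i = 0; i < sizeof indicators / sizeof indicators[0]; i++)
        if (strstr(line, indicators[i]) != NULL)
            return true;
    return false;
}
```

Feeding dmesg or journalctl output through a filter like this is a cheap way to route timer-lifetime oopses to a high-priority alert queue.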
Recommended mitigation and remediation playbook
- Inventory and prioritize
- Identify hosts that run kernels from affected series (consult the vendor advisories for the exact kernel package ranges). Prioritize multi-tenant systems, cloud guest images, CI runners, and any hosts that allow untrusted users to create processes/timers. Use your configuration management database (CMDB) and package-management queries to build this list.
- Patch promptly
- Apply vendor-supplied kernel updates that backport the upstream fix. Major vendors have released updated packages and SRUs; where livepatch offerings exist, evaluate whether the vendor’s livepatch covers this specific commit and can be applied without a reboot. Canonical’s SRU, Amazon’s ALAS entries, Red Hat advisories and SUSE advisories reference patched packages. Test the vendor package in staging before broad rollout.
- Short-term containment (if you cannot patch immediately)
- Restrict local code execution: limit which users or services can run untrusted binaries on high-risk hosts.
- Harden process controls: disable unnecessary ptrace/debugging access for untrusted users and restrict capabilities that allow process reaping or attaching where possible.
- Increase monitoring and logging: ensure kernel oopses and RCU stall messages are centrally collected and alert on increases in frequency.
- Avoid running workloads that aggressively create/delete CPU timers under heavy process churn on unpatched systems; move such workloads into patched hosts.
- For appliances / embedded devices
- Contact the device vendor for a signed firmware/kernel update schedule. If vendor updates are not available and the device is critical, consider isolation, segmentation, or replacement plans. Vendors sometimes provide mitigations or custom backports for long‑lifecycle devices.
- Post-patch validation
- Run churn tests that mimic production load (rapid process/thread creation and CPU-timer churn) in a staging environment to validate the fix.
- Reproduce previously observed kernel oopses (in a controlled test lab) to confirm the issue no longer manifests.
- Consider configuration hardening
- If you manage custom kernel builds, consider building with CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y where appropriate (this configuration reduces certain race windows by invoking exit_task_work() earlier). Note that kernel config changes require rebuilds and thorough validation for compatibility and performance.
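A churn test of the kind described under post-patch validation might be sketched as follows. Thread counts, iteration counts, and the function names are arbitrary choices for illustration; a real stress harness would run many threads concurrently and for far longer:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <time.h>

/* Each thread repeatedly creates, arms, and deletes a per-thread CPU timer,
 * then exits, so timer deletion and thread exit overlap with pending expiry. */
static void *churn_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100; i++) {
        timer_t tid;
        struct sigevent sev = { .sigev_notify = SIGEV_NONE };
        struct itimerspec its = { .it_value = { .tv_nsec = 1 } };

        if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &tid) != 0)
            return (void *)1;
        timer_settime(tid, 0, &its, NULL);  /* arm with a tiny expiry */
        if (timer_delete(tid) != 0)         /* deletion races with expiry */
            return (void *)1;
    }
    return NULL;
}

/* Run nthreads short-lived churn threads; returns 0 if all succeeded. */
int run_timer_churn(int nthreads)
{
    for (int n = 0; n < nthreads; n++) {
        pthread_t t;
        void *ret;

        if (pthread_create(&t, NULL, churn_thread, NULL) != 0)
            return -1;
        pthread_join(t, &ret);
        if (ret != NULL)
            return -1;
    }
    return 0;
}
```

On a patched kernel this should complete cleanly; on glibc before 2.34, link with `-lpthread -lrt`. As the article notes, run reproduction attempts only in a staging lab, never on production hosts.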
Why this change is low‑risk but important to deploy
The upstream patch is intentionally narrow: it adds a single guard and returns early when the target task is on the exit path. That makes the change easy to review, backport, and test; it avoids complex lock rework or heavy refactors. Because the fix is localized and conservative rather than adding aggressive locking, maintainers backported it quickly into stable series and vendors incorporated it into their SRUs. The practical benefit is removing a rare but real stability hazard without materially changing runtime behavior for the vast majority of workloads.

Residual risks and follow-on considerations
- Config variability: the fix is safe and sensible even when CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, but environments that differ in kernel config or vendor patches may behave differently. Always verify the exact patch set applied to your kernel branch.
- Long‑tail devices: embedded and vendor-pinned kernels may never receive a backport. The operational risk here is the usual "long tail" problem: devices that cannot be patched must be inventoried, monitored, and isolated where necessary.
- Other timer/task races: this CVE fixes a specific TOCTOU race. Kernel timer and task-lifecycle code has a long history of subtle interactions; operators should treat this fix as one patch in a broader hygiene program that includes aggressive kernel updates, sanitizer testing, and early staging. Failing to maintain kernel patch discipline invites other, unrelated races to surface. Vendor advisories often list multiple, related fixes around similar subsystems — remediation should be holistic.
- Livepatch coverage: livepatch solutions vary by vendor and kernel branch. Verify whether a vendor livepatch covers this commit before relying on it as the sole remediation path. Some vendors included livepatches; others required full kernel upgrades.
Practical detection recipe and forensics checklist
- If you see a kernel oops that includes functions like run_posix_cpu_timers(), handle_posix_cpu_timers(), posix_cpu_timer_del(), release_task(), unlock_task_sighand(), or task_work_add-related errors, treat it as a high-priority incident and gather:
- The kernel oops log and call trace (dmesg, journalctl, or kdump output).
- The kernel version string and CONFIG options (from /proc/config.gz or the build tree).
- The process tree and workload pattern that triggered the event (what processes were creating/deleting timers; any use of debuggers / ptrace).
- Reproduction steps in a staging lab (do not reproduce on production hosts).
- Share findings with your distro vendor or upstream maintainers if you suspect the patch did not resolve the issue on your specific kernel branch. Vendor SRUs and bug trackers will often request logs and reproduction steps.
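Capturing the kernel release string for such a report can be done programmatically; the helper name get_kernel_release is illustrative:

```c
#include <string.h>
#include <sys/utsname.h>

/* Copy the running kernel's release string (the same value `uname -r`
 * prints) into buf, e.g. for attachment to an incident report. */
int get_kernel_release(char *buf, size_t len)
{
    struct utsname u;

    if (len == 0 || uname(&u) != 0)
        return -1;
    strncpy(buf, u.release, len - 1);
    buf[len - 1] = '\0';
    return 0;
}
```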
Cross‑verification and provenance
This article’s technical summary is derived from upstream commit diffs and vendor advisories. The NVD entry documents the vulnerability description and reasoning used to prepare vendor advisories. Canonical/Ubuntu published a stable SRU patch that includes the exact diff showing the tsk->exit_state guard; Amazon Linux, Red Hat, SUSE, Debian and other vendors recorded the CVE in their trackers and published fixes or advisories mapping the vulnerability to fixed package versions. The upstream fix references the original maintainers and the submitter; reporting and patch authorship credits appear in the SRU and the CVE metadata. These independent sources (upstream patch, NVD, multiple vendor trackers) converge on the same technical root cause and remediation, which strengthens confidence in the analysis.

Recommended prioritized checklist for SREs and system administrators
- Immediately inventory kernels and mark systems that match the affected families (consult distro advisories).
- Apply vendor-supplied kernel updates in staging and validation lanes; schedule a rolling rollout for production. Use vendor livepatches only if they explicitly cover this commit and are validated.
- For hosts that cannot be patched promptly: increase monitoring for timer/RCU-related kernel oops and isolate high-risk workloads.
- For cloud images and VM templates: rebuild and redeploy images with updated kernels to avoid reintroducing vulnerable instances.
- For appliances and embedded devices: engage vendors for a firmware/kernel update plan; where unavailable, create an isolation or replacement plan.
- Document the incident in your change-control and SBOM systems; add the fixed kernel version(s) to your golden images and CI pipeline baselines.
Conclusion
CVE-2025-38352 is a textbook example of how tiny timing windows in low-level kernel code can have outsized operational consequences. The bug fixed — a TOCTOU race between POSIX CPU timer expiry handling and timer deletion during task exit — is narrow and subtle, but its manifestations are concrete: kernel hangs, oopses, and potential integrity failures in the worst case. The upstream remedy is small, safe, and backportable: add an explicit exit-state check to abort expiry handling for tasks already on the exit path. That fix has landed upstream and is referenced in vendor advisories; operators should treat the CVE as high-priority for kernel updates, especially in multi-tenant and high-churn environments. Rapid, disciplined patching and staging validation remain the best practical defenses against this class of kernel correctness faults.