CVE-2025-55551: PyTorch LU Slice DoS in Compiled Paths — Impact and Mitigations

An exploitable defect in PyTorch's linear algebra implementation, tracked as CVE-2025-55551, allows attackers to trigger a denial-of-service (DoS) condition by slicing the output of torch.linalg.lu in PyTorch v2.8.0. The problem arises in compiled execution paths (Inductor / torch.compile) and has been confirmed in public issue threads and multiple vulnerability trackers, leaving many packaged distributions and container images flagged as vulnerable pending vendor fixes.

Background / Overview

PyTorch is one of the most widely used machine-learning frameworks for research and production. The project introduced program compilation backends (notably the Inductor backend exposed through torch.compile) to accelerate models, but those compiled code paths change how operators are represented and combined at runtime. A mismatch between how torch.linalg.lu behaves in eager mode and how it behaves when compiled has produced a failure mode: attempting to slice the tuple-like return values of torch.linalg.lu inside compiled models can raise unexpected exceptions or drive resource exhaustion, permitting a remote attacker who can submit code or models to cause process crashes or prolonged hangs. The original reproducer and discussion are in the PyTorch issue tracker.

Security databases and Linux distribution trackers have consolidated the report under the identifier CVE-2025-55551 and flagged PyTorch releases up to and including 2.8.0 as affected. Public CVE aggregators and vendor trackers assign the vulnerability a high availability impact (CVSS v3.1 commonly reported as 7.5/High), with the core weakness classed as uncontrolled resource consumption (CWE-400).

What exactly is vulnerable?

The API and the execution modes

torch.linalg.lu performs an LU factorization and returns a named tuple (P, L, U) that users often unpack or slice. In eager (normal, interpreter-driven) execution the return values and slicing behave as users expect. When models are compiled with torch.compile — and specifically when the Inductor backend is used — the compiler transforms operations and builds an alternative execution graph. The issue arises when the compiled graph encounters code that slices the output of torch.linalg.lu (for example, P, L = torch.linalg.lu(x)[:2]); under certain compiled paths this either raises an internal TypeError or consumes resources in an unbounded way, causing worker processes to crash or hang. The original report includes a minimal repro that demonstrates the failure in aot/aot_eager/Inductor backends while eager mode succeeds.
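A minimal sketch of the failing pattern, adapted from the description above (the authoritative reproducer is in the PyTorch issue thread; tensor shapes here are illustrative):

```python
import torch

def factor(x):
    # Slice the tuple-like return of torch.linalg.lu -- the exact
    # pattern reported to fail under compiled execution.
    return torch.linalg.lu(x)[:2]

x = torch.randn(4, 4)
P, L = factor(x)  # eager mode works: permutation and unit lower-triangular factors
print(P.shape, L.shape)

# On affected builds (<= 2.8.0), compiling the same function is reported
# to raise an internal TypeError or hang instead of returning (P, L):
# compiled = torch.compile(factor, backend="inductor")
# compiled(x)
```

Running the eager path succeeds; the commented-out compiled call is where affected builds are reported to misbehave.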

Technical classification

  • Primary impact: Denial of Service (availability). The defect either causes immediate process termination or sustained resource consumption that makes the affected component unavailable.
  • Weakness family: CWE‑400 — Uncontrolled Resource Consumption (resource exhaustion) and implementation mismatch between compiled and eager semantics.
  • Attack vector: Network / Remote (when attacker can supply models, code, or data that triggers the compiled path), or local (in single-tenant setups where only trusted code runs, the risk is smaller but not zero). Many trackers and distributors flag the attack complexity as low and privileges as none for scenarios where untrusted models or code can be compiled and executed.

Evidence and cross-checks

Public evidence for the vulnerability is straightforward and replicable:
  • The official PyTorch issue — filed and triaged as high priority — includes a small code sample that reproduces the problem when the model is compiled with Inductor while eager execution succeeds. That thread is the canonical developer discussion and repro.
  • National and public vulnerability trackers (NVD / MITRE entries) list CVE‑2025‑55551, describe the torch.linalg.lu slice issue, and map the affected versions (<= 2.8.0). These entries have been consumed by downstream scanners and vendor trackers.
  • Linux distribution security trackers (Debian, Ubuntu) and multiple vulnerability databases (Tenable, Snyk, CVE aggregators) list the CVE with a high availability impact and indicate that, at the time of publication, no upstream packaged fix has been universally published by all downstream vendors. This means distributions are still evaluating or awaiting backports.
Where vendors or distributions have published status, they generally recommend updating to a patched PyTorch release when one becomes available and, until then, applying mitigations or isolating affected workloads. Several maintainers highlight the operational nuance that many production images and container repositories can continue to carry vulnerable wheels until images are rebuilt and redeployed.

Practical impact: who should care and why

  • Research labs and production ML services that accept untrusted models or run user-sent code (public notebook services, multi‑tenant training clusters, hosted model marketplaces) are the highest risk. An attacker can upload or submit a model that, when compiled by the host environment, triggers the DoS and disrupts other tenants or the service itself.
  • Containerized deployments and CI/CD pipelines that build, compile, or test models automatically are vulnerable if they compile user-supplied models with torch.compile/Inductor without sandboxing.
  • Desktop or single‑tenant environments that only run trusted code are lower risk for remote exploitation but should still be cautious: silent failures in numerical results or occasional crashes can still harm integrity-sensitive workloads (regulated ML, financial models, experiments with reproducibility requirements).
Operationally, this vulnerability is not a classic memory-corruption RCE; it’s a logic/miscompilation/resource problem that produces availability loss or silent miscompute. Those failure modes are often harder to detect — they may look like hangs, timeouts, or silent training/inference divergence rather than an obvious exploit fingerprint.

Verification and detection: how to check if you're affected

Use simple host-level and image-level checks to find vulnerable PyTorch builds:
  • Run a quick Python version test inside images, containers, or hosts:
  • python -c "import torch; print(torch.__version__)"
  • python -c "import torch; print(torch.version.cuda, torch.version.git_version)" (the torch.version module exposes additional build metadata such as the CUDA version and git commit)
  • Inspect container manifests and Dockerfiles used to build your environment and check for wheels or package installs that pin torch==2.8.0 or an earlier vulnerable version.
  • For curated cloud images (Azure ML environments, Databricks runtimes, operator-provided images), consult the image manifest metadata and vendor release notes: these often list included framework versions. Do not assume an attestation for one product means all vendor images are safe — Microsoft’s Azure Linux attestation for related PyTorch CVEs is a good signal for that specific product but does not prove other images (Azure ML curated environments, Databricks, marketplace appliances) are unaffected — these must be inventoried individually.
Detection tips:
  • Add unit/regression tests that exercise torch.linalg.lu followed by standard slicing/unpacking in both eager and compiled modes (torch.compile with Inductor). Fail the build if discrepancies appear.
  • Alert on unexpected worker crashes, long-running compilation steps, or repeated timeouts in model-serving endpoints.
  • Monitor image build pipelines and package metadata scans for torch==2.8.0 or pinned wheels.
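As a concrete aid for the package-metadata scans above, a small helper can flag version strings in the reported affected range (a sketch; it assumes the range published by trackers, i.e. releases up to and including 2.8.0, and ignores pre-release suffixes):

```python
def is_affected(version: str) -> bool:
    """Return True if a torch version string falls in the reported
    affected range (<= 2.8.0), per public tracker entries."""
    # Drop local build tags such as "2.8.0+cu121" before comparing.
    base = version.split("+")[0]
    parts = []
    for piece in base.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) <= (2, 8, 0)

print(is_affected("2.8.0+cu121"))  # True
print(is_affected("2.7.1"))        # True
print(is_affected("2.9.0"))        # False
```

Feed this the output of your image or host version checks to triage at scale.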

Immediate mitigation and remediation guidance

At time of writing, many trackers and distributions report no universally available vendor backport; upstream PyTorch activity is the canonical place to monitor for an official fix. In the meantime, apply the following prioritized controls.

1) Short-term mitigations (apply immediately where feasible)

  • Avoid compiling affected code paths: Do not run torch.compile/Inductor on models that call torch.linalg.lu followed by slicing/unpacking. If you control model code, refactor to avoid patterns like torch.linalg.lu(x)[:2] inside compiled sections.
  • Run in eager mode for untrusted inputs: If your service compiles user models by default, offer an option to run compilation-free (eager) execution for untrusted submissions while you patch.
  • Sandbox model execution: Execute user-submitted models in isolated VMs or containers with strict CPU/memory limits and per-job timeouts; this prevents a single miscompiled job from taking down multi-tenant hosts.
  • Rate-limit and pre-validate uploads: Block or flag large-scale model submission patterns or models that include suspicious compilation hooks.
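The sandboxing and timeout controls above can be sketched with standard-library tools (a POSIX-only sketch; the helper name, limits, and the choice of running each untrusted job in a child interpreter are illustrative assumptions about your deployment, not PyTorch APIs):

```python
import resource
import subprocess
import sys

def run_job_sandboxed(code: str, wall_timeout_s: int = 60,
                      cpu_s: int = 60, mem_bytes: int = 4 * 1024**3) -> int:
    """Run untrusted model code in a child interpreter with CPU-time and
    address-space caps so one runaway job cannot exhaust the host."""
    def apply_limits():
        # Applied in the child process just before exec (POSIX only).
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_s, cpu_s))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              preexec_fn=apply_limits,
                              timeout=wall_timeout_s)
        return proc.returncode
    except subprocess.TimeoutExpired:
        return -1  # wall-clock timeout: the job was killed

print(run_job_sandboxed("print('ok')"))  # 0 on success
```

A job that hangs in a miscompiled path is then killed by the CPU or wall-clock limit instead of stalling the serving host; real deployments would add VM or container isolation on top.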

2) Medium-term remediation (once vendor patches are available)

  • Upgrade PyTorch to a patched upstream version as soon as the PyTorch project publishes a fix and wheel. Rebuild any containers and redeploy images that include the updated wheel.
  • Rebuild and republish curated images (Azure ML environments, Databricks runtime images, custom curated images) and enforce image signing and provenance to avoid accidental rollbacks to vulnerable images.
  • Patch CI runners and developer images used to run tests that compile user models; replace vulnerable base images and enforce rebuild pipelines.

3) Long-term hardening

  • Add compiled-vs-eager regression tests to CI that assert identical results for a set of representative operators under torch.compile; treat divergence as a release blocker for performance/compatibility changes.
  • Harden model submission policies: require review for user-submitted compilation flows before they run at scale.
  • Telemetry: instrument model-serving systems to track deviations between expected and actual runtime shapes/latencies and route suspicious jobs to manual review.
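The compiled-vs-eager regression gate described above can be sketched as follows (a sketch: the helper name and tolerance are illustrative; on affected builds, the torch.linalg.lu slicing pattern would surface here as an exception or divergence under the Inductor backend):

```python
import torch

def parity_check(fn, example_inputs, backend="inductor", atol=1e-6):
    """Return True when fn gives matching results in eager mode and under
    torch.compile with the given backend; False if the compiled path
    raises or diverges -- suitable as a CI release gate."""
    eager_out = fn(*example_inputs)
    try:
        compiled_out = torch.compile(fn, backend=backend)(*example_inputs)
    except Exception:
        return False  # compiled path crashed: treat as a blocking failure
    if not isinstance(eager_out, (tuple, list)):
        eager_out, compiled_out = (eager_out,), (compiled_out,)
    return all(torch.allclose(a, b, atol=atol)
               for a, b in zip(eager_out, compiled_out))

# Example gate for the pattern discussed in this article:
# assert parity_check(lambda x: torch.linalg.lu(x)[:2], (torch.randn(4, 4),))
```

Treat a False result for any representative operator as a release blocker.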
A practical checklist for busy teams:
  • Inventory images and hosts (including WSL/DSVM and container registries) and grep for torch wheels or pip installs.
  • If you find torch <= 2.8.0, stop compiling untrusted models on those hosts.
  • Implement per-job resource caps and job-level timeouts.
  • Subscribe to PyTorch project updates and vendor advisories; rebuild images as soon as a patched wheel appears.

Why vendor attestations are helpful — and why they aren’t a substitute for inventory

Vendor attestation (for example, a vendor publishing a VEX/CSAF entry that a specific product includes an affected library) is a valuable automation signal that allows security teams to triage quickly. However, attestations cover only the product and image set the vendor validated — they do not guarantee that other images, curated runtimes, or third-party marketplace appliances are clean. In cloud ecosystems, PyTorch appears across multiple artifacts (base distro images, curated ML images, Databricks runtimes, marketplace VMs), and each must be checked independently. Rely on vendor attestations as one input in your automated pipeline but still run host-level checks to confirm the actual binary version in your estate.

Risk analysis and editorial assessment

Strengths of the public disclosure

  • The public issue and CVE mapping provide a clear technical repro that operations teams can use to test their environments; that makes triage deterministic.
  • Public trackers assign a high availability impact which helps defenders prioritize remediations in multi-tenant or internet-facing contexts.

Notable limitations and residual risks

  • This is primarily an availability/correctness vulnerability rather than a remote code execution primitive; that reduces the immediacy of some threat models but increases operational surprise risk (silent failures or intermittent hangs).
  • Packaging and image lag are the biggest operational problem: many container registries and vendor images still carry older wheels and will remain vulnerable until rebuilt and redeployed. Rebuilding images is often slow and error-prone for complex ML stacks, creating a "long tail" of exposure.
  • Some public write-ups and automated trackers report high CVSS scores and “remote” attack vectors; defenders should match that general view against their specific architecture — e.g., a single-tenant research VM that never compiles untrusted models is far less exposed than a cloud-hosted sandbox that compiles user code.

Detection difficulty

A denial-of-service caused by a compiled operator mismatch can manifest as:
  • sudden worker process termination,
  • a spike in job latency and resource usage,
  • silent numeric divergence (in correctness-sensitive workloads) if code paths return different values.
These symptoms do not map cleanly to standard IDS signatures; they require thoughtful telemetry and comparison checks (compiled vs eager) to detect reliably.

Actionable remediation plan (step-by-step)

  • Inventory (immediate)
  • Enumerate running systems, containers, and images that include PyTorch. Use scripted checks inside containers: python -c "import torch; print(torch.__version__)".
  • Contain (immediate)
  • If you run multi-tenant services that compile user models, disable automatic compilation and move user jobs into isolated execution sandboxes.
  • Harden (short-term)
  • Introduce per-job CPU/memory quotas, strict timeouts, and resource accounting for compiled jobs.
  • Test (short-term)
  • Add CI assertions that run the minimal repro (from the public PyTorch issue) in both eager and compiled modes and fail if compiled results throw or diverge.
  • Patch (medium-term)
  • When PyTorch publishes a patched wheel that resolves the issue, update wheels, rebuild images, and redeploy in a controlled rollout.
  • Validate (post-patch)
  • Re-run the regression suite and the compiled-vs-eager tests across your fleet. Validate image manifests and SHA-signed wheels before mixing into production.
  • Monitor (ongoing)
  • Monitor for CVE updates, distribution advisories, and attestations. Subscribe to vendor VEX/CSAF feeds but confirm via inventory checks.
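The inventory step in the plan above can be automated per host or container with the standard library alone, without importing torch itself (a sketch; the helper name is illustrative):

```python
from importlib import metadata
from typing import Optional

def installed_torch_version() -> Optional[str]:
    """Report the torch wheel version installed in this environment,
    or None when torch is absent, without importing torch itself."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None

found = installed_torch_version()
if found is None:
    print("torch is not installed in this environment")
else:
    print(f"torch {found} installed; compare against the affected range (<= 2.8.0)")
```

Because it reads package metadata rather than importing the library, this check is safe to run inside minimal scanner containers.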

Closing assessment

CVE‑2025‑55551 is a practical illustration of the security surface that emerges when high-performance compilers and dynamic ML frameworks intersect. The underlying issue is not a memory-corruption exploit that yields remote code execution; it is a semantic mismatch and resource-exhaustion problem that yields availability loss and silent miscompute. That makes it both easier to triage (clear repro in the PyTorch issue tracker) and harder to detect in production (effects can be subtle). Immediate defensive action is straightforward: inventory your estate, avoid compiling untrusted models on vulnerable builds, sandbox and resource‑limit jobs, and prioritize rebuilds of curated images once an upstream patch is available. Vendor attestations and distribution advisories will help you prioritize, but host-level verification and CI regression coverage are the operational controls that will prevent surprises in complex, multi-tenant ML deployments.
Teams that rely heavily on PyTorch for production workloads should treat this CVE as a reminder to add compiled-vs-eager regression testing to their release pipelines, tighten model submission policies, and enforce image provenance so that fixes — once published — can be deployed rapidly and consistently across the estate.

Source: MSRC Security Update Guide - Microsoft Security Response Center