CVE-2025-55554: PyTorch 2.8 Overflow, Azure Linux Attestation & Mitigation

PyTorch 2.8.0 carries an integer-overflow correctness bug in the torch.nan_to_num(...).long() code path that has been assigned CVE-2025-55554. Microsoft has publicly attested that Azure Linux includes the impacted open-source library, but that attestation is an inventory statement, not proof that no other Microsoft product or image includes the vulnerable PyTorch binary.

Background / Overview

The CVE record for CVE-2025-55554 describes an integer overflow triggered by code paths around torch.nan_to_num followed by a .long() conversion in PyTorch 2.8.0. Public vulnerability trackers list the defect, assign it a medium severity rating, and point to an upstream PyTorch issue containing a minimal repro and developer discussion. This is primarily a correctness and integer-wrap problem (CWE-190) that can produce incorrect numeric outputs and, in some scenarios, denial-of-service behavior. Because the failure mode is numerical rather than a straightforward memory-corruption crash, it can be silent, producing wrong model outputs without obvious exceptions. Downstream distribution trackers (including the Debian security tracker) and vulnerability aggregators have ingested the record and flagged affected package versions.

Microsoft’s public statement about the issue, available via its MSRC CVE update channel, says Azure Linux includes this open-source library and is therefore potentially affected. It also says that Microsoft has begun publishing machine-readable CSAF/VEX attestations (starting with Azure Linux) and will update the CVE/VEX record if additional Microsoft products are identified as carriers of the vulnerable component. That is an important operational distinction: Microsoft has validated and attested Azure Linux, and pledged to expand attestation scope if more product mappings are discovered.

What the vulnerability actually is — technical summary​

How the bug manifests​

  • The defect occurs when code chains a call to torch.nan_to_num (to replace NaN/Inf values) and then converts the result to an integer type with .long(). In certain compiled execution flows (notably when using torch.compile / Inductor), the compiled path can produce values that differ from eager execution because of integer promotion, rounding, or overflow semantics in lowered code. The upstream GitHub issue includes a minimal repro that demonstrates differing outputs between eager and compiled runs.
  • Public feeds classify the problem as an integer overflow / wraparound (CWE‑190) and assign a medium CVSS rating (around 5.3 on many trackers), reflecting medium severity for remotely reachable, low‑complexity correctness issues. The NVD entry and distribution trackers mirror the same high‑level description.
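
The arithmetic behind the out-of-range cast can be checked without PyTorch. The sketch below is illustrative only: it assumes +Inf is replaced by the int64 maximum inside a float32 tensor (as in the regression skeleton later in this article) and does not reproduce Inductor's actual lowering. The helper names to_float32 and wrap_int64 are hypothetical.

```python
import struct

INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)

def to_float32(value: int) -> float:
    """Round-trip a value through float32 storage, as a float32 tensor would hold it."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]

def wrap_int64(value: int) -> int:
    """Two's-complement wraparound into the int64 range (one possible overflow outcome)."""
    return (value - INT64_MIN) % 2**64 + INT64_MIN

# float32 cannot represent 2**63 - 1, so the stored value rounds up to 2**63.
as_float = to_float32(INT64_MAX)
print(int(as_float))              # 9223372036854775808
print(int(as_float) > INT64_MAX)  # True
print(wrap_int64(int(as_float)))  # -9223372036854775808
```

The stored float is one past the largest int64, so the later integer cast is out of range. In C and C++ the result of such a conversion is undefined, which is one reason two code generators can legitimately return different integers for the same input.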

Practical consequences​

  • Silent data corruption: models that rely on deterministic numerical transforms can train or infer incorrectly without throwing exceptions, producing subtle integrity failures in ML pipelines.
  • Denial of service / instability: in some environments, unexpected numeric values may cascade into exceptions, memory misuse or unhandled edge cases that cause the process to terminate.
  • Exploitability profile: there are no widespread, reliable remote RCE proofs tied to this CVE. The highest‑value attack surface is multi‑tenant or shared compute where an attacker can submit model code or inputs that trigger the problematic compiled path. On single‑tenant, locked environments running trusted code, the practical attack surface is smaller — but correctness risks remain for safety‑critical or regulated workloads.

Microsoft’s attestation: what it covers — and what it doesn’t​

Microsoft’s public wording is precise and procedural: the company has published a CSAF/VEX-style attestation covering the Azure Linux distribution and has stated it will update the CVE record if additional Microsoft products are identified as packaging the affected library. That published attestation is a machine‑readable statement of what Microsoft has validated so far; it is not a universal guarantee that other Microsoft services or images cannot include PyTorch 2.8.0. Treat the Azure Linux attestation as an authoritative signal for that product only until Microsoft explicitly expands the mapping.
Why vendors publish attestations this way:
  • Large vendors ship many images, containers, curated environments and runtime artifacts. Publishing the product scope they have already inventory-checked and attested gives customers a deterministic input for automation and reduces triage time.
  • Inventorying every image, curated environment, and managed runtime is work that proceeds in phases; vendors frequently publish the scope as it is validated and then expand it. Microsoft explicitly commits to that phased mapping in public messaging.
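
To make the "attested scope, not proof of absence" distinction concrete, here is a minimal sketch of reading per-product status from a CSAF 2.0 VEX document. The sample document, product IDs and product names are hypothetical and heavily trimmed; real documents usually nest products under product_tree.branches, which this sketch does not walk.

```python
import json

# Hypothetical, trimmed CSAF/VEX document for illustration only.
SAMPLE_VEX = """
{
  "document": {"category": "csaf_vex", "title": "Hypothetical sample"},
  "product_tree": {
    "full_product_names": [
      {"product_id": "PROD-1", "name": "Example Distro 3.0"},
      {"product_id": "PROD-2", "name": "Example Distro 2.0"}
    ]
  },
  "vulnerabilities": [
    {
      "cve": "CVE-2025-55554",
      "product_status": {
        "known_affected": ["PROD-1"],
        "known_not_affected": ["PROD-2"]
      }
    }
  ]
}
"""

def status_by_product(doc: dict, cve_id: str) -> dict:
    """Map product name -> CSAF status for one CVE; unlisted products are absent."""
    names = {
        p["product_id"]: p["name"]
        for p in doc.get("product_tree", {}).get("full_product_names", [])
    }
    result = {}
    for vuln in doc.get("vulnerabilities", []):
        if vuln.get("cve") != cve_id:
            continue
        for status, product_ids in vuln.get("product_status", {}).items():
            for pid in product_ids:
                result[names.get(pid, pid)] = status
    return result

statuses = status_by_product(json.loads(SAMPLE_VEX), "CVE-2025-55554")
for product in ("Example Distro 3.0", "Example Distro 2.0", "Some Other Image"):
    # A product missing from the document is "not listed", not "not affected".
    print(product, "->", statuses.get(product, "not listed"))
```

The design point is the default value: automation that treats "not listed" as "not affected" would reproduce exactly the misreading this section warns against.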

Is Azure Linux the only Microsoft product that includes PyTorch and could be affected?​

Short answer: No — but Azure Linux is the only Microsoft product Microsoft has publicly attested as part of its VEX/CSAF publication for this CVE so far. That is, Microsoft has declared Azure Linux in scope for its attestation, but this does not technically preclude other Microsoft products or published images from shipping PyTorch and therefore being potentially affected.

Evidence and reasoning​

  • Microsoft‑published, curated images and services are known to include PyTorch: Azure Machine Learning’s Azure Container for PyTorch (ACPT) and Azure ML curated environments explicitly package PyTorch for training and inference workloads, and Microsoft documents and markets these images as containing “the latest PyTorch” for Azure ML customers. If ACPT or other curated images in your environment included PyTorch 2.8.0 (or an affected build) at the time of image creation, those images are operational carriers until updated.
  • Third‑party managed runtimes that run inside Azure but are distinct product surfaces — such as Databricks Runtime for Machine Learning — ship preinstalled PyTorch binaries in machine learning runtimes and therefore represent a plausible presence of the vulnerable library until their runtimes are updated. Databricks explicitly lists PyTorch among the preinstalled libraries for Databricks Runtime for Machine Learning, so customers running those runtimes should check the specific runtime version’s included PyTorch package.
  • Microsoft’s own past outputs and public artifacts (PyTorch builds for Windows, DSVM images, container layers published to registries) create additional surfaces where PyTorch could be present. Those images, curated environments, or published wheels are not necessarily covered by an Azure Linux attestation unless Microsoft enumerates them as such.
Because presence is a build‑time artifact (what package version was baked into an image), the only reliable way to know if a specific Microsoft product, runtime, or published image is affected is to do image‑ or host‑level verification.

What Microsoft’s wording means operationally for customers​

  • If you run Azure Linux base images supplied by Microsoft: you can treat the attestation as a definitive, machine‑readable input for that product and follow Microsoft’s VEX guidance and any patches they publish for Azure Linux.
  • If you run Azure ML curated images / ACPT: these are separate artifacts with their own update cadence. ACPT is explicitly built to ship PyTorch and related ML stacks; customers using those curated images must inspect the curated image metadata or run runtime checks inside the image to confirm the PyTorch version and apply updates as necessary.
  • If you run Databricks runtimes, marketplace images, partner/container images, or Windows/DSVM images: each of these can independently include PyTorch. Microsoft’s Azure Linux attestation does not automatically cover these surfaces — you must verify each image or runtime you use. Databricks runtime release notes document preinstalled PyTorch in their ML runtimes.

Recommended verification and remediation playbook (operational checklist)​

The single reliable indicator of “affected” is the actual PyTorch binary and version present in an image or environment. The following checklist is designed to be deterministic and repeatable.

1) Inventory: find where PyTorch might be present​

  • Enumerate the following in your estate:
    • Azure ML workspaces, curated environments, and ACPT image tags.
    • AKS clusters and the containers they run for training/inference.
    • VM images (Marketplace, DSVM, custom snapshots) and WSL2 distributions.
    • Databricks clusters and the Databricks Runtime version in use.
    • Container registries (ACR, Docker Hub) holding curated or partner images.
    • CI/CD runners and image-build pipelines that publish training images.

2) Verify installed PyTorch version (image / container / host)​

  • For a running Python environment:
    • python -c "import torch; print(torch.__version__)"
    • pip show torch
  • For containers or images: run the image (or inspect Dockerfile/manifests) and perform the same checks.
  • For packaged distributions: check distro package names and changelogs (apt/rpm) to map package versions to upstream PyTorch versions.
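
When an image cannot be started, its unpacked filesystem can still be inspected. The following is a rough sketch that assumes PyTorch was installed from a wheel with pip, which leaves a torch-<version>.dist-info directory behind; conda or distro packages may lay files out differently. The function names read_version and find_torch_versions are hypothetical.

```python
from pathlib import Path

def read_version(dist_info: Path) -> str:
    """Prefer the Version header in METADATA; fall back to the directory name."""
    metadata_file = dist_info / "METADATA"
    if metadata_file.is_file():
        text = metadata_file.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            if line.startswith("Version:"):
                return line.split(":", 1)[1].strip()
    # Directory names look like "torch-2.8.0+cu128.dist-info".
    return dist_info.name[len("torch-"):-len(".dist-info")]

def find_torch_versions(root: str) -> list[tuple[str, str]]:
    """Return (path, version) for every torch-*.dist-info directory under root."""
    found = []
    for dist_info in Path(root).rglob("torch-*.dist-info"):
        if dist_info.is_dir():
            found.append((str(dist_info), read_version(dist_info)))
    return sorted(found)
```

Calling find_torch_versions on the root of an exported image layer or a mounted disk returns one entry per installed copy, which also surfaces secondary virtual environments that a single interpreter check would miss.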

3) If you find PyTorch 2.8.0 (or an affected build)​

  • Prioritize images used in multi‑tenant, internet‑facing services, shared notebooks, model hosting endpoints, and CI runners.
  • If a patched upstream PyTorch release exists, upgrade to a patched wheel and rebuild your images and artifacts.
  • If immediate upgrade is infeasible, apply compensating controls:
    • Avoid or refactor code paths that call torch.nan_to_num(...).long() in compiled/Inductor flows; run such paths in eager mode as a short-term mitigation.
    • Run suspect images in isolated single-tenant VMs while you patch.
    • Enforce image signing and provenance so out-of-date images are not redeployed by mistake.
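
One possible refactor, offered as an assumption rather than upstream guidance, is to replace infinities with values that are exactly representable in the tensor's float type and still inside the int64 range, so the later integer cast is never out of range. The sketch below only computes such bounds; the names largest_float32_below, SAFE_MAX_F32, SAFE_MAX_F64 and SAFE_MIN are hypothetical.

```python
import math
import struct

INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)

def largest_float32_below(limit: float) -> float:
    """Largest float32 strictly below `limit` (limit must be a positive float32 value)."""
    bits = struct.unpack("<I", struct.pack("<f", limit))[0]
    # For positive finite floats, the previous bit pattern is the next smaller value.
    return struct.unpack("<f", struct.pack("<I", bits - 1))[0]

SAFE_MAX_F32 = largest_float32_below(2.0**63)  # 2**63 - 2**39
SAFE_MAX_F64 = math.nextafter(2.0**63, 0.0)    # 2**63 - 1024 (Python 3.9+)
SAFE_MIN = float(INT64_MIN)                    # -2**63 is exactly representable

print(int(SAFE_MAX_F32), int(SAFE_MAX_F64), int(SAFE_MIN))
```

These values could then be passed as the posinf and neginf arguments of torch.nan_to_num in place of the raw int64 limits. Whether that removes the eager/compiled divergence on a given build has not been verified here, so it should be confirmed with the regression test described in step 5.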

4) For managed runtimes (Databricks, Azure ML)​

  • Map runtime version to included libraries (Databricks release notes / Azure ML curated image metadata).
  • If the managed runtime includes affected PyTorch, switch to a runtime that lists a patched PyTorch or install an updated wheel at cluster start time using init scripts.

5) Automated detection & CI controls​

  • Add a CI gate that runs the simple version check (python -c "import torch; print(torch.__version__)") in image build pipelines.
  • Add regression/unit tests that exercise torch.nan_to_num followed by .long() in both eager and compiled modes to detect divergence early. This helps detect silent correctness regressions before they reach production.
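
A version gate has to cope with local build suffixes: wheels commonly report versions such as 2.8.0+cu128, so a plain string comparison against 2.8.0 would miss them. Below is a minimal sketch of such a gate. It assumes that only the 2.8.0 release named in the CVE record is affected; the names AFFECTED_RELEASES, base_release, is_affected and check_installed are hypothetical.

```python
from importlib import metadata

# Assumption: only the release named in the CVE record is treated as affected.
AFFECTED_RELEASES = {"2.8.0"}

def base_release(version: str) -> str:
    """Strip a local build suffix such as '+cu128' or '+cpu'."""
    return version.split("+", 1)[0].strip()

def is_affected(version: str) -> bool:
    """True when the base release matches an affected release exactly."""
    return base_release(version) in AFFECTED_RELEASES

def check_installed(package: str = "torch") -> int:
    """Return a process exit code: 1 if an affected release is installed, else 0."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        print(f"{package} is not installed")
        return 0
    if is_affected(installed):
        print(f"FAIL: {package} {installed} is an affected release")
        return 1
    print(f"OK: {package} {installed}")
    return 0
```

A pipeline step could end with sys.exit(check_installed()) so the build fails when the gate trips. Nightly or development builds (for example versions containing ".dev") are deliberately not matched by this sketch and would need their own policy.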

Which sources confirm the vulnerability and Microsoft’s statement?​

  • The upstream PyTorch issue that reproduces the faulty compiled behavior is published in the PyTorch GitHub issue tracker and shows the exact repro, symptoms and developer triage threads.
  • Major trackers (such as NVD) and distribution trackers list CVE-2025-55554 and summarize the defect as an integer overflow in torch.nan_to_num followed by .long().
  • Debian’s security tracker has ingested the CVE and lists package versions for various Debian releases as vulnerable or needing fixes. This shows that downstream distributions have observed the issue and are tracking remediation.
  • Microsoft’s public attestation and transparency statement (MSRC + VEX/CSAF commitment) for Azure Linux is documented in Microsoft’s advisory/attestation messaging; that attestation is the formal instrument Microsoft used to state the validated scope (Azure Linux) and commit to expand mapping if additional products are found to ship the affected component.

Critical analysis — strengths, limitations and risks in Microsoft’s messaging​

Strengths​

  • Microsoft publishing machine‑readable CSAF/VEX attestations for Azure Linux is a concrete step toward automation and transparency: enterprise security tooling can ingest these attestations and make deterministic decisions for the exact product Microsoft validated. That materially reduces triage time for Azure Linux customers and supports automated remediation pipelines.
  • Microsoft’s promise to update the CVE record if additional products are discovered provides a clear procedural commitment and an observable place customers can monitor for scope expansion. This is preferable to silence or an ambiguous statement.

Limitations and risks​

  • Attestation scope vs. ecosystem presence: an attestation is a statement of what the vendor has validated, not a proof of absence elsewhere. Microsoft’s Azure Linux attestation does not automatically cover other Azure artifacts, curated images such as ACPT, managed runtimes, or third-party images that run on Azure. Microsoft’s wording is procedural and accurate, but security teams must not treat the attestation as exhaustive.
  • Long tail problem: curated images, marketplace appliances, and third‑party partner images can lag or be rebuilt with differing dependency sets, creating a “long tail” of potential exposure that requires targeted discovery and remediation.
  • Silent correctness risk: because this class of defect produces incorrect numeric outputs (not always crashing behavior), standard exploit detection (telemetry for crashes, memory corruption alerts) may not flag affected runs. Organizations must emphasize regression tests and data‑integrity monitoring to detect miscomputations.
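
As one example of data-integrity monitoring, a pipeline can compare the float inputs of a cast with its integer outputs and flag sign flips, which are a typical symptom of wraparound. This is a heuristic sketch under that assumption, not a complete detector; find_sign_flips is a hypothetical helper operating on plain Python lists (for instance, values exported from tensors).

```python
import math

INT64_MIN = -(2**63)

def find_sign_flips(inputs, outputs):
    """Return indices where a positive input became negative, or vice versa."""
    flagged = []
    for i, (x, y) in enumerate(zip(inputs, outputs)):
        if math.isnan(x):
            continue  # NaN has no sign to compare against
        if (x > 0 and y < 0) or (x < 0 and y > 0):
            flagged.append(i)
    return flagged

# +Inf that ends up as the int64 minimum is flagged; the other entries are not.
inputs = [float("inf"), 1.5, float("-inf"), float("nan")]
outputs = [INT64_MIN, 1, INT64_MIN, 0]
print(find_sign_flips(inputs, outputs))  # [0]
```

A check like this catches only one class of miscomputation, so it complements rather than replaces eager-versus-compiled regression tests.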

Recommended short‑term mitigations (when patching is delayed)​

  • Refactor or avoid torch.nan_to_num(...).long() in compiled code paths; run those specific transforms in eager mode where behavior is controlled.
  • Enforce strict image provenance: only deploy curated images from trusted registries and block older, unsigned images.
  • For Databricks or managed runtimes, prefer runtime versions that explicitly document a patched PyTorch or allow injecting a patched wheel at cluster startup.
  • Add unit tests to catch numeric divergence between eager and compiled modes; use these tests in CI to prevent regressions from being promoted.

Practical verification commands & CI snippets​

  • Quick version check inside a running interpreter:
    python -c "import torch; print(torch.__version__)"
  • Docker container check:
    docker run --rm -it <image> python -c "import torch; print(torch.__version__)"
  • Example CI step (shell):
    docker pull "$IMAGE"
    version=$(docker run --rm "$IMAGE" python -c "import torch; print(torch.__version__)")
    # Match local build suffixes too, e.g. 2.8.0+cu128
    case "$version" in
      2.8.0*) echo "Affected PyTorch $version in $IMAGE" >&2; exit 1 ;;
    esac
  • Regression test (Python, CI) skeleton that compares eager vs compiled:
    import torch

    def model(x):
        # Replace NaN/Inf, then cast to int64. The int64 maximum is not exactly
        # representable in float32, so the cast can go out of range.
        x = torch.nan_to_num(x, nan=0, posinf=torch.iinfo(torch.int64).max, neginf=torch.iinfo(torch.int64).min)
        return x.long()

    x = torch.tensor([[float('inf')]])
    eager = model(x)
    compiled = torch.compile(model)(x)
    assert torch.equal(eager, compiled), "Eager vs compiled mismatch detected"

Closing assessment​

Microsoft’s attestation that Azure Linux “includes this open‑source library and is therefore potentially affected” is an accurate representation of the product scope Microsoft has validated and published via CSAF/VEX. That attestation is a valuable, machine‑readable input that Azure Linux customers should act on immediately.
However, Azure Linux is not the only Microsoft product that can or does include PyTorch. Microsoft publishes and supports multiple PyTorch-carrying artifacts (Azure ML curated images such as ACPT, Databricks Machine Learning runtimes running in Azure, published wheels for Windows and DSVMs, and various container images), any of which may have included PyTorch 2.8.0 at build time and therefore deserve host- or image-level verification. Customers must not conflate Microsoft’s attested scope with the universe of Microsoft-distributed images and runtimes; instead, they must perform deterministic inventory and verification and apply patches or mitigations as appropriate.

Action summary for operators:
  • Treat Microsoft’s Azure Linux attestation as authoritative for Azure Linux and follow its VEX guidance.
  • Inventory Azure ML/ACPT, Databricks runtimes, marketplace and curated images, WSL/DSVMs, and container registries in your estate.
  • Verify the runtime torch.__version__ in each image/host; if you find PyTorch 2.8.0, prioritize patching or apply compensating controls.
  • Add CI checks and regression tests to detect divergence between eager and compiled execution, and adopt image‑provenance controls (signing, pinned registries).
Where an immediate vendor‑patched PyTorch wheel is not yet available for your runtime, the pragmatic steps are isolation, refactoring the specific code path, and rebuilding images with a patched wheel as soon as one is published. Public trackers and the upstream GitHub issue remain the primary places to watch for an official upstream patch; distribution and vendor advisories will follow with mapped package versions and replacement artifacts. In short: Azure Linux is the product Microsoft has attested for CVE‑2025‑55554, but it is not the only Microsoft‑managed artifact where PyTorch may appear — and every image, runtime, and managed environment running PyTorch should be verified host‑by‑host until all affected builds have been patched.

Source: MSRC Security Update Guide - Microsoft Security Response Center
 
