NVIDIA’s new Rubin platform, unveiled at CES 2026, promises to redraw the economics and architecture of large-scale inference and agentic AI by combining a six‑chip, rack‑scale co‑design with a new AI‑native storage layer — and with headline claims of up to 10× lower inference cost and dramatically higher tokens‑per‑second for long‑context reasoning. The announcement positions Rubin as a direct continuation of NVIDIA’s extreme co‑design strategy (CPU + GPU + DPU + fabric + storage), and it arrives with broad partner commitments from cloud providers, hyperscalers and AI labs — even as independent verification and full product availability remain subject to rollout schedules and third‑party benchmarks.
Background / Overview
The Rubin platform is NVIDIA’s next rack‑scale push aimed at the inference‑and‑agent era of AI. It stitches together custom compute, next‑generation HBM memory, a high‑performance CPU, and an upgraded DPU/SuperNIC and switch fabric into a single coherent system designed to serve extremely large context windows, mixture‑of‑experts (MoE) models and multi‑agent orchestration at scale. NVIDIA presented Rubin at CES 2026 as the successor to the Blackwell/GB family and the practical architecture for “agentic AI” workloads that require both vast memory and deterministic, high‑throughput inference. Key public claims from the launch and accompanying materials include:
- A six‑chip extreme co‑design: Vera CPU, Rubin GPU, NVLink 6 Switch, ConnectX‑9 SuperNIC, BlueField‑4 DPU, and Spectrum‑6 Ethernet Switch.
- Rack products: the Vera Rubin NVL72 (and NVL144/NVL576 variants in roadmap materials) that combine 72 Rubin GPUs + 36 Vera CPUs in a tight NVLink domain delivering massive intra‑rack bandwidth (NVIDIA’s spec sheets and press materials cite figures such as 260 TB/s for the NVL72 style interconnect).
- Per‑GPU inference performance: each Rubin GPU is claimed to provide up to 50 petaflops of NVFP4 inference compute — a multi‑fold jump over the prior Blackwell family in FP4 inference density.
- Storage and memory re‑thinking: a new Inference Context Memory Storage Platform — powered by BlueField‑4 DPUs and AI‑native shared storage — that shifts key‑value (KV) caches away from expensive GPU HBM and into a shared, low‑latency storage fabric, with vendor claims of ~5× higher tokens/sec and ~5× better power efficiency versus traditional storage approaches.
- Commercial claims: NVIDIA and partners assert 10× lower inference cost per token for MoE and certain long‑context workloads when compared to the previous generation, plus substantial reductions in GPU counts required for training MoE models. These are framed as vendor and partner performance claims during the Rubin launch.
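Taken at face value, the per‑GPU and per‑rack claims above imply easily checkable aggregates. A quick sketch, using the vendor figures as inputs (not independently verified):

```python
# Back-of-envelope rack aggregates implied by the vendor figures above.
# Both inputs are NVIDIA launch claims, not independently verified.
GPUS_PER_RACK = 72          # Vera Rubin NVL72 configuration
NVFP4_PFLOPS_PER_GPU = 50   # claimed peak NVFP4 inference per Rubin GPU

rack_pflops = GPUS_PER_RACK * NVFP4_PFLOPS_PER_GPU
rack_exaflops = rack_pflops / 1000

print(f"Claimed rack peak: {rack_pflops} PFLOPS = {rack_exaflops} EFLOPS NVFP4")
# 72 * 50 = 3600 PFLOPS, i.e. 3.6 exaflops of peak NVFP4 per rack
```

That 3.6‑exaflop product is the basis for the exaflops‑scale rack claims repeated in launch coverage; remember it is a peak figure, not sustained throughput.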
Why Rubin matters: the inference economics problem
Inference has become a larger portion of overall AI operating expense as models move from one‑off research runs to continuous, user‑facing services. For enterprises and platforms, recurring per‑token costs — not just one‑time training invoices — determine whether a product is commercially viable.
- Modern production services run millions to billions of tokens per day; incremental token cost compounds quickly.
- Memory (HBM, LPDDR at CPU) and NVLink coherency are the gating factors for long‑context reasoning: when you expand a model’s context window from tens of thousands of tokens to millions, naive GPU memory strategies balloon cost and reduce parallelism.
- MoE architectures, which activate only a subset of specialists per token, behave differently on hardware: they can be far more compute‑efficient but are sensitive to interconnect latency, memory locality, and runtime scheduling.
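To make the compounding concrete, here is a minimal sketch. The token volume, price, and MoE parameter counts are hypothetical placeholders chosen for illustration, not figures from the launch:

```python
# Illustrative only: the per-token price, daily volume, and MoE sizes
# below are hypothetical, chosen to show how per-token costs compound.
tokens_per_day = 2_000_000_000        # 2B tokens/day, a mid-size service
cost_per_million_tokens = 1.50        # USD, hypothetical blended rate

daily = tokens_per_day / 1_000_000 * cost_per_million_tokens
annual = daily * 365
print(f"Daily: ${daily:,.0f}  Annual: ${annual:,.0f}")
# A 10x reduction in per-token cost scales the bill down proportionally:
print(f"At 10x lower per-token cost: ${annual / 10:,.0f}/year")

# MoE sparsity: only a fraction of parameters is active per token.
total_params = 640e9      # hypothetical 640B-parameter MoE
active_params = 40e9      # hypothetical 40B active per token
print(f"Active fraction per token: {active_params / total_params:.1%}")
```

Even at these modest hypothetical rates the annual bill sits in seven figures, which is why per‑token hardware efficiency, rather than peak FLOPS, is the number enterprises actually buy.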
Technical deep dive: the six‑chip codesign (what’s new)
Vera CPU — the system coordinator
- Architecture and role: Vera is NVIDIA’s custom server CPU that replaces (or augments) the Grace family in Rubin systems. The Vera CPU is described as an 88‑core design (with SMT yielding 176 threads in some materials) optimized for tight CPU‑GPU coherency and rack‑scale confidential computing. It is intended to handle orchestration, pre/post‑processing, and secure memory management in mixed CPU+GPU inference pipelines.
- Memory and features: launch materials mention very large LPDDR5x pools per CPU (public reports vary on the exact LPDDR size per node), and specific Rubin system demos showed extensive LPDDR modules integrated on the motherboard to act as part of the fast memory pool. Vera also introduces rack‑scale confidential computing primitives that aim to protect data across CPU, GPU and NVLink domains (vendor claim). Independent documentation and third‑party detail remain limited at launch.
Rubin GPU — inference density and NVFP4
- Design: Rubin is presented as a next‑generation inference GPU tuned for massive context workloads, with multi‑die chiplet packaging and HBM4 memory. NVIDIA’s materials show two reticle‑size die arrangements per GPU in some Rubin variants, and a separate Rubin CPX variant (monolithic) aimed at specific inference accelerators like video decoding/encoding and long‑context attention acceleration.
- Peak inference performance: vendor claims cite ~50 petaflops of NVFP4 per Rubin GPU for inference — a substantial uplift over previous generations and the basis for the company’s per‑token cost claims. Multiple independent outlets reproduced these figures during CES coverage. Readers should treat peak FLOPS figures as architectural ceilings: achievable sustained throughput in production depends on model topology, data movement, and software stack optimization.
NVLink 6 Switch and NVLink fabric
- The platform introduces a next‑generation NVLink 6 switch that scales intra‑rack coherence and offers order‑of‑magnitude improvements in interconnect bandwidth at rack scale. NVIDIA advertises per‑rack NVLink throughput measures in the hundreds of TB/s (260 TB/s appears in several product slides and reporting). The faster NVLink is central to pooling memory and enabling single‑address‑space strategies for very large models.
ConnectX‑9 SuperNIC and Spectrum‑6 switch
- ConnectX‑9 is the next SuperNIC generation with 800‑Gb/s per link and low‑latency RDMA/ROCE semantics for fast movement of KV caches and remote memory pages. Spectrum‑6 provides the Ethernet switching backbone for wider data‑center fabric integration. These network elements make the Rubin platform more than a rack appliance: they’re intended to be a composable, data‑center‑native building block.
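One way to ground the 800‑Gb/s figure is the time it takes to move a sizeable KV cache at line rate. The link speed is the vendor‑stated ConnectX‑9 number; the cache size is a hypothetical example, and real transfers add protocol and scheduling overhead:

```python
# How fast can a KV cache move over an 800 Gb/s link?
# Link rate is the vendor-stated ConnectX-9 figure; the cache size is
# a hypothetical example, and real transfers add protocol overhead.
link_gbps = 800                          # ConnectX-9 per-link rate (claim)
link_gigabytes_per_sec = link_gbps / 8   # 100 GB/s, ignoring overhead

kv_cache_gb = 40                         # hypothetical per-session KV cache
transfer_sec = kv_cache_gb / link_gigabytes_per_sec
print(f"~{transfer_sec * 1000:.0f} ms to move {kv_cache_gb} GB at line rate")
# ~400 ms at best: fast enough for paging, slow enough that caching
# policy and data locality still matter for tail latency.
```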
BlueField‑4 DPU — offload and AI‑native storage
- BlueField‑4 is NVIDIA’s latest DPU that pairs tightly with ConnectX‑9 and Spectrum‑6 to offload storage, security and data‑movement tasks from the CPU and GPU. Launch commentary emphasized BlueField‑4 as the engine for the Inference Context Memory Storage Platform: running KV cache services, low‑latency replication, and DOCA microservices to present a virtually shared, low‑latency context store to Rubin GPU accelerators. Press materials put the gains in token throughput and power efficiency at about 5× compared with traditional CPU+NVMe stacks — a vendor claim that should be validated under production workloads.
System and rack claims — what’s verifiable now
Several headline specs were repeated across technical outlets at CES:
- Vera Rubin NVL72 (or NVL144 in some SKU variants): racks composed of 72 Rubin GPUs + 36 Vera CPUs with a quoted aggregate interconnect bandwidth figure of ~260 TB/s. Multiple trade publications and NVIDIA’s own slides used these figures as the basis for exaflops‑scale rack claims.
- Per‑GPU NVFP4 throughput: ~50 petaflops per Rubin GPU is consistently reported in vendor and trade coverage, positioning Rubin as a major leap for FP4 inference density. Independent benchmark submissions from cloud partners and MLPerf-style tests have not yet been published at the time of writing; those will be decisive for adoption decisions.
- Memory architecture: Rubin moves to HBM4 at the GPU side and leverages large LPDDR pools on Vera CPUs plus the shared DPU‑backed context storage. Reported per‑GPU HBM capacities and per‑rack “fast memory” totals vary across slides (some materials cite 288 GB HBM4 per GPU and system memory aggregates of tens to hundreds of terabytes depending on rack SKU). Cross‑verified press coverage repeats these figures but rounds differ by outlet; treat specific capacity numbers as vendor disclosures pending datasheet publication.
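A quick cross‑check of the per‑rack “fast memory” framing, using the 288 GB per‑GPU HBM4 figure some slides cite (a vendor disclosure, pending datasheets):

```python
# Cross-check on the "tens of terabytes of fast memory per rack" framing.
# The per-GPU capacity is a vendor disclosure pending datasheet publication.
HBM4_GB_PER_GPU = 288   # cited in some launch materials
GPUS_PER_RACK = 72

rack_hbm_gb = HBM4_GB_PER_GPU * GPUS_PER_RACK
rack_hbm_tb = rack_hbm_gb / 1024
print(f"{rack_hbm_gb} GB = ~{rack_hbm_tb:.1f} TB of HBM4 per NVL72-style rack")
# Roughly 20 TB of GPU HBM alone, before the Vera LPDDR pools and the
# shared DPU-backed context storage are counted.
```

That arithmetic is consistent with the “tens of terabytes” end of the ranges reported for smaller rack SKUs.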
The 10× inference cost claim — context and caveats
NVIDIA’s most provocative claim is that Rubin (and, in related messaging, the preceding GB200 NVL72 platform) can reduce per‑token inference cost by roughly 10× for MoE and long‑context workloads when compared to prior generations. The company and ecosystem partners explain the arithmetic as a combination of:
- NVFP4 low‑precision tensor formats that reduce compute and memory per token;
- MoE model sparsity that activates fewer parameters per token;
- Much higher on‑chip and rack‑level throughput (more tokens per second for the same power); and
- Moving KV caches off expensive HBM and into an optimized shared DPU/Storage layer to reduce HBM usage and increase GPU availability for compute.
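The arithmetic behind such claims is multiplicative, which is both its appeal and its fragility. The sketch below uses hypothetical per‑stage factors (not NVIDIA‑published numbers) purely to show how stacked gains can reach roughly 10×, and how one underperforming stage collapses the product:

```python
# Illustrative decomposition of a ~10x per-token cost claim.
# The individual factors are hypothetical placeholders, NOT published
# NVIDIA numbers; the multiplicative stacking is the point.
factors = {
    "NVFP4 precision (compute/memory per token)": 2.0,
    "MoE sparsity (active params per token)":     2.5,
    "Rack throughput (tokens/sec per watt)":      1.5,
    "KV offload (freed GPU HBM and compute)":     1.35,
}
combined = 1.0
for name, gain in factors.items():
    combined *= gain
    print(f"{name}: x{gain}")
print(f"Combined multiplier: ~{combined:.1f}x")
# Small per-stage gains multiply up; equally, a workload that gains
# nothing from one stage (e.g. a dense model and MoE sparsity) lands
# well below the headline figure.
```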
Caveats and risk factors:
- The 10× improvement has been demonstrated for specific MoE topologies and inference pipelines; dense (non‑MoE) or heavily retrieval‑augmented workloads may see different numbers.
- Real production savings depend on end‑to‑end pipeline costs: datastore access latency, network egress, caching layers, and multi‑tenant scheduling reduce headroom versus a lab benchmark.
- Early adopter cloud pricing and provisioning decisions will determine whether those per‑token hardware gains translate into lower bills for end‑users or merely higher margins for providers.
Storage re‑architecture: Inference Context Memory Storage Platform
One of Rubin’s most significant architectural shifts is treating memory as first‑class infrastructure for inference rather than an expensive byproduct of GPU design. Instead of forcing gigantic KV caches into GPU HBM, Rubin’s architecture routes KV caches to an AI‑native, DPU‑managed shared storage layer that is:
- Exposed to GPUs with very low latency via ConnectX‑9 and NVLink fabrics;
- Managed by BlueField‑4 DPUs running DOCA microservices and purpose‑built caching policies; and
- Claimed to deliver up to 5× the tokens/sec and 5× better power efficiency for memory‑bound agentic workloads compared to conventional NVMe/CPU driven storage stacks.
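The underlying pressure is easy to see with standard transformer KV‑cache arithmetic. The model dimensions below are hypothetical, but the formula itself is generic:

```python
# Why long contexts strain HBM: standard transformer KV-cache sizing.
# bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * tokens
# The model dimensions below are hypothetical example values.
def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

for ctx in (32_000, 128_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> {kv_cache_gb(ctx):7.1f} GB per sequence")
# Growth is linear in context length: a million-token window needs about
# 30x the KV memory of a 32k window, for every concurrent sequence.
```

With these example dimensions a single million‑token sequence needs hundreds of gigabytes of KV state, more than any single GPU’s HBM, which is exactly the pressure a shared, DPU‑managed context store is designed to absorb.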
Practical considerations:
- Software changes are required across runtimes (TensorRT/Torch/XLA), caching policies, and model serving layers to exploit remote KV caches without incurring tail latency.
- Multi‑tenant fairness, eviction policies and cross‑tenant isolation are non‑trivial at scale; BlueField DOCA microservices will need to prove their production maturity.
- The storage‑as‑memory approach reduces HBM pressure but increases reliance on rack network fabrics and DPUs; network saturation and DPU CPU cycles become new bottlenecks if not engineered carefully.
Ecosystem and partner moves: cloud, labs, and first‑wave hosts
Public launch messaging named major cloud providers (AWS, Google Cloud, Microsoft Azure, OCI) as Rubin adopters in 2026, and several cloud partners (CoreWeave, Lambda, Nebius, Nscale) were in the first wave of host announcements. Microsoft is cited as planning to deploy Vera Rubin NVL‑class systems inside its Fairwater AI “superfactories” and to scale to rack‑farm deployments as a strategic element of Azure’s AI infrastructure. Independent reporting confirms Microsoft’s active investments in NVL systems and earlier Blackwell racks; Rubin adoption is presented as the natural migration path for hyperscaler customers.
AI labs and model vendors offered public support in launch materials; quotes attributed to Sam Altman (OpenAI), Dario Amodei (Anthropic), Mark Zuckerberg (Meta), and Elon Musk (xAI/Tesla) highlight the strategic importance of efficient inference platforms for frontier AI progress. Those customer quotes are reproduced in NVIDIA’s launch coverage and media reports; however, the practical economics of public cloud pricing and reserved capacity terms will determine how much of the platform’s theoretical cost reduction is passed to model operators and end customers.
Competitive landscape: AMD, others, and the memory crunch
NVIDIA’s Rubin is a direct challenge to AMD’s Helios rack systems (announced earlier) and other emerging architectures. AMD has emphasized higher HBM capacity per socket (50% more HBM4 capacity in some announcements) while NVIDIA counters with higher HBM4 bandwidth per GPU achieved through silicon and interconnect optimizations. The memory market backdrop is also notable: industry reports indicate data‑center demand for DRAM and HBM is straining supply, with data center projects accounting for a large share of DRAM output — making Rubin’s memory‑first approach both timely and risky if supply bottlenecks persist. The competitive takeaway:
- AMD’s Helios and other rival racks will be judged not only on peak FLOPS but on usable, real‑world inference cost, memory capacity, and ecosystem support (runtimes, toolchains, SWAP/stack support).
- NVIDIA’s advantage is in integrated software (CUDA, TensorRT, Dynamo) and a wide developer base; rivals must overcome both silicon parity and the software ecosystem gap to capture share.
Strengths, risks and practical consequences for enterprise IT
Strengths and opportunities
- Radical cost improvement potential: If the claimed 10× inference cost reductions hold in production for MoE and long‑context workloads, the economics of deploying agentic systems change dramatically — enabling more aggressive productization and wider availability of advanced models.
- Holistic co‑design: CPU + GPU + DPU + fabric + storage co‑design reduces the mismatch between model demands and hardware capabilities — particularly for memory‑heavy, latency‑sensitive workloads.
- Ecosystem commitment: Major cloud partners and model labs are already aligning to Rubin‑class systems, which should accelerate software optimizations and managed offerings that make the new architecture accessible.
Risks and open questions
- Benchmark and reproducibility risk: Vendor demos show potential; independent, community‑validated MLPerf or equivalent benchmarks are required to confirm performance across representative workloads and multi‑tenant environments. Treat vendor figures as directional until independently verified.
- Supply and cost risk for HBM4/LPDDR: The projected gains require broad availability of HBM4 and large LPDDR pools. Global memory tightness and supply chain constraints could delay deployments or raise costs.
- Software and operational complexity: Moving KV caches into DPU‑managed shared storage and exploiting extreme NVLink fabrics requires changes to runtimes, schedulers, and observability tools. Early adopters will need to absorb platform engineering risk.
- Commercial translation: Hardware efficiency does not automatically equal lower customer bills. Cloud pricing, reserved capacity economics, and provider margins will determine how much cost reduction users actually see.
What to watch next (practical timeline)
- Public vendor datasheets and any subsequent corrections — these will confirm exact HBM4/LPDDR capacities and NVLink switch specs.
- MLPerf (or comparable) inference submissions from multiple providers running representative MoE and long‑context workloads on Rubin hardware. Independent results will be the clearest validation of the 10× claim.
- Cloud provider SKUs and pricing: how Azure, AWS, GCP and OCI package Rubin‑based instances and whether per‑token savings are reflected in list prices.
- BlueField‑4 production maturity and DPU microservice ecosystem growth: DOCA support, partner integrations (WEKA, VAST, DDN, etc.), and the operational tooling for managing shared KV caches at scale.
Conclusion: strategic implications for IT and WindowsForum readers
NVIDIA’s Rubin platform represents a clear architectural bet: make memory and interconnect first‑class citizens of inference infrastructure and lean heavily on DPU‑based storage offload to enable vastly longer contexts and more efficient MoE serving. For enterprises, the practical consequences are profound if the platform’s vendor claims hold:
- The threshold for deploying advanced agentic systems and very long‑context LLMs could fall sharply, enabling richer, more persistent assistant and agent experiences.
- Procurement strategies will need to balance early access to Rubin‑class performance against the operational complexity of new runtimes and storage topologies.
- Hardware and cloud budgeting should explicitly model the timing of Rubin availability (second half of 2026 / Q3 2026 windows were cited in NVIDIA roadmaps and trade reports) and require vendor‑backed, workload‑specific proofs of value rather than relying solely on vendor peak figures.
Source: Technobezz Nvidia Launches Rubin AI Platform With 10x Lower Inference Costs