NVIDIA’s new Rubin platform, unveiled at CES 2026, promises to redraw the economics and architecture of large-scale inference and agentic AI by combining a six‑chip, rack‑scale co‑design with a new AI‑native storage layer — and with headline claims of up to 10× lower inference cost and dramatically higher tokens‑per‑second for long‑context reasoning. The announcement positions Rubin as a direct continuation of NVIDIA’s extreme co‑design strategy (CPU + GPU + DPU + fabric + storage), and it arrives with broad partner commitments from cloud providers, hyperscalers and AI labs — even as independent verification and full product availability remain subject to rollout schedules and third‑party benchmarks.
Background / Overview
The Rubin platform is NVIDIA’s next rack‑scale push aimed at the inference‑and‑agent era of AI. It stitches together custom compute, next‑generation HBM memory, a high‑performance CPU, and an upgraded DPU/SuperNIC and switch fabric into a single coherent system designed to serve extremely large context windows, mixture‑of‑experts (MoE) models and multi‑agent orchestration at scale. NVIDIA presented Rubin at CES 2026 as the successor to the Blackwell/GB family and the practical architecture for “agentic AI” workloads that require both vast memory and deterministic, high‑throughput inference. Key public claims from the launch and accompanying materials include:
- A six‑chip extreme co‑design: Vera CPU, Rubin GPU, NVLink 6 Switch, ConnectX‑9 SuperNIC, BlueField‑4 DPU, and Spectrum‑6 Ethernet Switch.
- Rack products: the Vera Rubin NVL72 (and NVL144/NVL576 variants in roadmap materials) that combine 72 Rubin GPUs + 36 Vera CPUs in a tight NVLink domain delivering massive intra‑rack bandwidth (NVIDIA’s spec sheets and press materials cite figures such as 260 TB/s for the NVL72 style interconnect).
- Per‑GPU inference performance: each Rubin GPU is claimed to provide up to 50 petaflops of NVFP4 inference compute — a multi‑fold jump over the prior Blackwell family in FP4 inference density.
- Storage and memory re‑thinking: a new Inference Context Memory Storage Platform — powered by BlueField‑4 DPUs and AI‑native shared storage — that shifts key‑value (KV) caches away from expensive GPU HBM and into a shared, low‑latency storage fabric, with vendor claims of ~5× higher tokens/sec and ~5× better power efficiency versus traditional storage approaches.
- Commercial claims: NVIDIA and partners assert 10× lower inference cost per token for MoE and certain long‑context workloads when compared to the previous generation, plus substantial reductions in GPU counts required for training MoE models. These are framed as vendor and partner performance claims during the Rubin launch.
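Taken at face value, the per‑GPU and per‑rack claims above imply easily checkable aggregates. A quick sketch, using the vendor figures as inputs (not independently verified):

```python
# Back-of-envelope rack aggregates implied by the vendor figures above.
# Both inputs are NVIDIA launch claims, not independently verified.
GPUS_PER_RACK = 72          # Vera Rubin NVL72 configuration
NVFP4_PFLOPS_PER_GPU = 50   # claimed peak NVFP4 inference per Rubin GPU

rack_pflops = GPUS_PER_RACK * NVFP4_PFLOPS_PER_GPU
rack_exaflops = rack_pflops / 1000

print(f"Claimed rack peak: {rack_pflops} PFLOPS = {rack_exaflops} EFLOPS NVFP4")
# 72 * 50 = 3600 PFLOPS, i.e. 3.6 exaflops of peak NVFP4 per rack
```

That 3.6‑exaflop product is the basis for the exaflops‑scale rack claims repeated in launch coverage; remember it is a peak figure, not sustained throughput.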
Why Rubin matters: the inference economics problem
Inference has become a larger portion of overall AI operating expense as models move from one‑off research runs to continuous, user‑facing services. For enterprises and platforms, recurring per‑token costs — not just one‑time training invoices — determine whether a product is commercially viable.
- Modern production services run millions to billions of tokens per day; incremental token cost compounds quickly.
- Memory (HBM, LPDDR at CPU) and NVLink coherency are the gating factors for long‑context reasoning: when you expand a model’s context window from tens of thousands of tokens to millions, naive GPU memory strategies balloon cost and reduce parallelism.
- MoE architectures, which activate only a subset of specialists per token, behave differently on hardware: they can be far more compute‑efficient but are sensitive to interconnect latency, memory locality, and runtime scheduling.
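To make the compounding concrete, here is a minimal sketch. The token volume, price, and MoE parameter counts are hypothetical placeholders chosen for illustration, not figures from the launch:

```python
# Illustrative only: the per-token price, daily volume, and MoE sizes
# below are hypothetical, chosen to show how per-token costs compound.
tokens_per_day = 2_000_000_000        # 2B tokens/day, a mid-size service
cost_per_million_tokens = 1.50        # USD, hypothetical blended rate

daily = tokens_per_day / 1_000_000 * cost_per_million_tokens
annual = daily * 365
print(f"Daily: ${daily:,.0f}  Annual: ${annual:,.0f}")
# A 10x reduction in per-token cost scales the bill down proportionally:
print(f"At 10x lower per-token cost: ${annual / 10:,.0f}/year")

# MoE sparsity: only a fraction of parameters is active per token.
total_params = 640e9      # hypothetical 640B-parameter MoE
active_params = 40e9      # hypothetical 40B active per token
print(f"Active fraction per token: {active_params / total_params:.1%}")
```

Even at these modest hypothetical rates the annual bill sits in seven figures, which is why per‑token hardware efficiency, rather than peak FLOPS, is the number enterprises actually buy.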
Technical deep dive: the six‑chip codesign (what’s new)
Vera CPU — the system coordinator
- Architecture and role: Vera is NVIDIA’s custom server CPU that replaces (or augments) the Grace family in Rubin systems. The Vera CPU is described as an 88‑core design (with SMT yielding 176 threads in some materials) optimized for tight CPU‑GPU coherency and rack‑scale confidential computing. It is intended to handle orchestration, pre/post‑processing, and secure memory management in mixed CPU+GPU inference pipelines.
- Memory and features: launch materials mention very large LPDDR5x pools per CPU (public reports vary on the exact LPDDR size per node), and specific Rubin system demos showed extensive LPDDR modules integrated on the motherboard to act as part of the fast memory pool. Vera also introduces rack‑scale confidential computing primitives that aim to protect data across CPU, GPU and NVLink domains (vendor claim). Independent documentation and third‑party detail remain limited at launch.
Rubin GPU — inference density and NVFP4
- Design: Rubin is presented as a next‑generation inference GPU tuned for massive context workloads, with multi‑die chiplet packaging and HBM4 memory. NVIDIA’s materials show two reticle‑size die arrangements per GPU in some Rubin variants, and a separate Rubin CPX variant (monolithic) aimed at specific inference accelerators like video decoding/encoding and long‑context attention acceleration.
- Peak inference performance: vendor claims cite ~50 petaflops of NVFP4 per Rubin GPU for inference — a substantial uplift over previous generations and the basis for the company’s per‑token cost claims. Multiple independent outlets reproduced these figures during CES coverage. Readers should treat peak FLOPS figures as architectural ceilings: achievable sustained throughput in production depends on model topology, data movement, and software stack optimization.
NVLink 6 Switch and NVLink fabric
- The platform introduces a next‑generation NVLink 6 switch that scales intra‑rack coherence and offers order‑of‑magnitude improvements in interconnect bandwidth at rack scale. NVIDIA advertises per‑rack NVLink throughput measures in the hundreds of TB/s (260 TB/s appears in several product slides and reporting). The faster NVLink is central to pooling memory and enabling single‑address‑space strategies for very large models.
ConnectX‑9 SuperNIC and Spectrum‑6 switch
- ConnectX‑9 is the next SuperNIC generation with 800‑Gb/s per link and low‑latency RDMA/ROCE semantics for fast movement of KV caches and remote memory pages. Spectrum‑6 provides the Ethernet switching backbone for wider data‑center fabric integration. These network elements make the Rubin platform more than a rack appliance: they’re intended to be a composable, data‑center‑native building block.
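One way to ground the 800‑Gb/s figure is the time it takes to move a sizeable KV cache at line rate. The link speed is the vendor‑stated ConnectX‑9 number; the cache size is a hypothetical example, and real transfers add protocol and scheduling overhead:

```python
# How fast can a KV cache move over an 800 Gb/s link?
# Link rate is the vendor-stated ConnectX-9 figure; the cache size is
# a hypothetical example, and real transfers add protocol overhead.
link_gbps = 800                          # ConnectX-9 per-link rate (claim)
link_gigabytes_per_sec = link_gbps / 8   # 100 GB/s, ignoring overhead

kv_cache_gb = 40                         # hypothetical per-session KV cache
transfer_sec = kv_cache_gb / link_gigabytes_per_sec
print(f"~{transfer_sec * 1000:.0f} ms to move {kv_cache_gb} GB at line rate")
# ~400 ms at best: fast enough for paging, slow enough that caching
# policy and data locality still matter for tail latency.
```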
BlueField‑4 DPU — offload and AI‑native storage
- BlueField‑4 is NVIDIA’s latest DPU that pairs tightly with ConnectX‑9 and Spectrum‑6 to offload storage, security and data‑movement tasks from the CPU and GPU. Launch commentary emphasized BlueField‑4 as the engine for the Inference Context Memory Storage Platform: running KV cache services, low‑latency replication, and DOCA microservices to present a virtually shared, low‑latency context store to Rubin GPU accelerators. Press materials put the gains in token throughput and power efficiency at about 5× compared with traditional CPU+NVMe stacks — a vendor claim that should be validated under production workloads.
System and rack claims — what’s verifiable now
Several headline specs were repeated across technical outlets at CES:
- Vera Rubin NVL72 (or NVL144 in some SKU variants): racks composed of 72 Rubin GPUs + 36 Vera CPUs with a quoted aggregate interconnect bandwidth figure of ~260 TB/s. Multiple trade publications and NVIDIA’s own slides used these figures as the basis for exaflops‑scale rack claims.
- Per‑GPU NVFP4 throughput: ~50 petaflops per Rubin GPU is consistently reported in vendor and trade coverage, positioning Rubin as a major leap for FP4 inference density. Independent benchmark submissions from cloud partners and MLPerf-style tests have not yet been published at the time of writing; those will be decisive for adoption decisions.
- Memory architecture: Rubin moves to HBM4 at the GPU side and leverages large LPDDR pools on Vera CPUs plus the shared DPU‑backed context storage. Reported per‑GPU HBM capacities and per‑rack “fast memory” totals vary across slides (some materials cite 288 GB HBM4 per GPU and system memory aggregates of tens to hundreds of terabytes depending on rack SKU). Cross‑verified press coverage repeats these figures but rounds differ by outlet; treat specific capacity numbers as vendor disclosures pending datasheet publication.
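A quick cross‑check of the per‑rack “fast memory” framing, using the 288 GB per‑GPU HBM4 figure some slides cite (a vendor disclosure, pending datasheets):

```python
# Cross-check on the "tens of terabytes of fast memory per rack" framing.
# The per-GPU capacity is a vendor disclosure pending datasheet publication.
HBM4_GB_PER_GPU = 288   # cited in some launch materials
GPUS_PER_RACK = 72

rack_hbm_gb = HBM4_GB_PER_GPU * GPUS_PER_RACK
rack_hbm_tb = rack_hbm_gb / 1024
print(f"{rack_hbm_gb} GB = ~{rack_hbm_tb:.1f} TB of HBM4 per NVL72-style rack")
# Roughly 20 TB of GPU HBM alone, before the Vera LPDDR pools and the
# shared DPU-backed context storage are counted.
```

That arithmetic is consistent with the “tens of terabytes” end of the ranges reported for smaller rack SKUs.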
The 10× inference cost claim — context and caveats
NVIDIA’s most provocative claim is that Rubin (and, in related messaging, the preceding GB200 NVL72 platform) can reduce per‑token inference cost by roughly 10× for MoE and long‑context workloads when compared to prior generations. The company and ecosystem partners explain the arithmetic as a combination of:
- NVFP4 low‑precision tensor formats that reduce compute and memory per token;
- MoE model sparsity that activates fewer parameters per token;
- Much higher on‑chip and rack‑level throughput (more tokens per second for the same power); and
- Moving KV caches off expensive HBM and into an optimized shared DPU/Storage layer to reduce HBM usage and increase GPU availability for compute.
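The arithmetic behind such claims is multiplicative, which is both its appeal and its fragility. The sketch below uses hypothetical per‑stage factors (not NVIDIA‑published numbers) purely to show how stacked gains can reach roughly 10×, and how one underperforming stage collapses the product:

```python
# Illustrative decomposition of a ~10x per-token cost claim.
# The individual factors are hypothetical placeholders, NOT published
# NVIDIA numbers; the multiplicative stacking is the point.
factors = {
    "NVFP4 precision (compute/memory per token)": 2.0,
    "MoE sparsity (active params per token)":     2.5,
    "Rack throughput (tokens/sec per watt)":      1.5,
    "KV offload (freed GPU HBM and compute)":     1.35,
}
combined = 1.0
for name, gain in factors.items():
    combined *= gain
    print(f"{name}: x{gain}")
print(f"Combined multiplier: ~{combined:.1f}x")
# Small per-stage gains multiply up; equally, a workload that gains
# nothing from one stage (e.g. a dense model and MoE sparsity) lands
# well below the headline figure.
```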
Caveats and risk factors:
- The 10× improvement has been demonstrated for specific MoE topologies and inference pipelines; dense (non‑MoE) or heavily retrieval‑augmented workloads may see different numbers.
- Real production savings depend on end‑to‑end pipeline costs: datastore access latency, network egress, caching layers, and multi‑tenant scheduling reduce headroom versus a lab benchmark.
- Early adopter cloud pricing and provisioning decisions will determine whether those per‑token hardware gains translate into lower bills for end‑users or merely higher margins for providers.
Storage re‑architecture: Inference Context Memory Storage Platform
One of Rubin’s most significant architectural shifts is treating memory as first‑class infrastructure for inference rather than an expensive byproduct of GPU design. Instead of forcing gigantic KV caches into GPU HBM, Rubin’s architecture routes KV caches to an AI‑native, DPU‑managed shared storage layer that is:
- Exposed to GPUs with very low latency via ConnectX‑9 and NVLink fabrics;
- Managed by BlueField‑4 DPUs running DOCA microservices and purpose‑built caching policies; and
- Claimed to deliver up to 5× the tokens/sec and 5× better power efficiency for memory‑bound agentic workloads compared to conventional NVMe/CPU driven storage stacks.
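The underlying pressure is easy to see with standard transformer KV‑cache arithmetic. The model dimensions below are hypothetical, but the formula itself is generic:

```python
# Why long contexts strain HBM: standard transformer KV-cache sizing.
# bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * tokens
# The model dimensions below are hypothetical example values.
def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

for ctx in (32_000, 128_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> {kv_cache_gb(ctx):7.1f} GB per sequence")
# Growth is linear in context length: a million-token window needs about
# 30x the KV memory of a 32k window, for every concurrent sequence.
```

With these example dimensions a single million‑token sequence needs hundreds of gigabytes of KV state, more than any single GPU’s HBM, which is exactly the pressure a shared, DPU‑managed context store is designed to absorb.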
Practical considerations:
- Software changes are required across runtimes (TensorRT/Torch/XLA), caching policies, and model serving layers to exploit remote KV caches without incurring tail latency.
- Multi‑tenant fairness, eviction policies and cross‑tenant isolation are non‑trivial at scale; BlueField DOCA microservices will need to prove their production maturity.
- The storage‑as‑memory approach reduces HBM pressure but increases reliance on rack network fabrics and DPUs; network saturation and DPU CPU cycles become new bottlenecks if not engineered carefully.
Ecosystem and partner moves: cloud, labs, and first‑wave hosts
Public launch messaging named major cloud providers (AWS, Google Cloud, Microsoft Azure, OCI) as Rubin adopters in 2026, and several cloud partners (CoreWeave, Lambda, Nebius, Nscale) were in the first wave of host announcements. Microsoft is cited as planning to deploy Vera Rubin NVL‑class systems inside its Fairwater AI “superfactories” and to scale to rack‑farm deployments as a strategic element of Azure’s AI infrastructure. Independent reporting confirms Microsoft’s active investments in NVL systems and earlier Blackwell racks; Rubin adoption is presented as the natural migration path for hyperscaler customers.
AI labs and model vendors offered public support in launch materials; quotes attributed to Sam Altman (OpenAI), Dario Amodei (Anthropic), Mark Zuckerberg (Meta), and Elon Musk (xAI/Tesla) highlight the strategic importance of efficient inference platforms for frontier AI progress. Those customer quotes are reproduced in NVIDIA’s launch coverage and media reports; however, the practical economics of public cloud pricing and reserved capacity terms will determine how much of the platform’s theoretical cost reduction is passed to model operators and end customers.
Competitive landscape: AMD, others, and the memory crunch
NVIDIA’s Rubin is a direct challenge to AMD’s Helios rack systems (announced earlier) and other emerging architectures. AMD has emphasized higher HBM capacity per socket (50% more HBM4 capacity in some announcements) while NVIDIA counters with higher HBM4 bandwidth per GPU achieved through silicon and interconnect optimizations. The memory market backdrop is also notable: industry reports indicate data‑center demand for DRAM and HBM is straining supply, with data center projects accounting for a large share of DRAM output — making Rubin’s memory‑first approach both timely and risky if supply bottlenecks persist. The competitive takeaway:
- AMD’s Helios and other rival racks will be judged not only on peak FLOPS but on usable, real‑world inference cost, memory capacity, and ecosystem support (runtimes, toolchains, SWAP/stack support).
- NVIDIA’s advantage is in integrated software (CUDA, TensorRT, Dynamo) and a wide developer base; rivals must overcome both silicon parity and the software ecosystem gap to capture share.
Strengths, risks and practical consequences for enterprise IT
Strengths and opportunities
- Radical cost improvement potential: If the claimed 10× inference cost reductions hold in production for MoE and long‑context workloads, the economics of deploying agentic systems change dramatically — enabling more aggressive productization and wider availability of advanced models.
- Holistic co‑design: CPU + GPU + DPU + fabric + storage co‑design reduces the mismatch between model demands and hardware capabilities — particularly for memory‑heavy, latency‑sensitive workloads.
- Ecosystem commitment: Major cloud partners and model labs are already aligning to Rubin‑class systems, which should accelerate software optimizations and managed offerings that make the new architecture accessible.
Risks and open questions
- Benchmark and reproducibility risk: Vendor demos show potential; independent, community‑validated MLPerf or equivalent benchmarks are required to confirm performance across representative workloads and multi‑tenant environments. Treat vendor figures as directional until independently verified.
- Supply and cost risk for HBM4/LPDDR: The projected gains require broad availability of HBM4 and large LPDDR pools. Global memory tightness and supply chain constraints could delay deployments or raise costs.
- Software and operational complexity: Moving KV caches into DPU‑managed shared storage and exploiting extreme NVLink fabrics requires changes to runtimes, schedulers, and observability tools. Early adopters will need to absorb platform engineering risk.
- Commercial translation: Hardware efficiency does not automatically equal lower customer bills. Cloud pricing, reserved capacity economics, and provider margins will determine how much cost reduction users actually see.
What to watch next (practical timeline)
- Public vendor datasheets and any subsequent corrections — these will confirm exact HBM4/LPDDR capacities and NVLink switch specs.
- MLPerf (or comparable) inference submissions from multiple providers running representative MoE and long‑context workloads on Rubin hardware. Independent results will be the clearest validation of the 10× claim.
- Cloud provider SKUs and pricing: how Azure, AWS, GCP and OCI package Rubin‑based instances and whether per‑token savings are reflected in list prices.
- BlueField‑4 production maturity and DPU microservice ecosystem growth: DOCA support, partner integrations (WEKA, VAST, DDN, etc.), and the operational tooling for managing shared KV caches at scale.
Conclusion: strategic implications for IT and WindowsForum readers
NVIDIA’s Rubin platform represents a clear architectural bet: make memory and interconnect first‑class citizens of inference infrastructure and lean heavily on DPU‑based storage offload to enable vastly longer contexts and more efficient MoE serving. For enterprises, the practical consequences are profound if the platform’s vendor claims hold:
- The threshold for deploying advanced agentic systems and very long‑context LLMs could fall sharply, enabling richer, more persistent assistant and agent experiences.
- Procurement strategies will need to balance early access to Rubin‑class performance against the operational complexity of new runtimes and storage topologies.
- Hardware and cloud budgeting should explicitly model the timing of Rubin availability (second half of 2026 / Q3 2026 windows were cited in NVIDIA roadmaps and trade reports) and require vendor‑backed, workload‑specific proofs of value rather than relying solely on vendor peak figures.
Source: Technobezz Nvidia Launches Rubin AI Platform With 10x Lower Inference Costs