Microsoft’s new ND GB300 v6 virtual machines have cracked a milestone that changes the practical limits of public‑cloud AI inference: one NVL72 rack of Blackwell Ultra GPUs sustained an aggregated throughput of roughly 1.1 million tokens per second, a result validated by an independent benchmark lab and reported by multiple outlets.
Source: SDxCentral Microsoft’s Blackwell Ultra VMs push AI performance past the million token milestone
Background / Overview
Microsoft packaged NVIDIA’s latest GB300 NVL72 rack into the Azure ND GB300 v6 VM family and used an MLPerf‑style Llama 2 70B inference setup to demonstrate this throughput. The test aggregated 18 ND GB300 v6 VMs inside a single NVL72 rack (72 GB300 GPUs plus 36 NVIDIA Grace CPUs in the rack domain) and reported roughly 1,100,948 tokens/sec in total, which averages out to about 15,200 tokens/sec per GPU. The measurement and analysis were publicly discussed by Signal65 and covered by the industry press. The result is being framed inside the industry as a generational leap: Microsoft’s ND GB200 v6 family previously set a high bar (a reported ~865,000 tokens/sec on a GB200 NVL72 rack), and the GB300 result documents a meaningful uplift over that baseline. Both Microsoft’s posts about ND GB200 and the new GB300 briefings are part of the verification fabric for these claims.
Why the million‑token milestone matters
Tokens as the currency of inference
Tokens are the operational unit for LLM workloads: models tokenize inputs and generate tokens one or more at a time. For enterprises, tokens per second (TPS) is now a first‑order performance metric because it maps directly to user concurrency, latency, and — increasingly — cost. Billing models and capacity planning use token consumption to size systems, so raising sustained TPS fundamentally alters cost and scale calculations for conversational agents, retrieval‑augmented generation (RAG) pipelines, and multi‑step agentic systems.
What 1.1M tokens/sec buys an operator
- Support for thousands of concurrent interactive users or very high‑frequency batching for summarization and long‑context reasoning (a rough sizing sketch follows this list).
- Fewer racks required to reach the same QPS (queries per second) for a given service level — reducing scheduling complexity for the largest models.
- The practical ability to host longer context windows or larger KV caches inside a single coherent domain, improving throughput for reasoning workloads.
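To make the concurrency point concrete, here is a rough, back‑of‑the‑envelope sizing sketch: given an aggregate tokens/sec budget, how many interactive users could a rack sustain? The per‑user token rate and the utilization headroom are illustrative assumptions, not figures from Microsoft’s or Signal65’s materials.

```python
# Back-of-the-envelope concurrency sizing from aggregate tokens/sec.
# All per-user figures below are illustrative assumptions.

AGGREGATE_TPS = 1_100_948   # reported rack-level throughput (tokens/sec)
PER_USER_TPS = 30           # assumed token rate one interactive chat user consumes (tokens/sec)
UTILIZATION = 0.5           # assumed usable fraction after bursts, retries, multi-tenant noise

def concurrent_users(aggregate_tps: float, per_user_tps: float, utilization: float) -> int:
    """Estimate sustained concurrent interactive users for a given aggregate TPS."""
    return int(aggregate_tps * utilization / per_user_tps)

if __name__ == "__main__":
    users = concurrent_users(AGGREGATE_TPS, PER_USER_TPS, UTILIZATION)
    print(f"~{users:,} concurrent users at {PER_USER_TPS} tokens/sec each "
          f"and {UTILIZATION:.0%} usable utilization")
    # ~18,349 concurrent users under these assumptions
```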
The technical anatomy: how Azure delivered 1.1M tokens/sec
Rack‑as‑accelerator architecture
The GB300 NVL72 rack treats the entire rack as a single coherent accelerator: 72 Blackwell Ultra GPUs paired with 36 NVIDIA Grace CPUs, pooled HBM/“fast memory” in the tens of terabytes, and a high‑bandwidth NVLink switch fabric inside the rack. This moves the bottleneck from cross‑host communication to intra‑rack coherence, enabling much larger working sets and lower synchronization overhead for attention‑heavy models.
Key hardware numbers reported
- Aggregate throughput on the demonstrated rack: ~1,100,948 tokens/sec.
- Per‑GPU rough average: ~15,200 tokens/sec per GB300 GPU (1.1M ÷ 72).
- Intra‑rack NVLink bandwidth: ~130 TB/s (vendor‑published NVLink/NVSwitch figure for NVL72 configurations).
- Pooled fast memory per rack: ~37–40 TB reported in vendor materials and industry briefings.
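A quick arithmetic sketch showing how rough per‑GPU figures fall out of the rack‑level numbers above. The even split of pooled memory across GPUs is purely illustrative; real allocations depend on the serving stack and the model’s placement.

```python
# Derive rough per-GPU figures from the reported rack-level numbers.
AGGREGATE_TPS = 1_100_948     # reported aggregate tokens/sec for the NVL72 rack
GPUS_PER_RACK = 72            # Blackwell Ultra GPUs per GB300 NVL72 rack
POOLED_FAST_MEMORY_TB = 37    # low end of the ~37-40 TB pooled fast memory figure

per_gpu_tps = AGGREGATE_TPS / GPUS_PER_RACK
per_gpu_memory_tb = POOLED_FAST_MEMORY_TB / GPUS_PER_RACK  # even split, illustrative only

print(f"~{per_gpu_tps:,.0f} tokens/sec per GPU")                                  # ~15,291
print(f"~{per_gpu_memory_tb * 1024:,.0f} GB fast memory per GPU (even split)")    # ~526 GB
```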
Software and numeric formats
The GB300 results lean on software and numerical innovations such as NVFP4 (an FP4‑style low‑precision format), highly tuned inference runtimes (TensorRT‑LLM / NVIDIA Dynamo or similar compiler/serving stacks), and topology‑aware sharding that minimizes cross‑device synchronization. Those software layers are as essential as the hardware for reaching the reported tokens/sec.
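The snippet below is a deliberately simplified illustration of the kind of trick low‑precision formats rely on: mapping higher‑precision weights onto a small 4‑bit grid with a per‑block scale, then dequantizing at compute time. It is not NVFP4 and does not reflect NVIDIA’s actual encoding; it only shows why low‑bit formats shrink memory traffic and why output quality must be validated.

```python
import numpy as np

# Toy 4-bit block quantization: NOT NVFP4, just an illustration of the general idea
# that low-precision formats trade a per-block scale for much smaller weights.

def quantize_4bit(weights: np.ndarray, block_size: int = 32):
    """Quantize a 1-D float array to signed 4-bit codes with one scale per block."""
    pad = (-len(weights)) % block_size
    blocks = np.pad(weights, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0   # int4 range is [-8, 7]
    scales[scales == 0] = 1.0
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_4bit(codes: np.ndarray, scales: np.ndarray, length: int) -> np.ndarray:
    """Reconstruct approximate float weights from 4-bit codes and per-block scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)[:length]

if __name__ == "__main__":
    w = np.random.randn(1000).astype(np.float32)
    codes, scales = quantize_4bit(w)
    w_hat = dequantize_4bit(codes, scales, len(w))
    print("mean abs error:", np.abs(w - w_hat).mean())  # small but nonzero: quality must be checked
```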
Independent validation, benchmarks and caveats
- Signal65 performed a verification analysis and published an assessment that highlights both the technical achievement and the enterprise implications; their write‑up confirms the 1.1M figure as observed in a sustained run and calls out its potential for regulated industries.
- The claim was reported widely in trade coverage and reproduced in MLPerf‑style submission summaries; NVIDIA’s own MLPerf and technical blog posts document Blackwell family gains in inference, reinforcing the direction and scale of improvement.
- MLPerf‑style runs and vendor‑optimized submissions are directional: they prove what is possible under controlled, optimized conditions. They do not automatically translate to identical uplift for every model, prompt shape, tokenizer, or production pipeline.
- Precision tradeoffs (FP4/NVFP4, quantization, sparsity, etc.) can affect model quality. Vendors report adherence to MLPerf accuracy thresholds, but end users must validate quality for their particular prompts and safety constraints.
Strengths — what this enables
- Massive concurrent inference: A single rack that can push >1M tokens/sec lets SaaS and platform providers run very high‑QPS services or consolidate hundreds of smaller racks.
- Feasible long‑context reasoning: The pooled memory and NVLink coherence make long KV caches and extended contexts practically usable inside a single rack, which benefits chain‑of‑thought and multi‑step agents (a rough KV‑cache sizing sketch follows this list).
- Performance‑per‑watt improvements: The vendor and third‑party analyses claim meaningful gains in tokens/sec per watt relative to previous generations, improving the economics and sustainability profile at scale.
- Faster iteration cycles: For labs training and fine‑tuning very large models, the raw compute and rack‑scale fabric can shorten training wall times when combined with efficient pipelines.
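As a rough illustration of the long‑context point above, the sketch below estimates per‑token KV‑cache size for a Llama 2 70B‑class model and how many cached tokens the reported ~37 TB of pooled fast memory could hold. The model configuration values are the commonly published Llama 2 70B parameters (80 layers, 8 grouped‑query KV heads, head dimension 128), and the cache‑only use of the pool is an illustrative assumption, since weights, activations, and runtime overheads also consume that memory.

```python
# Rough KV-cache sizing for a Llama 2 70B-class model (illustrative assumptions).
NUM_LAYERS = 80        # commonly published Llama 2 70B depth
NUM_KV_HEADS = 8       # grouped-query attention KV heads
HEAD_DIM = 128         # per-head dimension
BYTES_PER_VALUE = 2    # FP16/BF16 cache entries

# K and V are cached for every layer and KV head, per generated or prompt token.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
print(f"KV cache per token: ~{kv_bytes_per_token / 1024:.0f} KiB")   # ~320 KiB

POOLED_MEMORY_TB = 37  # low end of the reported pooled fast memory per rack
pooled_bytes = POOLED_MEMORY_TB * 1024**4
tokens_that_fit = pooled_bytes // kv_bytes_per_token
print(f"~{tokens_that_fit:,} cached tokens if the pool held only KV cache")  # ~124 million
```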
Risks, tradeoffs and operational realities
1) Benchmark vs. production delta
Benchmarks are controlled scenarios. End‑to‑end applications include retrieval latencies, storage I/O, network tail latency, and multi‑tenant noise; those factors frequently cut the headline tokens/sec in half, or worse, in practice. Validate with your full pipeline.
2) Software and numeric lock‑in
The throughput advantage depends on NVFP4 and vendor inference toolchains. That introduces potential portability and vendor‑stack lock‑in: migrating optimized FP4 stacks to other platforms or maintaining numerical parity across providers can be non‑trivial. Enterprises with regulatory or portability requirements must weigh this carefully.
3) Cost and procurement constraints
High‑density racks consume significant power and cooling resources. List price for such racks is only one part of TCO: power, facility modifications (liquid cooling), specialized networking (Quantum‑X800 InfiniBand + ConnectX‑8), and support contracts are substantial. For many teams, renting managed ND GB300 v6 capacity will be the only realistic path — and that raises concerns about capacity availability and long‑term pricing.
4) Operational complexity
To reliably exploit NVL72 coherence you need topology‑aware schedulers, placement guarantees (so a job grabs contiguous NVLink domains), storage that can feed GPUs at high sustained rates, and power/facility planning. This is not turnkey for most IT teams.
5) Governance, sovereignty and export controls
Consolidating frontier compute into hyperscalers concentrates power and raises policy questions: data residency, regulated workloads, cross‑border export controls on advanced compute, and national security filters come into play when access to huge model training/serving capacity is centralized. Enterprises in regulated industries must verify compliance features and geographic availability.
Practical guidance for enterprise architects
Validate, don’t assume
- Run end‑to‑end pilots using your production prompt mix, retrieval pipeline, and client‑side latency budgets. Simulated MLPerf runs are useful signals, but they’re not a production SLA.
- Measure both Time to First Token (TTFT) and sustained Tokens Per Second (TPS) with real-world retrieval, caching, and multi‑tenant loads. Use the same tokenizer and generation settings your production models will use.
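A minimal sketch of how TTFT and sustained TPS might be measured against a streaming endpoint. The `stream_completion` client and its parameters are hypothetical placeholders for whatever serving API you actually use; the timing logic is the point.

```python
import time
from typing import Iterable

def measure_ttft_and_tps(stream: Iterable[str]) -> tuple[float, float]:
    """Measure time-to-first-token (seconds) and sustained tokens/sec from a token stream."""
    start = time.perf_counter()
    first_token_time = None
    token_count = 0
    for _token in stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        token_count += 1
    end = time.perf_counter()
    ttft = (first_token_time - start) if first_token_time is not None else float("inf")
    tps = token_count / (end - start) if end > start else 0.0
    return ttft, tps

# Usage (hypothetical client; substitute your real serving API, tokenizer, and prompt mix):
# stream = stream_completion(model="your-model", prompt=production_prompt, max_tokens=512)
# ttft, tps = measure_ttft_and_tps(stream)
# print(f"TTFT: {ttft * 1000:.0f} ms, sustained TPS: {tps:.1f}")
```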
Tech checklist before committing
- Confirm NVLink domain placement guarantees in your region and subscription.
- Validate that model quantization (FP4 or other) preserves acceptable output quality across your prompt set.
- Ensure storage and network bandwidth can sustain the GPUs (I/O starvation is a common practical bottleneck).
- Negotiate SLAs that cover availability and resource reservation for contiguous NVL72 domains if you plan to rely on consistent performance.
Cost/scale modeling
- Model per‑request cost by pairing vendor‑reported tokens/sec with expected real‑world utilization, not vendor peak numbers (a worked sketch follows this list).
- Consider a hybrid strategy: use ND GB300 racks for peak, low‑latency production traffic and less‑costly instances for background batch or fine‑tuning workloads.
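A minimal cost‑modeling sketch in the spirit of the first bullet above. Every input here (hourly cost, utilization, tokens per request) is an illustrative assumption to be replaced with your negotiated pricing and measured traffic; none of these figures are published Azure rates.

```python
# Illustrative per-request cost model: all prices and rates below are assumptions,
# not published Azure ND GB300 v6 pricing.

RACK_HOURLY_COST_USD = 400.0   # assumed fully loaded hourly cost for rack-scale capacity
PEAK_RACK_TPS = 1_100_948      # vendor-reported peak aggregate tokens/sec
REAL_UTILIZATION = 0.4         # assumed fraction of peak achieved with real pipelines
TOKENS_PER_REQUEST = 1_500     # assumed average prompt + completion tokens per request

effective_tps = PEAK_RACK_TPS * REAL_UTILIZATION
requests_per_hour = effective_tps * 3600 / TOKENS_PER_REQUEST
cost_per_request = RACK_HOURLY_COST_USD / requests_per_hour

print(f"Effective throughput: {effective_tps:,.0f} tokens/sec")
print(f"Requests per hour: {requests_per_hour:,.0f}")
print(f"Cost per request: ${cost_per_request:.5f}")
```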
Competitive and market implications
- Hyperscalers that deploy GB300‑class racks gain a practical edge in offering extremely high QPS inference services, and that advantage compounds because large providers can also offer integration, compliance, and geographic reach.
- Specialized GPU clouds and “neoclouds” may respond with differentiated pricing or alternative accelerators, but the rack‑as‑accelerator model prioritizes memory and fabric bandwidth — areas where NVLink plus in‑network compute give Blackwell Ultra family designs a meaningful advantage.
- Expect continued MLPerf rounds and more independent validation runs; watch for third‑party studies that replicate vendor stacks in non‑vendor‑controlled testbeds.
What to watch next
- MLPerf and independent labs running longer, application‑level tests (RAG, multimodal pipelines, agentic workflows) that include storage and retrieval in the loop. These will help quantify the real production delta.
- Availability of ND GB300 v6 SKUs across Azure regions and the pricing schemes Microsoft offers for provisioned throughput versus on‑demand use.
- Competitor responses: whether other clouds match GB300 (or offer alternative design wins such as TPUs or other accelerators) and how quickly they publish comparable MLPerf entries.
Final analysis and conclusion
Microsoft’s ND GB300 v6 result — a sustained, validated rack‑level throughput of roughly 1.1 million tokens per second on Llama 2 70B — is a real and consequential engineering milestone. It demonstrates that the rack‑as‑accelerator design, paired with new numerical formats and inference compilers, can materially raise the throughput and efficiency ceilings for inference in the public cloud. Signal65’s validation and multiple vendor and press write‑ups corroborate the headline number and its immediate implications for enterprise‑grade AI deployments.

At the same time, the path from benchmark to robust production value is nontrivial. Organizations must validate model quality under FP4/quantized regimes, account for storage and retrieval latencies, negotiate placement guarantees, and evaluate the TCO implications of running or consuming extremely dense rack‑scale instances. The ND GB300 v6 era broadens what’s technically possible; responsible adoption will require disciplined benchmarking, careful SLAs, and architectural work to ensure portability and governance.

For teams planning to exploit this new capability, the sensible strategy is pragmatic: use ND GB300 v6 for workloads where the throughput and extended context materially change product capability (high‑concurrency chat, real‑time RAG, multi‑step agent orchestration), validate end‑to‑end performance with real pipelines, and build fallback paths for portability and cost management. The million‑token barrier is broken; the next challenge is turning that headline capability into repeatable, secure, and cost‑effective production services.
