Microsoft Azure’s HBv5 virtual machines have reached general availability, marking a seismic shift in cloud HPC: a custom AMD EPYC 9V64H processor paired with on‑package HBM3 memory delivers nearly 7 TB/s of sustained memory bandwidth — a change that transforms which classes of HPC and memory‑bound workloads make sense to run in the cloud.
Background / Overview
Azure’s HB‑series has been Microsoft’s HPC workhorse for memory‑bandwidth‑sensitive simulations and MPI‑based clusters since HBv2 launched in 2019. HBv2 used AMD EPYC “Rome”‑class parts and delivered high per‑node bandwidth for its time; HBv3 followed with 3D V‑Cache Milan‑X parts; HBv4 moved to Genoa‑X with even higher core counts. Now HBv5 replaces the DDR5 DIMM model with High‑Bandwidth Memory (HBM3) on‑package for an order‑of‑magnitude increase in streaming throughput. Microsoft’s technical documentation and independent testing describe the HBv5 hardware as an aggregated multi‑die EPYC design (the EPYC 9V64H), configured as a single‑tenant node with SMT disabled by default, large on‑package HBM3 pools in the ~400–450 GB range, and headline STREAM‑like bandwidth numbers in the ~6.7–6.9 TB/s range depending on how they are measured. Phoronix’s early independent runs — matching software stacks across generations — found the HBM3 design delivers massive real‑world speedups on streaming and memory‑bound kernels.
Why HBv5 is a real architectural break: HBM3 on package
What changed vs. DDR5 systems
Traditional server designs use DDR5 DIMMs connected over memory channels on the motherboard. That model prioritizes capacity and flexible expansion, but bandwidth per socket is limited by channel width and DIMM configuration. HBv5 flips that tradeoff: it places HBM3 stacks adjacent to CPU chiplets on the same package, greatly increasing raw streaming bandwidth at the cost of per‑node capacity and per‑GB cost. The difference is not incremental — it moves per‑node bandwidth from hundreds of GB/s into the terabytes‑per‑second range.
Packaging and core count tradeoffs
Azure’s HBv5 nodes aggregate multiple custom EPYC dies (the EPYC 9V64H configuration used by Microsoft) to reach the per‑node HBM capacity and bandwidth targets. Documents and early reporting show per‑node core counts in the 352–368 Zen 4 core range (SMT off) and base/boost frequencies tuned for sustained throughput (roughly 3.5 GHz base, 4.0 GHz peak on non‑AVX workloads). Microsoft’s node topology exposes many NUMA domains and includes four 200 Gb/s InfiniBand rails (800 Gb/s total) to feed multi‑node MPI jobs.
Phoronix benchmarks and independent validation — what the tests show
Test methodology (why the comparison is credible)
Phoronix obtained early access to HBv5 nodes and ran a side‑by‑side comparison of the top‑end VM sizes for HBv2, HBv3, HBv4, and HBv5, using consistent software: Ubuntu 24.04 LTS, the Linux 6.14 kernel, XFS, and GCC 13.3. This matched‑stack approach is important — differences in compiler, kernel latency handling, or filesystem tuning can skew streaming tests. The Phoronix runs focused on a mix of synthetic streaming tests (STREAM‑like kernels) and representative HPC kernels such as CFD loops, molecular dynamics inner loops, and other memory‑bound code paths.
Headline empirical outcomes
- Memory‑bound kernels and STREAM‑style tests showed the largest gains: HBv5 frequently measured several times the throughput of HBv4 and an order of magnitude more than HBv2 in streaming workloads. The HBM3 pool eliminates the DDR5 channel bottleneck that throttled Genoa‑X configurations in high‑traffic streaming patterns.
- Compute‑bound workloads (heavy arithmetic, AVX‑dominated kernels) had mixed results: where vector width, instruction throughput, or large aggregated cache were the limiting factors, HBv5’s Zen 4 cores sometimes offered modest gains or parity rather than dominance. In a few cases, a Zen 5‑based server with wider AVX execution paths could be competitive when the workload is not memory bound.
- Determinism and lower jitter: HBv5’s single‑tenant, SMT‑off configuration reduces noisy‑neighbor effects, making scaling and repeatability better for production HPC jobs. That determinism is an operational advantage beyond raw throughput.
Evolution in context: HBv2 → HBv3 → HBv4 → HBv5
Generational snapshot
- HBv2 (2019): EPYC 7002 “Rome” derivatives, up to 120 vCPUs per VM, memory bandwidth ~350 GB/s per VM in vendor descriptions. Tuned for MPI scaling using HDR InfiniBand.
- HBv3 (2022): EPYC 7003 “Milan‑X” with 3D V‑Cache on selected SKUs, same high‑bandwidth focus but achieved via massive cache rather than HBM. Up to ~120 vCPUs for the top sizes, improvements in cache-bound workloads.
- HBv4 (2023): Genoa‑X with the EPYC 9V33X, higher core counts up to 176 vCPUs and greater aggregate compute, still DRAM‑based but tuned for CPU and cache improvements.
- HBv5 (2024/2025 GA): Custom EPYC 9V64H with HBM3, single‑tenant nodes, SMT off, up to 352–368 Zen 4 cores per server and ~400–450 GB HBM3, delivering ~6.7–6.9 TB/s STREAM‑class bandwidth.
Why HBv5 looks like a generational leap for particular workloads
Previous generations improved compute, cache, and single‑thread throughput. HBv5 instead changes the bottleneck itself: by dramatically increasing sustained memory bandwidth, workloads that previously stalled waiting for memory can now achieve throughput closer to the cores’ potential, producing time‑to‑solution reductions that are sometimes multiplicative. For many scientific kernels, that is far more valuable than raw peak FLOPS.
Technical deep dive: what the numbers mean in practice
Memory bandwidth: STREAM Triad and the sustained vs. peak distinction
Measured STREAM‑style bandwidth is the most load‑bearing metric here. Microsoft’s published numbers and Phoronix’s independent results report STREAM‑class sustained bandwidth in the 6.7–6.9 TB/s window across the aggregated HBM pool. Different sources report slightly different figures (6.7 TB/s vs 6.9 TB/s); this variance depends on the exact measurement (how many NUMA domains are included, whether the test fully saturates all rails, kernel tuning, and the specific STREAM variant used). Treat the 6.7–6.9 TB/s range as the realistic engineering target for streaming workloads.
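To make the sustained‑versus‑peak distinction concrete, the following is a minimal STREAM‑Triad‑style probe in C with OpenMP. It is a sketch rather than the official STREAM benchmark: the array size, compile flags, and the 24‑bytes‑per‑element accounting are illustrative assumptions, and the number you measure will depend heavily on NUMA‑aware first‑touch initialization and core pinning.

```c
/* triad.c: minimal STREAM-Triad-style bandwidth probe (a sketch, not the
 * official STREAM benchmark). Array size and byte accounting are illustrative.
 * Build (example): gcc -O3 -fopenmp -march=native triad.c -o triad */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1UL << 29)   /* 2^29 doubles per array (~4 GiB each), far beyond any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }
    const double scalar = 3.0;

    /* First-touch initialization in parallel so pages land on the NUMA
     * domain of the thread that will stream them later. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];              /* Triad: 2 loads + 1 store per element */
    double t1 = omp_get_wtime();

    double bytes = 3.0 * (double)N * sizeof(double);  /* read b, read c, write a */
    printf("Triad: %.1f GB/s\n", bytes / (t1 - t0) / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```

Running a probe like this once per NUMA domain and once across the whole node is a quick way to check whether thread placement is actually spreading traffic across the HBM stacks rather than saturating one domain.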
Capacity vs. bandwidth tradeoff
HBM3 gives stunning bandwidth but costs more per GB and scales less well in raw capacity than DDR5 DIMMs. Microsoft’s HBv5 design targets bandwidth‑saturated workloads by offering ~400–450 GB of HBM3 per node — enough for many HPC kernels but not a substitute for DDR‑backed huge‑memory jobs. For capacity‑limited applications (very large in‑memory datasets not dominated by streaming patterns), DDR‑based instances or other memory architectures may still be preferable.
Operational best practices and tuning for HBv5
Kernel, MPI, and NUMA guidance
Microsoft’s HBv5 documentation and Phoronix’s methodology note several practical optimizations that materially affect results:
- Use a recent, HPC‑tuned Linux distribution (AlmaLinux 8.10 or Ubuntu 22.04+/24.04 are validated images).
- Apply HPC tuned profiles, enable transparent huge pages (where beneficial), and ensure NUMA balancing is configured for your workload. Microsoft preserves NUMA visibility and exposes multiple NUMA domains per VM — correct process placement and NUMA‑aware allocations matter (a placement‑verification sketch follows this list).
- For multi‑node MPI: use InfiniBand transport tuning (UCX/DC for larger runs) and avoid multi‑rail oversubscription. Microsoft recommends specific UCX env vars for large‑scale jobs in the HBv5 guides.
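Before scaling up an MPI job, it helps to confirm that placement actually matches your intent. The sketch below is my own illustration, not taken from the HBv5 guides: a plain C MPI program in which each rank reports the host, CPU, and NUMA node it is running on. Launch it with the same binding options you plan to use for the production job and inspect the mapping.

```c
/* rank_map.c: placement-verification sketch (illustrative, not from Microsoft's
 * HBv5 guides). Each MPI rank reports the host, CPU, and NUMA node it runs on
 * so bindings can be checked before a production run.
 * Build (example): mpicc -O2 rank_map.c -o rank_map */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[256] = "unknown";
    gethostname(host, sizeof host);

    unsigned int cpu = 0, node = 0;
    getcpu(&cpu, &node);   /* glibc >= 2.29 wrapper; reports current CPU and NUMA node */

    printf("rank %4d of %4d  host=%s  cpu=%3u  numa_node=%u\n",
           rank, size, host, cpu, node);

    MPI_Finalize();
    return 0;
}
```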
Practical checklist before migrating workloads
- Profile your workload to identify whether it is memory‑bandwidth bound (STREAM‑like) or compute/cache bound (see the machine‑balance sketch after this checklist).
- If memory‑bandwidth bound, test HBv5 on a pilot job with identical software builds and data patterns.
- Tune MPI placement, core pinning, and NUMA affinity to maximize sustained throughput.
- Compare time‑to‑solution and cost‑per‑solution, not just per‑hour prices — HBv5’s faster time‑to‑solution can offset higher per‑hour costs for many jobs.
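For the first checklist item, a back‑of‑the‑envelope roofline check is often enough: compare the kernel’s arithmetic intensity (FLOPs per byte moved) with the machine balance (peak FLOP/s divided by sustained bandwidth). The sketch below uses assumed, illustrative peak figures for an HBv5‑class node, not vendor specifications; substitute a measured STREAM number and your profiler’s FLOP and traffic counts.

```c
/* balance_check.c: back-of-the-envelope memory-bound vs. compute-bound check.
 * All machine figures here are assumptions for illustration, not vendor specs. */
#include <stdio.h>

int main(void)
{
    /* Assumed HBv5-class node figures (illustrative only). */
    const double peak_flops = 352 * 3.5e9 * 16.0;     /* cores x clock x FLOP/cycle (assumed) */
    const double stream_bw  = 6.8e12;                 /* ~6.8 TB/s sustained, STREAM-class */
    const double balance    = peak_flops / stream_bw; /* FLOPs per byte the node can "afford" */

    /* Example kernel: STREAM Triad, a[i] = b[i] + s*c[i]:
     * 2 FLOPs per element, 24 bytes moved per element. */
    const double kernel_ai = 2.0 / 24.0;

    printf("machine balance  : %.2f FLOP/byte\n", balance);
    printf("kernel intensity : %.3f FLOP/byte\n", kernel_ai);
    printf("verdict          : %s\n",
           kernel_ai < balance ? "memory-bandwidth bound (HBv5 likely helps)"
                               : "compute bound (HBM3 alone may not help)");
    return 0;
}
```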
Who should adopt HBv5 — use‑case guidance
- Best fit:
- Computational Fluid Dynamics (CFD), explicit finite element analysis, and other streaming‑memory heavy simulations.
- Inner loops of molecular dynamics and particle codes that read/write large arrays at high rates.
- Legacy CPU‑based HPC codes where rewriting for accelerators is impractical and where memory bandwidth is the present bottleneck.
- Less ideal:
- Workloads that need huge aggregated memory capacity beyond the HBM pool.
- Vector‑heavy kernels that would benefit more from Zen 5’s full 512‑bit AVX execution or from GPU/accelerator offload.
- Windows Server workloads — HBv5 currently lists Linux HPC distributions as the supported guest OS family.
Costs, availability, and vendor considerations
Pricing and single‑tenant design
HBv5 is intentionally single‑tenant and SMT‑disabled to maximize determinism. That means fewer VMs per physical server and a different cost profile than multi‑tenant instances. For many HPC shops, the more important metric is cost‑per‑solution (time × hourly rate), and early reports show HBv5 can offer excellent value for bandwidth‑limited problems thanks to much faster runtimes. Still, teams should benchmark their actual workloads and use conservative on‑demand pricing or negotiated reserved pricing to model costs.
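As a hedged illustration of the cost‑per‑solution point (the hourly prices and runtimes below are made‑up placeholders, not Azure list prices or measured results), the arithmetic shows why a higher hourly rate can still win once runtime shrinks:

```c
/* cost_per_solution.c: illustrative only; hourly prices and runtimes are
 * hypothetical placeholders, not Azure list prices or measured results. */
#include <stdio.h>

typedef struct {
    const char *name;
    double price_per_hour;   /* USD per hour (hypothetical) */
    double runtime_hours;    /* hours per job (hypothetical) */
} Option;

int main(void)
{
    const Option options[] = {
        { "DDR-based instance (hypothetical)",  40.0, 10.0 },
        { "HBv5-style instance (hypothetical)", 75.0,  3.0 },
    };

    /* 10 h x $40/h = $400 vs. 3 h x $75/h = $225: the faster node wins on
     * cost-per-solution despite the higher hourly rate. */
    for (int i = 0; i < 2; i++)
        printf("%-36s cost per solution: $%.2f\n",
               options[i].name, options[i].price_per_hour * options[i].runtime_hours);
    return 0;
}
```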
Exclusivity and lock‑in questions
HBv5’s EPYC 9V64H is a custom Azure‑only design. That exclusivity gives Microsoft and Azure a competitive differentiation in cloud HPC but also raises questions for organizations that want the same hardware on‑premises or in other clouds. Teams planning long‑term infrastructure roadmaps should weigh the benefits of faster time‑to‑solution against dependence on a single cloud vendor for specific hardware capabilities.
Risks, limits, and caveats
Not a universal "faster everything" solution
HBv5 dramatically rebalances the tradeoffs in favor of bandwidth. But that does not automatically make every HPC workload faster. Phoronix’s independent measurements showed workload‑dependent outcomes: streaming kernels often saw dramatic gains, but compute‑bound and large‑cache workloads could see smaller or mixed improvements. This nuance matters for procurement and migration decisions.
Software and ecosystem readiness
- Some legacy or vendor‑supplied HPC stacks may make assumptions about memory topology, default NUMA layouts, or driver ecosystems that need tuning for HBv5’s multi‑die, multi‑NUMA structure. Expect some engineering lift for production validation.
- Windows Server is not supported on HBv5; these are Linux‑first HPC nodes. Teams that depended on Windows‑based workflows will need to plan migrations or use hybrid architectures.
Thermal/power and supply considerations
High‑bandwidth memory and dense die stacking increase power density and thermal complexity. Microsoft’s single‑tenant design and frequency choices are engineering responses to those constraints, but customers should be mindful that some sustained heavy workloads will drive consistent power/thermal draw that impacts long‑running job scheduling and cost control. Public documentation and early reporting indicate Microsoft tuned core clocks and reserved hypervisor cores explicitly to balance performance and stability.
Practical recommendations for IT teams and HPC groups
- Measure first: run a representative microbenchmark (STREAM triad or your workload’s tight inner loops) and a full production job on HBv5 pilot nodes before any large migration.
- Tune for NUMA: use lstopo and pin processes to physical cores/NUMA domains, and follow Microsoft’s recommended MPI/UCX settings for multi‑rail InfiniBand.
- Cost modeling: calculate time‑to‑solution × hourly cost and include engineering migration costs — HBv5 often wins on solution time and may beat cheaper per‑hour instances once end‑to‑end runtime is considered.
- Maintain portability: where possible, keep performance‑portable code and benchmarks so you can compare HBv5 to on‑prem and other cloud offerings without large rewrites.
- Validate numerical fidelity: faster memory and concurrency can change rounding and ordering effects in parallel reductions; verify that results remain scientifically valid at scale.
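To see why the last point matters, the short sketch below (my own illustration, not taken from the source) sums the same array left‑to‑right, right‑to‑left, and with an OpenMP reduction; the three results agree only to within rounding, and the parallel result shifts as thread counts and reduction trees change.

```c
/* fp_order.c: illustration that summation order perturbs floating-point
 * results; forward, reverse, and OpenMP-reduction sums of the same data
 * agree only to within rounding.
 * Build (example): gcc -O2 -fopenmp fp_order.c -o fp_order */
#include <stdio.h>

#define N 10000000

static double x[N];   /* static keeps the large array off the stack */

int main(void)
{
    for (int i = 0; i < N; i++)
        x[i] = 1.0 / (i + 1.0);          /* values spanning many magnitudes */

    double fwd = 0.0, rev = 0.0, par = 0.0;

    for (int i = 0; i < N; i++)          /* left-to-right */
        fwd += x[i];
    for (int i = N - 1; i >= 0; i--)     /* right-to-left */
        rev += x[i];

    #pragma omp parallel for reduction(+:par)   /* order depends on thread count */
    for (int i = 0; i < N; i++)
        par += x[i];

    printf("forward : %.17g\n", fwd);
    printf("reverse : %.17g\n", rev);
    printf("parallel: %.17g\n", par);
    return 0;
}
```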
Critical analysis — strengths, blind spots, and strategic implications
Major strengths
- Transformational bandwidth: HBv5’s HBM3 pool moves sustained memory bandwidth into the terabyte range, unlocking genuine time‑to‑solution improvement for many streaming kernels.
- Deterministic single‑tenant operation: SMT off and single VM per server removes noisy neighbor variability, which is vital for reproducible production HPC pipelines.
- Balanced system design: pairing HBM3 with high‑speed InfiniBand and large NVMe gives a practical platform for scaling MPI jobs beyond a single node.
Potential risks and blind spots
- Capacity limitations: HBM3 remains expensive and limited in absolute capacity relative to DIMM banks — not every memory‑heavy problem will fit or benefit.
- Workload specificity: performance gains are highly workload dependent; compute‑bound codes and those optimized for cache hierarchies may not see the headline wins.
- Vendor exclusivity: the custom EPYC design is exclusive to Azure. That creates a vendor lock‑in potential for teams that adopt HBv5‑specific tuning and workflows.
- Migration engineering: realistic adoption often requires significant tuning, MPI placement adjustments, and validation to translate synthetic benchmark wins into application‑level speedups.
Conclusion — where HBv5 matters most and how to act
Azure HBv5 is a purposely engineered platform: it accepts higher per‑GB memory cost and lower raw capacity in exchange for an unprecedented increase in sustained memory bandwidth and deterministic performance for HPC workloads. For organizations running streaming, memory‑bandwidth‑bound jobs that are expensive to rewrite for accelerators, HBv5 can be transformational — delivering multiplicative reductions in time‑to‑solution and materially lowering cost‑per‑experiment when properly tuned. For compute‑bound or massive‑capacity workloads, the benefits are more mixed and require careful validation.
Actionable path forward:
- Profile your application to confirm whether memory bandwidth is the bottleneck.
- Pilot HBv5 with matched software stacks and performance tests (use the same compilers, kernels, and MPI settings Phoronix used as a baseline).
- Evaluate time‑to‑solution and total cost, factoring in engineering migration effort.
- If migrating, invest in NUMA‑aware placement, UCX/InfiniBand tuning, and repeated validation runs to ensure both performance and numerical fidelity.
Source: Phoronix The Incredible Evolution Of AMD EPYC HPC Performance Shown In The Azure Cloud - Phoronix