Storage First Pi Record: 314 Trillion Digits on a Single Dell R7725

StorageReview’s lab has reset the high-water mark for brute-force numerical computing: the team computed π to 314 trillion digits using a single 2U server. The nearly four‑month uninterrupted run makes a pointed argument about where real-world bottlenecks live in extreme-scale computations, and why storage bandwidth — not just raw CPU FLOPS — often determines success. StorageReview’s write-up details the hardware, software tuning, and trade-offs that let a single Dell PowerEdge R7725 outperform prior multi-node and cloud-backed efforts, and independent coverage confirms both the headline and the configuration.

Background / Overview

Historically, pushing π to ever-larger digit counts started as a CPU and floating‑point bragging right, but the contest has steadily migrated into the realm of system architecture. As the digit counts moved from billions to trillions, the runtime behavior of multi-precision libraries and checkpointing logic forced builders to reckon with memory capacity, NUMA topology, and — critically — storage I/O and throughput.
StorageReview’s 314T run demonstrates that at the scale of hundreds of trillions of digits, the limiting resource is frequently the ability to sustain continuous, high‑bandwidth read/write patterns without stalling the large number of CPU cores performing multi-precision arithmetic. The team deliberately chose a single‑server approach to prove a point: with enough local NVMe bandwidth and careful system tuning, a single well-engineered machine can beat distributed clusters for this workload in both speed and energy efficiency.

StorageReview’s findings have been independently reported and summarized by other outlets, which picked up the key technical claims: a run that finished in approximately 110 days, used 40 × 61.44 TB Micron 6550 Ion NVMe SSDs, dual 192‑core AMD EPYC processors (384 physical cores total), 1.5 TB of DDR5, and a tuned Ubuntu Server stack. Those same reports reiterate StorageReview’s central thesis: storage bandwidth and predictable I/O matter more than sheer CPU count for y‑cruncher at this scale.

The record run — hardware and software in brief

Platform and raw capacity

  • Server: Dell PowerEdge R7725 (2U), 40‑bay Gen5 E3.S backplane.
  • CPUs: Dual AMD EPYC 9965 (advertised as 192 cores per CPU, 384 cores total) driving heavy parallel multi‑precision arithmetic.
  • System memory: 1.5 TB DDR5 — a generous working set to reduce CPU stalls.
  • Storage: 40 × Micron 6550 Ion Gen5 NVMe drives, 61.44 TB each — ≈2.46 PB raw in total; StorageReview reports 34 drives allocated to y‑cruncher for swap and scratch and 6 drives used for a RAID10 output volume.
The central architectural note: the R7725 backplane connects each drive directly to CPU PCIe lanes rather than routing drives through a backplane PCIe switch. Each drive therefore receives a deterministic 2–4 PCIe lanes depending on configuration, and StorageReview observed aggregate sustained read/write capability in the hundreds of GB/s — peaks of roughly 280 GB/s when all 40 bays were exercised in parallel. This bandwidth figure underpins the entire result: y‑cruncher’s multi‑terabyte checkpoints and scratch traffic are extremely sensitive to sustained sequential throughput.
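As a back-of-envelope check on that figure, the direct-connect topology’s theoretical ceiling can be estimated from per-lane Gen5 throughput. The ~3.94 GB/s usable per lane and the 2-lanes-per-drive allocation are assumptions for illustration, not figures from the article:

```python
# Back-of-envelope: theoretical aggregate bandwidth of a direct-connect
# 40-bay Gen5 backplane vs the ~280 GB/s peak StorageReview observed.
# Assumptions: ~3.94 GB/s usable per PCIe Gen5 lane (32 GT/s with
# 128b/130b encoding), 2 lanes per drive (lower bound of the 2-4 cited).

GEN5_GBPS_PER_LANE = 3.94   # approximate usable GB/s per Gen5 lane
LANES_PER_DRIVE = 2         # assumed lane allocation per bay
DRIVES = 40

ceiling = GEN5_GBPS_PER_LANE * LANES_PER_DRIVE * DRIVES
observed = 280.0            # GB/s peak aggregate reported by StorageReview

print(f"theoretical ceiling: {ceiling:.0f} GB/s")
print(f"observed peak:       {observed:.0f} GB/s "
      f"({observed / ceiling:.0%} of ceiling)")
```

Even at the conservative two-lane assumption, the observed peak sits close to the theoretical ceiling — exactly what a contention-free, direct-connect backplane should deliver.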

Software and tuning

  • Application: y‑cruncher v0.8.6.9545 using the Chudnovsky algorithm, the contemporary choice for record π computations.
  • OS: Ubuntu 24.04.2 LTS Server (StorageReview reported moving from Windows Server to Ubuntu yielded measurable I/O stability gains).
  • Cooling: OEM air cooling swapped for a CoolIT AHx10 (liquid cooling CDU, cold plates) to maintain higher steady-state clocks and stable thermal conditions — a measurable win for multi‑month sustained throughput.
  • NUMA and kernel tuning: StorageReview reserved a handful of cores for system tasks, tuned NUMA placement and the scratch array layout to match y‑cruncher’s IO patterns, and tuned filesystem and block parameters to avoid stalls under massive streaming workloads.

Run metrics (headline numbers)

  • Total digits: 314,000,000,000,000 (314 trillion).
  • Wall time: 110 days (≈101.8 compute days) of uninterrupted operation.
  • Disk activity: StorageReview reports logical totals of ~132 PiB read and ~112 PiB written during the run; peak logical scratch usage reached ~1.43 PiB. SMART counters registered ~7.3 PB written per drive (massive write amplification and endurance load).
Two important cross-checks: Tom’s Hardware summarized the StorageReview coverage and echoed the same hardware and runtime numbers, and other tech-press articles picked up the energy-efficiency framing, contrasting this single‑server run with previous multi‑node efforts. That independent consistency strengthens confidence in StorageReview’s reported metrics.
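The reported totals also allow a rough sanity check of the endurance load. Assuming the ~112 PiB of logical writes landed mostly on the 34 scratch drives and were spread roughly evenly — an assumption for illustration, not something the article states — the SMART totals imply roughly 2× write amplification:

```python
# Rough write-amplification estimate from the run's reported totals.
# Assumption: the ~112 PiB of logical writes are carried almost
# entirely by the 34 scratch drives, spread roughly evenly; SMART
# reports ~7.3 PB written per drive (decimal petabytes).

PIB = 2**50          # bytes in one binary PiB
PB = 10**15          # bytes in one decimal PB

logical_written = 112 * PIB          # logical writes over the run
physical_written = 34 * 7.3 * PB     # SMART totals across scratch drives

wa = physical_written / logical_written
print(f"estimated write amplification: {wa:.1f}x")
```

A factor near 2× under continuous sequential streaming is plausible for heavily loaded NAND, and it explains why the per-drive endurance numbers climbed so fast.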

Why storage bandwidth matters more than you expect

The IO-dominated regime of extreme π calculations

At small digit counts, π computations are limited by CPU instruction throughput and floating-point speed. At hundreds of trillions of digits, the algorithm (Chudnovsky as implemented by y‑cruncher) needs terabytes to petabytes of scratch and checkpoint files to hold intermediate multi‑precision blocks. The pattern is heavy sequential streaming with periodic large checkpoints — a workload that punishes:
  • high-latency metadata operations,
  • storage systems that sacrifice sustained sequential throughput for random IOPS,
  • architectures that route many NVMe devices through a limited PCIe switch, which creates per‑drive contention.
The lesson StorageReview demonstrates is blunt: if your storage cannot sustain large, balanced read and write throughput without stalls, no amount of CPU parallelism will hide the I/O bottleneck. The project’s engineers designed a single-server topology that maximized direct PCIe connectivity to drives, reduced interconnect complexity, and tuned firmware/OS stacks for continuous streaming. The result was a faster wall-clock time and lower energy bill compared with a distributed cluster approach.
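The article’s own totals put averages on that claim: ~132 PiB read over ~101.8 compute days works out to a continuous read rate in the tens of GB/s, sustained for months. A simple arithmetic sketch using the published figures:

```python
# What "IO-dominated" means in averages: the run's ~132 PiB of reads
# over ~101.8 compute days implies a continuous read rate that must
# never stall. Both figures are from the article; the arithmetic is
# purely illustrative.

PIB = 2**50
total_read = 132 * PIB                  # bytes read during the run
compute_seconds = 101.8 * 24 * 3600     # ~101.8 compute days

avg_read_gbps = total_read / compute_seconds / 1e9
print(f"average sustained read: {avg_read_gbps:.1f} GB/s, held for months")
```

Nearly 17 GB/s is only the average; peaks are far higher, which is why the ~280 GB/s aggregate headroom matters.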

Direct-connect vs switched backplanes

Dell’s newer 17th‑generation PowerEdge backplanes reverted to direct PCIe connections per drive rather than a shared switch fabric on the backplane. That design gives each bay a deterministic lane allocation (2–4 lanes per SSD in this configuration), which on a system with sufficient CPU PCIe lanes can yield stunning aggregate bandwidth when all bays are saturated. StorageReview measures that advantage in raw GB/s improvements over earlier switch‑backplane systems and explicitly credits that change as a key enabler for the single‑server record. This is a crucial hardware nuance that matters to anyone planning large scratch workloads: architecture of the drive backplane is not an afterthought.

Strengths: why the single‑server approach is compelling

  • Simplicity and predictability: A single physical server eliminates network jitter and distributed checkpoint complexity. That reduces failure modes and simplifies NUMA / filesystem tuning.
  • Superior energy efficiency at scale: StorageReview reports ~4,305 kWh consumed across the run, translating to a lower kWh per trillion digits metric compared to prior large-cluster efforts. The single-node setup cut both the energy and operational complexity.
  • Cost and operational footprint: Running one 2U chassis is materially cheaper than operating a multi-rack cluster with dedicated networking and rack-scale power/cooling for the same task.
  • Demonstrates a real‑world engineering pattern: Many long-running scientific workloads rely on balanced I/O rather than raw core count. The π run highlights patterns that map directly to climate models, genomics pipelines, and certain AI workloads — namely, that balanced local bandwidth and low-latency IO are as important as CPU throughput.
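The efficiency framing follows directly from the published figures; a quick calculation of energy per trillion digits and average power draw over the 110-day wall clock:

```python
# Energy framing from the article's numbers: ~4,305 kWh consumed
# across a 110-day wall-clock run that produced 314 trillion digits.

ENERGY_KWH = 4305
DIGITS_TRILLIONS = 314
WALL_HOURS = 110 * 24

print(f"{ENERGY_KWH / DIGITS_TRILLIONS:.1f} kWh per trillion digits")
print(f"{ENERGY_KWH / WALL_HOURS:.2f} kW average draw")
```

An average draw of roughly 1.6 kW for the whole record — about what a single high-end workstation pulls under load — is the concrete basis of the single-server efficiency claim.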

Risks, caveats, and contested points

Single point of failure and resiliency trade-offs

StorageReview explicitly ran the scratch array in JBOD (no redundancy) and accepted the risk because restarting a failed computation from scratch would have been expensive in both time and drive wear. They protected the final output with a software RAID10 volume, but the scratch array itself had no redundancy. That approach is defensible for a record attempt — it minimizes overhead and maximizes throughput — but it is a deliberate trade-off: any undetected silent error, firmware quirk, or controller hiccup during the 110‑day run could have forced a full restart. This is a practical and reproducible risk profile, but it is not a pattern to emulate for production jobs where data loss is catastrophic.
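To make the gamble concrete, here is a rough model of the chance that a scratch-drive failure forces a restart. The 0.5% annualized failure rate (AFR) is a hypothetical placeholder chosen for illustration, not a figure from the article:

```python
# Why JBOD scratch was a calculated bet: a rough probability that at
# least one of the 34 unprotected scratch drives fails during the
# 110-day run. AFR = 0.5% is a hypothetical placeholder, and the model
# assumes independent, uniformly distributed failures.

AFR = 0.005                      # hypothetical annualized failure rate
DRIVES = 34
RUN_DAYS = 110

p_drive = AFR * RUN_DAYS / 365            # per-drive failure probability
p_any = 1 - (1 - p_drive) ** DRIVES       # P(at least one failure)
print(f"P(scratch-drive failure during run) ~ {p_any:.1%}")
```

Under these assumed inputs the restart risk is on the order of a few percent — a reasonable bet for a record attempt, and exactly the kind of number a production operation would refuse to accept.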

Drive endurance and operational stress

The reported per‑drive writes (SMART numbers) are staggering: multiple petabytes written per SSD. Even enterprise‑grade Gen5 NAND devices endure heavy wear under continuous streaming, and the endurance and telemetry profiles reported by StorageReview suggest that this workload pushes drives toward their endurance limits. For labs or enterprises considering similar runs, be prepared for accelerated device retirement and stringent monitoring.

Validity of “world record” claims and comparability

“World record” language in π computation is mostly community‑driven, and different runs have different validation approaches (checksum verification, third-party attestations, and reproducibility). StorageReview completed checks and published internal verification data; independent press corroborated the numbers. Still, comparisons across runs can be apples-to-oranges if one run emphasizes redundancy or external verification. The technical community typically accepts StorageReview’s reporting as credible, but the definitional nuance should be kept in mind when comparing records.

Environmental and opportunity-cost critique

Even though StorageReview’s run is more energy efficient per digit than prior approaches, a societal-level critique about the value of computing enormous numbers of π digits remains. The technical counterpoint holds that the exercise functions as a stress test for systems and surfaces practical engineering lessons for long-running scientific computations. The value proposition therefore sits in infrastructure learning rather than immediate mathematical utility. That framing is important when justifying the energy and component cost — something StorageReview highlights in its analysis.

Practical advice for engineers planning a similar run

  1. Hardware selection
    • Prioritize platforms with direct PCIe lanes to drive bays (avoid backplane PCIe switch contention if you can).
    • Choose high-endurance, enterprise NVMe devices with plenty of overprovisioning and stable firmware.
  2. Storage topology
    • Dedicate a large set of drives to scratch/swap and isolate final outputs on a resiliency volume (RAID10 or similar).
    • Consider filesystem and block-layer tuning for large sequential streams (queue depths, I/O scheduler, large block sizes).
  3. Cooling and thermals
    • Plan for continuous high-power states: liquid cooling or high-capacity CDUs pay dividends in sustained clock stability.
  4. OS and kernel tuning
    • Use a modern Linux server kernel tuned for heavy sequential IO (StorageReview used Ubuntu 24.04.2 and observed better I/O behavior vs Windows Server in their tests).
    • Reserve a handful of cores for system tasks; use NUMA-aware placement.
  5. Monitoring and checkpoints
    • Implement robust drive telemetry monitoring and frequent verified checkpoints to minimize restart costs.
    • Plan your restart strategy and runbook in advance; decide your tolerance for JBOD vs redundancy.
  6. Endurance and spare parts
    • Factor accelerated wear into procurement budgets; have spare identical drives and a tested swap procedure.
  7. Validation
    • Use cryptographically verifiable checksums and publish verification artifacts to ensure the broader community can validate your result.
These steps reflect StorageReview’s approach and the lessons they publicized from the 314T run. They form a practical checklist for anyone exploring long, IO-heavy numerical jobs.
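For the validation step, a minimal sketch of publishing a verifiable digest might look like the following. The chunked-hash helper and the idea of hashing the output volume file by file are illustrative, not StorageReview’s actual tooling:

```python
# Minimal sketch of step 7 (validation): stream a large output file
# through SHA-256 in fixed-size chunks so the digest can be published
# alongside the result. The helper name and chunk size are illustrative.

import hashlib

def file_sha256(path: str, chunk_size: int = 16 * 1024 * 1024) -> str:
    """Compute the SHA-256 hex digest of a file in 16 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Publishing digests like this alongside the output lets third parties verify a multi-terabyte result byte-for-byte without re-running the computation.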

Broader implications for HPC, AI, and storage engineering

Storage-first thinking matters for some HPC workloads

Many HPC planners still prioritize CPU count and interconnect for distributed compute. StorageReview’s experiment underlines that balanced system design is necessary: when workloads are checkpoint- and scratch-dominated, local, high-bandwidth NVMe arrays and predictable PCIe topologies can beat larger clusters that rely on shared storage fabrics.
This lesson extends to training and data‑preprocessing phases in AI pipelines where models or datasets force large streaming IO: careful co-design of compute, memory, and storage topologies can yield faster time-to-solution and lower energy budgets.

Where cloud and shared storage still win

Cloud and shared-storage clusters retain advantages in redundancy, ease of scaling, and operational safety. The Linus Media Group + KIOXIA run to 300 trillion digits and other cloud-backed projects showed those platforms’ strengths: the ability to pool many devices and live-migrate jobs, with mature redundancy and operational frameworks. For organizations that require high availability and minimal single-point-of-failure risk, distributed approaches will remain preferable despite potential energy or wall-time penalties. StorageReview’s single‑server run is a complementary datapoint, not a wholesale replacement for distributed design patterns.

What’s verifiable — and where to be cautious

  • Verifiable: the run’s headline numbers (314 trillion digits, 110 days, Dell PowerEdge R7725, dual EPYC 9965 CPUs, 40 × Micron 6550 Ion NVMe, Ubuntu 24.04.2) are reported directly by StorageReview and corroborated by independent press coverage. Those figures are reproducible claims and are supported by published metrics (I/O totals, SMART counters, checkpoints).
  • Caution: architectural comparisons that imply one approach is always better (single server vs cluster) depend heavily on workload assumptions around restart policy, acceptable risk, and total cost of ownership. StorageReview’s approach traded redundancy for throughput; that trade-off must be explicit in any reproduction or extrapolation. Also, small differences in backplane revisions, CPU PCIe lane counts, or SSD firmware can materially change achievable aggregate I/O — these platform-level differences are often hard to reproduce exactly across vendors and generations.

The takeaways — why this matters to WindowsForum readers

  • For enthusiasts who build high‑throughput, long‑running workloads, this run is a clear reminder: storage architecture belongs at the front of design discussions, not as an afterthought. Properly provisioned NVMe, direct PCIe topologies, and conservative firmware choices yield outsized wins on streaming-heavy jobs.
  • For sysadmins and administrators planning large scratch workloads, the record demonstrates the practical efficiency gains from careful NUMA, OS, and cooling tuning — all measurable and actionable.
  • For infrastructure architects, the result illustrates that different workloads demand different design patterns: the cloud is not universally optimal, and local, highly tuned appliances can beat distributed systems for certain IO-bound problems.
StorageReview’s experiment is less about the vanity of digits and more about what works when you push a workload past trivial scale: design for the real bottleneck, monitor and protect your critical outputs, and make explicit trade-offs about redundancy, energy and component lifetime. The experiment’s public metrics give practitioners a concrete starting point to plan similar efforts or to re-evaluate storage-first architectures for production systems.

Conclusion

The 314 trillion‑digit π run is both an engineering feat and a practical lesson: at extreme scale, storage bandwidth and stability determine success as surely as CPU cores and clock speeds. StorageReview’s single‑server approach shows that with the right hardware topology, NVMe density, cooling, and kernel tuning, a compact, energy-efficient system can outperform sprawling clusters for a class of IO-bound workloads. The broader message for system builders is clear — balance matters, and when you push the envelope, the simplest architecture that removes network and distributed complexity can be the most effective.
Community discussion and follow-up experiments are already gathering in technical forums and review sites, underscoring how much practical value the π‑race still delivers: new insights into storage firmware behavior under continuous load, drive endurance telemetry, and the impact of backplane architecture on aggregated PCIe performance. For practitioners, those conversations are the next phase — validating, refining, and applying the lessons from StorageReview’s 314T run to real-world science and engineering problems.

Source: Tom's Hardware https://www.tomshardware.com/pc-com...etakes-the-crown-thanks-to-storage-bandwidth/
 
