Microsoft NSDI ’26 Papers Reveal How Azure Will Scale AI: Network, Memory, Security

Microsoft said on May 5, 2026, that 11 papers by its researchers and collaborators were accepted at NSDI ’26, the USENIX Symposium on Networked Systems Design and Implementation taking place May 4–6 in Renton, Washington. The announcement is not merely about academic bragging rights. It is a map of where Microsoft thinks cloud infrastructure is breaking under the pressure of AI, and where the next round of platform advantage will be won. The striking theme is that “the network” is no longer a cable-and-switch problem; it is now a scheduling, memory, security, inference, offload, and operations problem wrapped around every cloud service Microsoft sells.

Microsoft Is Showing the Machinery Behind the AI Boom

The public conversation about AI infrastructure still tends to orbit GPUs, power contracts, and datacenter real estate. Those are the visible bottlenecks, and they are expensive enough to dominate earnings calls. But NSDI is where the less photogenic machinery gets its moment: cache reuse, live migration, network offload, collective communication, optical fault recovery, and tenant isolation.
Microsoft’s NSDI ’26 slate reads like a company trying to make the AI cloud less magical and more industrial. The papers span datacenter networks, wide-area networking, AI systems, cloud infrastructure, video analytics, eBPF safety, and memory disaggregation. That breadth matters because the hyperscale cloud is no longer a stack of separable layers. A change in model serving ripples into memory pressure; a change in memory architecture changes network design; a change in network offload changes security boundaries.
This is the unglamorous side of the AI race. Microsoft can buy accelerators, lease power, and market Copilots, but its durable advantage depends on whether Azure can squeeze more useful work out of every watt, every port, every server, and every packet. NSDI is not the stage where Microsoft announces a consumer product. It is where the company previews the tricks that make those products affordable to run.

The AI Papers Are Really About Waste​

DroidSpeak, one of the Monday papers, targets a quiet inefficiency in LLM serving: duplicated work across related models. The idea is that LLMs with the same architecture can share and partially reuse KV caches across model variants, improving throughput and response time with minimal impact on output quality. Microsoft’s summary claims up to 4 times higher throughput, which is the kind of number that makes infrastructure people stop reading and start asking implementation questions.
The significance is not just that a single inference technique is faster. It is that model customization has created a fleet-management problem. Enterprises want fine-tuned, specialized, governed, and differentiated models, but cloud economics strongly prefer standardization. If cache reuse can bridge that gap, Microsoft gets a way to support model variety without paying full freight for every variant.
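The shape of the idea can be pictured with a toy sketch. Everything below is invented for illustration (the layer counts, the model names, the assumption that only the top layers diverge between fine-tuned variants); it mimics only the gist of Microsoft's summary, not DroidSpeak's actual mechanism: reuse KV state computed once for the base model and recompute only what a variant changes.

```python
# Toy sketch of partial KV-cache reuse across fine-tuned variants of one base
# architecture, loosely in the spirit of the DroidSpeak summary. The layer
# split, model names, and "KV" strings are illustrative assumptions.

NUM_LAYERS = 8
REUSABLE_LAYERS = 6   # assume only the top layers diverge between variants

def compute_kv(model, prompt, layers):
    """Stand-in for running prefill over the given layers."""
    return {i: f"{model}:L{i}:{prompt}" for i in layers}

def prefill_with_reuse(variant, prompt, base_kv):
    """Reuse the base model's KV for lower layers; recompute only the rest."""
    reused = {i: base_kv[i] for i in range(REUSABLE_LAYERS)}
    fresh = compute_kv(variant, prompt, range(REUSABLE_LAYERS, NUM_LAYERS))
    return {**reused, **fresh}, len(fresh)

prompt = "Summarize the incident report."
base_kv = compute_kv("base", prompt, range(NUM_LAYERS))   # full prefill, once
variant_kv, recomputed = prefill_with_reuse("variant-a", prompt, base_kv)
# Only 2 of 8 layers had to be recomputed for the variant.
```

The hard part the paper actually addresses is deciding which layers are safe to share without degrading output quality; that selection, not the cache plumbing, is where the research lives.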
ForestColl attacks a different AI bottleneck: collective communication. Modern accelerator clusters are not merely computing; they are constantly broadcasting, aggregating, synchronizing, and waiting. ForestColl constructs broadcast and aggregation spanning trees as the communication schedule, claiming theoretical optimality and support for both switching fabrics and direct accelerator connections. That matters because the “GPU shortage” is often partly a “GPU waiting on the network” shortage.
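The spanning-tree idea itself is easy to sketch, even though ForestColl's construction and optimality machinery are far more involved. A minimal illustration, with a made-up four-GPU topology:

```python
# Illustrative sketch: deriving a broadcast schedule from a spanning tree of
# the accelerator topology, the general idea behind tree-based collectives.
# The four-GPU topology and schedule format are invented, not ForestColl's.

from collections import deque

def bfs_spanning_tree(adj, root):
    """Return parent pointers of a BFS spanning tree rooted at `root`."""
    parent, q = {root: None}, deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    return parent

def broadcast_schedule(parent):
    """(sender, receiver) pairs: each node receives from its tree parent."""
    return [(p, n) for n, p in parent.items() if p is not None]

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}   # 2x2 torus-like fabric
sched = broadcast_schedule(bfs_spanning_tree(adj, root=0))
# Three transfers cover the whole pod; aggregation runs the same tree in reverse.
```

The research problem is choosing trees whose edges saturate the fabric's actual bandwidth rather than merely reaching every node, which is where the optimality claims come in.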
The AI story continues with AVA, an open-ended video analytics system that combines event knowledge graphs with agentic retrieval over vision-language models. Microsoft’s write-up says the authors introduce AVA-100, a benchmark built from eight videos exceeding 10 hours each, with 120 manually annotated question-and-answer pairs, and that AVA reaches 75.8 percent accuracy. That is not just a computer vision result; it is another sign that enterprise AI workloads are moving from clean demos to messy, long-running, real-world data streams.

The Datacenter Network Is Becoming a Computer​

The most consequential cloud networking work often looks, from a distance, like a demotion of the traditional network. Switches matter, but they are no longer the only place where networking intelligence lives. SmartNICs, programmable pipelines, direct accelerator links, disaggregated memory fabrics, and offload systems are turning the datacenter network into a distributed computer in its own right.
Octopus is an example of this shift. The paper proposes a switch-free design for disaggregated memory pods, aimed at reducing cost and scaling to multi-rack deployments. On a three-server prototype, Microsoft says Octopus RPCs were 3.2 times faster than in-rack RDMA and 2.4 times faster than CXL switches. The claim is narrow enough to be technical, but the implication is broad: memory architecture is becoming a network architecture problem.
That is important because AI and cloud services are both memory-hungry in ways that conventional server design handles poorly. If memory can be pooled, shared, or disaggregated efficiently, operators can reduce stranded capacity and reshape server design around workload demand rather than fixed machine boundaries. But every move in that direction increases dependence on low-latency, predictable, failure-aware networking.
Pyrocumulus, a live-migration system for storage-optimized VMs, pushes in a similar direction. It uses FPGA SmartNIC capabilities and a live-migration protocol to reduce overhead and migration time. Live migration is one of those cloud primitives users rarely think about until it fails. For operators, it is a lever for maintenance, resilience, load balancing, and hardware refresh. Faster migration means more freedom to move workloads without turning every infrastructure change into a customer-visible event.

SONiC DASH SmartSwitch Is the Paper With Product Gravity​

The most product-adjacent item in Microsoft’s NSDI list is the Community Award-winning paper on SONiC DASH SmartSwitch. Microsoft describes it as a redesign of cloud network offloading using a hardware-friendly pipeline, a unified switch architecture, and an open development model. More importantly, the company says it is deployed at scale in Azure.
That last phrase changes the temperature. Plenty of systems papers propose elegant architectures. Far fewer describe something already carrying production traffic in one of the world’s largest clouds. When Microsoft says SONiC DASH SmartSwitch improves throughput, connection capacity, power efficiency, and space efficiency, it is talking about benefits that matter directly to Azure margins and datacenter build-out.
SONiC has long been one of Microsoft’s more strategically interesting infrastructure bets because it converts networking from a closed appliance model into a more open, software-driven ecosystem. DASH, which focuses on offloading cloud network services, extends that logic into the data plane demands of the multi-tenant cloud. The SmartSwitch framing suggests Microsoft wants more cloud service processing done closer to the packet path, but without turning every deployment into a custom hardware science project.
For WindowsForum readers, the consumer connection is indirect but real. Azure’s ability to offload, isolate, and route traffic efficiently affects the cost and performance envelope of Microsoft 365, Xbox services, GitHub, Windows cloud management, Copilot, and enterprise workloads. The PC may sit at the edge, but the user experience increasingly depends on infrastructure decisions buried inside Azure racks.

LLMs Are Entering the Network Engineer’s Toolbox​

Eywa is one of the more intriguing papers because it turns LLMs back onto the systems that make LLMs possible. The system uses LLMs to build protocol models from natural-language sources, enabling model-based testing. Microsoft says it uncovered 33 bugs, including 16 previously unknown bugs, in widely used network protocol implementations.
This is exactly the sort of use case where AI may prove more valuable than the most theatrical chatbot demos. Protocol specifications are dense, ambiguous, and historically difficult to translate into exhaustive tests. If an LLM can help produce useful models from those documents, it becomes a force multiplier for systems correctness.
The obvious caveat is that LLM-generated models must themselves be checked. A bad model can create false confidence, and protocol behavior is not a place where vibes-based engineering belongs. But the direction is powerful: AI systems are beginning to help reason about the lower layers of computing, not merely consume them.
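The underlying technique, model-based testing, can be shown in miniature. Eywa derives its models with an LLM from natural-language specs; in this toy sketch the model is a hand-written three-state machine, and the "implementation" hides a deliberate bug:

```python
# A miniature of model-based protocol testing. The state machine, events, and
# the planted bug are all invented for illustration; Eywa's models come from
# LLMs reading real protocol documents.

import itertools

MODEL = {  # (state, event) -> next state, per a simplified "spec"
    ("CLOSED", "open"):  "OPEN",
    ("OPEN",   "data"):  "OPEN",
    ("OPEN",   "close"): "CLOSED",
}

class Impl:
    """Implementation under test."""
    def __init__(self):
        self.state = "CLOSED"
    def step(self, event):
        if event == "open":
            self.state = "OPEN"
        elif event == "data":
            self.state = "OPEN"   # bug: data on a CLOSED connection opens it
        elif event == "close" and self.state == "OPEN":
            self.state = "CLOSED"

def find_divergence(max_len=3):
    """Replay all short event sequences through the model and implementation."""
    for seq in itertools.product(["open", "data", "close"], repeat=max_len):
        model_state, impl = "CLOSED", Impl()
        for ev in seq:
            nxt = MODEL.get((model_state, ev))
            impl.step(ev)
            if nxt is None:               # model says: event has no effect here
                if impl.state != model_state:
                    return seq            # implementation acted anyway
                continue
            model_state = nxt
            if impl.state != model_state:
                return seq
    return None

divergence = find_divergence()
# Exhaustive replay surfaces a short sequence exposing the implicit-open bug.
```

Even this toy version shows why the approach scales: once a model exists, generating and replaying sequences is mechanical, which is exactly the labor LLM-derived models would unlock for real protocols.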
MetaEase fits the same operational theme from another angle. It analyzes heuristics directly from source code to uncover worst-case performance scenarios, avoiding complex formal modeling. Heuristics are everywhere in production systems because perfect optimization is usually too slow, too brittle, or too expensive. The problem is that heuristics can hide pathological cases until scale exposes them. A tool that finds those cases earlier is less glamorous than a new AI model, but likely more valuable to an operator responsible for global reliability.
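MetaEase analyzes the heuristic's source code directly, but the goal it serves can be illustrated with a cruder stand-in: empirically searching for inputs where a classic heuristic, first-fit bin packing here, drifts away from an easy lower bound. Everything below is an illustration of the goal, not the paper's method:

```python
# A crude stand-in for MetaEase's goal: hunting for inputs where a production
# heuristic behaves badly. MetaEase reasons about source code; this sketch
# just randomly probes first-fit bin packing against a volume lower bound.

import math
import random

def first_fit(items, cap=1.0):
    """Greedy first-fit bin packing: the heuristic under test."""
    bins = []
    for x in items:
        for b in bins:
            if sum(b) + x <= cap:
                b.append(x)
                break
        else:
            bins.append([x])
    return len(bins)

def lower_bound(items, cap=1.0):
    """Any packing needs at least ceil(total volume / capacity) bins."""
    return math.ceil(sum(items) / cap)

random.seed(0)
worst = max(
    first_fit(items) / lower_bound(items)
    for items in ([random.uniform(0.1, 0.7) for _ in range(20)]
                  for _ in range(2000))
)
# `worst` records how far the heuristic drifted above the bound on any trial.
```

Random search only finds what it stumbles on; analyzing the code itself, as the paper does, can expose pathological cases that sampling would essentially never hit.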

Optical Networks Are Now Part of Cloud Reliability​

HEDGE, a paper involving Cornell, NYSERNet, Microsoft, and Meta authors, focuses on wavelength-specific faults in optical networks. That may sound specialized, but optical transport is part of the circulatory system of modern cloud infrastructure. The paper combines link-local and global network-wide resilience to maintain stable capacity and optimize traffic despite fluctuating link performance.
The key idea is that cloud reliability increasingly depends on partial degradation rather than outright failure. A link may not simply be up or down. A wavelength may underperform, a path may become unstable, or capacity may fluctuate in ways that traditional failover logic handles poorly. Hyperscale systems need to keep serving traffic through gray failure, not merely recover after a clean break.
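The shift from binary failover to capacity-aware routing can be sketched in a few lines. The path names, Gbps figures, and proportional-split policy below are invented for illustration; HEDGE's actual mechanisms are considerably more involved:

```python
# Sketch of capacity-aware traffic splitting: paths are weighted by measured
# capacity rather than treated as simply up or down. All names and numbers
# here are invented.

def split_traffic(demand_gbps, path_capacity):
    """Assign demand to paths in proportion to their live capacity."""
    total = sum(path_capacity.values())
    if total == 0:
        raise RuntimeError("no usable capacity on any path")
    return {p: demand_gbps * c / total for p, c in path_capacity.items()}

# One wavelength on "east" degraded from 100 to 40 Gbps; "west" stays healthy.
shares = split_traffic(120, {"east": 40, "west": 100})
# The degraded path still carries some traffic instead of being failed outright.
```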
This is where Microsoft’s NSDI presence feels less like a collection of papers and more like a doctrine. The company is investing in systems that assume imperfection: imperfect links, imperfect heuristics, imperfect isolation, imperfect utilization, imperfect memory locality, and imperfect human-written specifications. The cloud is too large for perfect conditions. The winning operator is the one that turns imperfection into a managed variable.

Security Is Being Pulled Down Into the Runtime​

KRAKENGUARD addresses eBPF, one of the most powerful and dangerous tools in modern Linux-based infrastructure. eBPF allows programs to run inside the kernel in controlled ways, enabling observability, networking, security enforcement, and performance tooling. It is also an obvious source of anxiety in multi-tenant environments because fine-grained power inside the kernel is still power inside the kernel.
The paper proposes policy-based controls on eBPF programs at load time using symbolic execution. Microsoft’s summary says it can enforce controls without relying on coarse Linux capabilities, prevent malicious behavior, detect vulnerabilities, and allow secure execution of untrusted programs with strong isolation guarantees. In plainer terms, it is an attempt to make eBPF safer without throwing away the flexibility that made eBPF valuable in the first place.
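The flavor of a load-time policy check can be sketched, with the strong caveat that KRAKENGUARD reportedly relies on symbolic execution, whereas this toy version only screens a program's helper calls against a hypothetical per-role allowlist:

```python
# A much-simplified sketch of load-time policy checks for eBPF-style programs.
# The instruction tuples, roles, and helper names are invented; real eBPF
# verification operates on bytecode with far richer analysis.

POLICY = {
    "observability": {"map_lookup", "ktime_get", "perf_event_output"},
    "network":       {"map_lookup", "map_update", "redirect"},
}

def check_program(instructions, role):
    """Reject a program that invokes any helper outside the role's policy."""
    allowed = POLICY[role]
    violations = [arg for op, arg in instructions
                  if op == "call" and arg not in allowed]
    return len(violations) == 0, violations

prog = [("load", "r1"), ("call", "map_lookup"), ("call", "redirect")]
ok_obs, bad = check_program(prog, role="observability")   # redirect is refused
ok_net, _ = check_program(prog, role="network")           # same program loads
```

The point of symbolic execution in the real system is to reason about what a program can do along every path, not just which helpers appear in its text, which is why a simple allowlist like this is only the outermost layer of the idea.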
That reflects a broader industry movement. Cloud security cannot be a perimeter product bolted on after the fact. It has to be embedded into schedulers, kernels, network pipelines, offload devices, and deployment systems. The more programmable the cloud becomes, the more security has to travel with the program.
HarvestContainers tackles another production tension: utilization versus latency. The system protects latency-sensitive containers from interference while using spare CPU cores for latency-tolerant work. Microsoft says it can use up to 75 percent of spare CPU while keeping tail latency within 4 percent of standalone performance. That is the sort of result cloud providers love because unused capacity is not just waste; it is paid-for waste.
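The control problem behind that result can be sketched as a simple feedback loop: lend spare cores to best-effort containers, and hand them back the moment the latency-sensitive tenant's tail latency drifts past budget. The thresholds and step sizes below are invented, not the paper's:

```python
# A toy feedback loop in the spirit of CPU harvesting: lend spare cores to
# best-effort containers, yield them back when the latency-sensitive tenant's
# p99 drifts past its budget. All thresholds here are invented.

def harvest_step(spare_cores, harvested, p99_ms, slo_ms, slack=0.04):
    """Return the new number of cores lent to best-effort work."""
    if p99_ms > slo_ms * (1 + slack):     # SLO at risk: give cores back fast
        return max(0, harvested - 2)
    if p99_ms < slo_ms:                   # headroom: harvest one more core
        return min(spare_cores, harvested + 1)
    return harvested                      # inside the guard band: hold steady

h = 0
for p99 in [8.0, 8.5, 9.0, 12.0, 9.5, 8.0]:   # observed p99 samples (ms)
    h = harvest_step(spare_cores=8, harvested=h, p99_ms=p99, slo_ms=10.0)
# The spike at 12.0 ms forces a fast retreat before harvesting resumes.
```

The asymmetry is deliberate: harvest slowly, retreat quickly, because the latency-sensitive tenant's SLO is the constraint and the harvested throughput is only the bonus.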

Microsoft’s Research Agenda Is Also a Business Strategy​

There is a temptation to treat conference papers as detached from corporate strategy. That would be a mistake here. Microsoft’s NSDI ’26 lineup maps cleanly onto the pressure points of Azure’s business: AI inference cost, accelerator communication, memory scaling, network offload, live migration, protocol correctness, container utilization, optical resilience, and programmable security.
The company is not alone in working on these problems. The papers include collaborators from the University of Chicago, UCLA, Columbia, Wisconsin-Madison, Cornell, Meta, Zhejiang University, Tsinghua, the University of Toronto, MIT, Rice, Georgia Tech, the Chinese University of Hong Kong, IIT Roorkee, Imperial College London, and others. That breadth is typical of systems research, but it also shows how hyperscale infrastructure is now too complex for any one lab to solve in isolation.
Still, Microsoft’s role as a returning sponsor, program committee participant, and author or collaborator on 11 accepted papers gives the company a strong institutional presence. This is not just a logo on a conference page. It is Microsoft inserting itself into the research pipeline that will shape the next generation of cloud infrastructure.
For Azure, that pipeline matters because the cloud market is no longer defined only by who has the most regions or the cheapest virtual machines. The strategic question is who can deliver specialized, AI-heavy, security-sensitive, globally distributed workloads at acceptable cost and reliability. That is a systems problem before it is a sales problem.

The Windows Angle Is the Cloud Angle​

A Windows enthusiast might reasonably ask why an NSDI paper on optical faults or disaggregated memory belongs on their radar. The answer is that Windows has become one endpoint in a Microsoft platform whose center of gravity is increasingly cloud infrastructure. Windows Update, Intune, Defender, Entra, Microsoft 365, OneDrive, Azure Virtual Desktop, Dev Box, Windows 365, and Copilot all depend on the same operational substrate.
The client operating system still matters, but it no longer carries the full user experience alone. A Windows PC in 2026 is often a local interface to identity services, policy engines, telemetry systems, cloud storage, AI inference endpoints, and remote management backends. When those systems get faster, cheaper, or more reliable, Windows users feel it indirectly. When they fail, Windows users feel that too.
That is why Microsoft’s research into networked systems has a consumer and enterprise afterlife. Better container harvesting can lower service costs. Safer eBPF can improve observability and security tooling. Smarter offload can reduce latency and power draw. Faster live migration can make maintenance less disruptive. KV cache sharing can make AI features more responsive or less expensive to serve.
The risk is that these improvements also deepen dependence on infrastructure users cannot inspect. Microsoft’s cloud stack becomes more capable and more opaque at the same time. That is not unique to Microsoft, but it is central to the modern Windows experience: more features arrive as services, and more of the service logic lives beyond the machine.

The Real Announcement Is That Scale Has Become the Product​

The most concrete lesson from Microsoft’s NSDI ’26 slate is that infrastructure efficiency is no longer a back-office concern. It shapes product latency, AI availability, enterprise reliability, security boundaries, and cloud margins. The papers differ in technique, but they point toward the same operational thesis: hyperscale computing now advances by reducing waste and managing partial failure everywhere.
  • Microsoft had 11 accepted NSDI ’26 papers with collaborators across AI systems, datacenter networking, wide-area networks, cloud infrastructure, and security.
  • DroidSpeak and ForestColl target AI infrastructure efficiency by reducing duplicated inference work and improving accelerator communication schedules.
  • Octopus, Pyrocumulus, and SONiC DASH SmartSwitch show Microsoft pushing more infrastructure intelligence into memory fabrics, SmartNICs, and offload pipelines.
  • Eywa and MetaEase use automation to find correctness and performance problems that are difficult for humans to model exhaustively at cloud scale.
  • HEDGE and KRAKENGUARD reflect the new reliability and security reality, where partial failures and programmable runtimes must be governed before they become incidents.
  • The practical payoff for Azure and Microsoft customers is not one feature, but a platform that can run denser, faster, safer, and cheaper services under AI-era demand.
Microsoft’s NSDI ’26 showing is not a product launch, and that is exactly why it is worth watching. The company is exposing the layer where the next cloud competition will be fought: not in the chatbot window, but in the systems that decide whether billions of AI calls, migrations, packets, containers, and memory accesses can be handled without collapsing the economics. If the last decade of cloud was about building global capacity, the next one will be about making that capacity behave intelligently under pressure.

Source: Microsoft at NSDI 2026: Advances in large-scale networked systems - Microsoft Research
 
