Murakkab: Plain-Language Agent Workflows Optimized for Cost, Energy, and Compute

Researchers from MIT and Microsoft Azure have unveiled Murakkab, a system scheduled for presentation at OSDI 2026 that lets developers describe agentic AI workflows in plain language while automatically optimizing models, tools, hardware, scheduling, energy use, and cloud cost at deployment time. The headline number is not subtle: in reported tests, Murakkab used about 35 percent of the compute, 27 percent of the energy, and less than a quarter of the cost of more conventional approaches. The deeper story is that agentic AI is forcing cloud platforms to confront a mess of their own making. If agents are going to become the default shape of enterprise software, the industry can no longer treat orchestration as a developer convenience layer; it has to become an infrastructure discipline.

The Agent Boom Has Reached Its Resource Hangover

The past two years of AI product strategy have treated agents as the obvious next step after chatbots. Instead of asking one model one question, the new pitch is a constellation of models, tools, databases, code interpreters, retrieval systems, validators, and planners working together to complete a task. The result can be genuinely useful, especially when a system must inspect a video, write code, query a data source, and reason over the results.
But every step in that chain has a cost. A model call consumes accelerator time. A tool call may trigger another service. A retry loop can quietly double or triple the work. A decision to use the most capable model for every subtask may look safe to the developer and disastrous to the cloud bill.
That is the problem Murakkab is trying to solve. The system’s name, drawn from an Urdu word meaning a composition of things, is apt because agentic workflows are less like a single application and more like a small, moving supply chain. The central claim from the MIT and Microsoft researchers is that today’s systems are inefficient not merely because individual models are expensive, but because the whole workflow is usually opaque to the platform running it.
That distinction matters. Cloud providers already know how to schedule jobs, pack workloads, meter resources, and sell performance tiers. What they often cannot see is whether an AI application’s internal sequence of model calls and tool invocations is sensible, parallelizable, overprovisioned, or wasteful. Murakkab proposes to make the workflow legible enough that the cloud can optimize it.

Hard-Coding the Agent Stack Was Always a Temporary Hack

Most early agentic systems were built the way early web applications were built: with a lot of manual wiring, a lot of assumptions, and a certain amount of hope. Developers choose a planner model, decide which tools it can call, stitch together retrieval and validation steps, specify which components run in sequence, and pick hardware or service tiers that seem likely to work. If latency is poor, they tune. If the bill is too high, they tune again.
That approach is tolerable when the workflow is small and the team understands every component. It becomes brittle when the system is composed of black-box models from multiple vendors, external tools with their own latency profiles, and user-facing service-level objectives that vary by customer. A video question-answering workflow, for example, might need to extract frames, transcribe speech, summarize visual content, select relevant segments, and answer a user’s question. There is no single “right” implementation; there are many implementations with different speed, accuracy, cost, and energy characteristics.
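The sketch below makes the size of that design space concrete. It enumerates every end-to-end configuration of a hypothetical video question-answering pipeline. The stage names, option names, and latency and cost figures are invented placeholders, not Murakkab measurements. The model also assumes the stages run strictly in sequence. The point is only that a few choices per stage multiply into many distinct implementations.

```python
from itertools import product

# Hypothetical per-stage options: (option name, latency in seconds, cost in dollars).
# Every number is an illustrative placeholder, not a profiled value.
STAGES = {
    "extract_frames": [("every_frame", 4.0, 0.010), ("keyframes_only", 1.0, 0.003)],
    "transcribe": [("large_asr", 3.0, 0.020), ("small_asr", 1.5, 0.006)],
    "summarize_visuals": [("frontier_vlm", 6.0, 0.120), ("small_vlm", 2.5, 0.030)],
    "answer": [("frontier_llm", 2.0, 0.050), ("mid_llm", 1.2, 0.015), ("small_llm", 0.8, 0.004)],
}

def enumerate_configs(stages):
    """Yield (chosen options, total latency, total cost) for every combination,
    assuming the stages run one after another."""
    for combo in product(*stages.values()):
        latency = sum(option[1] for option in combo)
        cost = sum(option[2] for option in combo)
        yield tuple(option[0] for option in combo), latency, cost

configs = list(enumerate_configs(STAGES))
print(len(configs))  # 24, i.e. 2 * 2 * 2 * 3
costs = [cost for _, _, cost in configs]
print(f"cost per request ranges from ${min(costs):.3f} to ${max(costs):.3f}")
```

Adding one more model option to any stage, or one more stage, multiplies the count again. This is why hand-tuning stops being credible as the catalog grows.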
The bigger the design space, the less believable manual optimization becomes. Developers can overfit to yesterday’s model catalog, yesterday’s GPU availability, and yesterday’s pricing. A better small model may appear. A new accelerator may alter the cost curve. A particular stage may turn out not to need a frontier model at all.
Murakkab’s first move is to separate what the developer wants from how the platform executes it. Instead of requiring the developer to hard-code every technical choice upfront, the system accepts a higher-level description of the desired workflow. From there, it can select models and tools, infer dependencies, decide what can run in parallel, and generate an execution plan for the cloud provider.
That is a philosophical shift as much as a systems one. It says agentic workflow design should look less like hand-authoring a fragile script and more like declaring an objective that an optimizer can satisfy under constraints.
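A small sketch can illustrate how dependencies and parallelism might be inferred from a higher-level description. It is not Murakkab's actual workflow format, which this coverage does not reproduce. It works as follows, under these assumptions:

- A hypothetical declaration lists, for each task, the data the task consumes and produces.
- Dependencies are derived from which task produces each input.
- Kahn's topological-sort algorithm then layers the tasks, so every task in a layer can run concurrently.

```python
from collections import defaultdict

# Hypothetical declarative workflow: each task names the data it consumes and produces.
# The developer states no execution order; "video" and "question" are external inputs.
WORKFLOW = {
    "extract_frames": {"inputs": ["video"], "outputs": ["frames"]},
    "transcribe": {"inputs": ["video"], "outputs": ["transcript"]},
    "describe_frames": {"inputs": ["frames"], "outputs": ["captions"]},
    "select_segments": {"inputs": ["captions", "transcript", "question"], "outputs": ["segments"]},
    "answer": {"inputs": ["segments", "question"], "outputs": ["answer"]},
}

def infer_dependencies(workflow):
    """Map each task to the set of tasks whose outputs it consumes."""
    producer = {out: name for name, spec in workflow.items() for out in spec["outputs"]}
    return {
        name: {producer[i] for i in spec["inputs"] if i in producer}
        for name, spec in workflow.items()
    }

def parallel_stages(deps):
    """Layered Kahn's algorithm: each stage holds tasks whose dependencies all
    finished in earlier stages, so tasks within one stage can run concurrently."""
    waiting = {task: set(d) for task, d in deps.items()}
    dependents = defaultdict(set)
    for task, d in deps.items():
        for dep in d:
            dependents[dep].add(task)
    ready = sorted(task for task, d in waiting.items() if not d)
    stages = []
    while ready:
        stages.append(ready)
        unlocked = []
        for task in ready:
            for child in dependents[task]:
                waiting[child].discard(task)
                if not waiting[child]:
                    unlocked.append(child)
        ready = sorted(unlocked)
    if sum(len(s) for s in stages) != len(deps):
        raise ValueError("workflow contains a dependency cycle")
    return stages

for i, stage in enumerate(parallel_stages(infer_dependencies(WORKFLOW))):
    print(f"stage {i}: {stage}")
# stage 0: ['extract_frames', 'transcribe']
# stage 1: ['describe_frames']
# stage 2: ['select_segments']
# stage 3: ['answer']
```

The plan puts `extract_frames` and `transcribe` in the first stage because neither consumes the other's output. The developer never wrote that ordering down.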

Microsoft’s Interest Is Not Academic

It is easy to read Murakkab as a neat university systems paper and miss the industrial subtext. Microsoft Azure is deeply embedded in the work, with Microsoft researchers among the authors and Ricardo Bianchini, a technical fellow and corporate vice president at Microsoft Azure, listed as the senior author. That matters because the pain Murakkab addresses is exactly the pain hyperscale clouds are now inheriting from the agent hype cycle.
Microsoft has spent heavily to position Azure as an AI infrastructure platform, not just a place to rent virtual machines. The company’s broader AI strategy now spans Copilot, Azure AI infrastructure, Microsoft Foundry, Fabric, GitHub, Windows, and enterprise security tooling. All of those ambitions become more complicated if agentic workloads are expensive, unpredictable, and difficult to schedule efficiently.
The cloud business has a straightforward incentive here: if agentic AI makes workloads more valuable but also more wasteful, the provider that can reduce waste without lowering user-visible quality gets an advantage. It can preserve margins, ease capacity pressure, offer more attractive pricing, and make stronger claims about energy efficiency. In a market where GPUs remain scarce enough to shape product roadmaps, better orchestration is not a green garnish. It is capacity strategy.
That also explains why Murakkab focuses on the full stack rather than a single clever scheduling trick. The system is described as using a declarative abstraction, profile-guided optimization, and an adaptive runtime. In plain English, it tries to understand the workflow, learn how candidate configurations behave, and adjust execution dynamically against user-defined service-level objectives.
That is exactly the kind of machinery a cloud provider wants if agentic AI becomes normal enterprise plumbing. The customer wants an answer within a latency budget, at an acceptable accuracy level, for an acceptable cost. The provider wants to satisfy that contract with as little wasted GPU time and energy as possible. Murakkab is a sketch of the broker between those interests.
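The sketch below covers the profile-guided and adaptive pieces. It rests on these assumptions:

- Each candidate configuration has been profiled for p95 latency, accuracy, and cost.
- The profile-guided step picks the cheapest candidate that meets the latency and accuracy objectives.
- The adaptive step replaces stale profiled latencies with live observations and chooses again.

The `Profile` and `SLO` types, candidate names, and numbers are hypothetical. This is a simplification for illustration, not Murakkab's actual profiler or runtime.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Profile:
    """Profiled behaviour of one candidate configuration (hypothetical values)."""
    name: str
    p95_latency_s: float
    accuracy: float
    cost_per_task: float

@dataclass(frozen=True)
class SLO:
    max_latency_s: float
    min_accuracy: float

def choose(profiles: list[Profile], slo: SLO) -> Optional[Profile]:
    """Profile-guided step: cheapest configuration that meets the SLO, or None."""
    feasible = [
        p for p in profiles
        if p.p95_latency_s <= slo.max_latency_s and p.accuracy >= slo.min_accuracy
    ]
    return min(feasible, key=lambda p: p.cost_per_task, default=None)

def reselect(profiles: list[Profile], slo: SLO, observed_p95: dict[str, float]) -> Optional[Profile]:
    """Adaptive step: replace stale profiled latencies with live observations, then choose again."""
    refreshed = [
        replace(p, p95_latency_s=observed_p95.get(p.name, p.p95_latency_s)) for p in profiles
    ]
    return choose(refreshed, slo)

candidates = [
    Profile("frontier-everywhere", 9.0, 0.91, 0.40),
    Profile("small-asr+frontier-answer", 6.5, 0.90, 0.18),
    Profile("small-asr+frontier-answer@fast-gpu", 4.5, 0.90, 0.30),
    Profile("all-small-models", 4.0, 0.84, 0.05),
]
slo = SLO(max_latency_s=8.0, min_accuracy=0.88)
print(choose(candidates, slo).name)  # small-asr+frontier-answer
# Under load, the chosen configuration's live p95 drifts past the budget:
print(reselect(candidates, slo, {"small-asr+frontier-answer": 8.6}).name)  # small-asr+frontier-answer@fast-gpu
```

The switch to the faster-GPU variant is the provider honoring the contract. Once observations show the cheaper plan no longer meets the customer's latency objective, that objective wins over cost.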

The Energy Story Is Really an Inference Story

The public debate over AI energy use often fixates on training giant models. Training is visible, spectacular, and easy to imagine as a warehouse of GPUs running for weeks. But the long-term economics of AI may depend just as much on inference: the repeated act of serving users after the model is built.
Agentic systems make inference harder to reason about. A normal chatbot interaction might involve one model response. A reasoning-heavy or agentic task can involve planning, tool use, retrieval, verification, code execution, re-ranking, and multiple rounds of generation. Even if each individual step is optimized, the workflow can become expensive because there are so many steps and because they may be poorly matched to the hardware and models used.
That is why Murakkab’s reported energy reduction is more interesting than the usual “new model is smaller” claim. The system is not simply compressing a model or swapping one neural network for another. It is attacking waste at the orchestration layer, where the wrong model can be assigned to a trivial step, where sequential work can hide parallelism, and where overprovisioned hardware can sit underused while still consuming power.
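One of those sources of waste, sequential work that hides parallelism, is easy to quantify. Take a hypothetical dependency graph of stage latencies. Running the stages one after another takes the sum of all latencies. A scheduler that respects only true dependencies can finish no sooner than the longest (critical) path. The latencies below are invented for illustration.

```python
from functools import lru_cache

# Hypothetical stage latencies (seconds) and dependencies for a video Q&A workflow.
LATENCY = {"frames": 3.0, "transcribe": 4.0, "captions": 5.0, "select": 1.0, "answer": 2.0}
DEPS = {
    "frames": [],
    "transcribe": [],
    "captions": ["frames"],
    "select": ["captions", "transcribe"],
    "answer": ["select"],
}

def sequential_time(latency):
    """Wall-clock time if every stage waits for the previous one."""
    return sum(latency.values())

def critical_path_time(latency, deps):
    """Earliest finish time of the whole graph, assuming unlimited parallel capacity."""
    @lru_cache(maxsize=None)
    def finish(task):
        start = max((finish(d) for d in deps[task]), default=0.0)
        return start + latency[task]
    return max(finish(t) for t in latency)

print(sequential_time(LATENCY))           # 15.0
print(critical_path_time(LATENCY, DEPS))  # 11.0
```

The 11-second figure assumes unlimited parallel capacity, so it is a lower bound. A scheduler with a fixed number of GPUs may land somewhere between the two numbers.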
The researchers report that Murakkab consumed only about 27 percent as much energy as other methods in tests across workflows such as video question answering and code generation. In another example, it reportedly cut energy consumption by more than an order of magnitude with only about a 2 percent accuracy drop. Those numbers will need the usual scrutiny that any systems benchmark deserves: workload selection, baselines, cluster size, hardware assumptions, and service-level objective (SLO) definitions all matter.
Still, the direction is plausible. Agentic workflows create many knobs, and today’s deployment stacks do not turn those knobs with much global awareness. A system that can see across the workflow has more chances to avoid waste.

The Developer Pitch Is Convenience, but the Platform Pitch Is Control

For developers, Murakkab’s most appealing feature may be the plain-language workflow description. A developer describes the application intent — say, a system that extracts key moments from a video, transcribes it, and answers questions — and the platform figures out the composition. That sounds like one more step in the industry’s march toward higher-level programming abstractions.
But the more consequential audience is the cloud provider. Murakkab gives the platform visibility into workflow internals that are usually hidden behind API calls and application code. Once the platform can inspect those internals, it can make decisions the developer may not be able to make manually: which stages can share resources, which subtasks can run on cheaper hardware, which model substitutions preserve accuracy, and which execution schedule best meets a user’s latency requirement.
That can be good for users, but it also shifts power. The cloud provider becomes not just the host of the workflow but an active participant in deciding how the workflow is implemented. The developer specifies intent and constraints. The platform chooses the operational reality.
In enterprise IT, that tradeoff will feel familiar. Managed databases, serverless platforms, Kubernetes autoscalers, and SaaS productivity suites all made similar bargains. Give the platform more control, and it can absorb complexity. Give the platform too much control, and portability, explainability, and auditability become harder.
Murakkab does not erase that tension. It sharpens it. If the system automatically swaps models, changes hardware allocations, or reorders execution based on a customer’s latency or cost target, administrators will want logs, policy controls, reproducibility guarantees, and compliance boundaries. A financial services firm may not want a workflow silently routed through an unapproved model. A healthcare provider may care less about saving 20 percent if the optimization path complicates governance.
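Requirements like reproducibility and auditability can be partly met by recording every optimizer decision in a stable, comparable form. The sketch below shows one hypothetical approach, not a documented Murakkab or Azure feature:

- Serialize the chosen execution plan canonically, with sorted keys and fixed separators.
- Hash the result with SHA-256, so identical plans always share a fingerprint.
- Append a log record stating why the plan became active.

The record fields, model names, and region name are illustrative assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone

def plan_fingerprint(plan: dict) -> str:
    """Stable SHA-256 of a plan: identical plans hash identically regardless of key order."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def log_decision(log: list, workflow_id: str, plan: dict, reason: str) -> dict:
    """Append an audit record describing which plan is active and why."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "workflow_id": workflow_id,
        "plan_sha256": plan_fingerprint(plan),
        "plan": plan,
        "reason": reason,
    }
    log.append(record)
    return record

audit_log = []
plan_a = {"answer": {"model": "approved-llm-v2", "region": "eu-west"}, "transcribe": {"model": "asr-small"}}
plan_b = {"transcribe": {"model": "asr-small"}, "answer": {"region": "eu-west", "model": "approved-llm-v2"}}
assert plan_fingerprint(plan_a) == plan_fingerprint(plan_b)  # same plan, different key order
log_decision(audit_log, "video-qa", plan_a, "p95 latency within SLO at lower cost")
```

Because the fingerprint ignores key order, an administrator can tell whether two runs used the same plan by comparing one hash instead of diffing configuration files.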

The Windows Angle Runs Through Azure, Copilot, and the New Back Office

For WindowsForum readers, Murakkab is not a Windows feature in the familiar sense. It is not a Start menu toggle, a kernel change, or a new Group Policy setting. Its relevance runs through the infrastructure Microsoft is building behind the Windows and Microsoft 365 experiences that increasingly depend on AI services.
Copilot is the obvious example. The more Microsoft turns Copilot from a chat pane into an agentic fabric across Windows, Office, Teams, Edge, GitHub, Azure, and security products, the more its backend resembles the fragmented workflows Murakkab targets. Summarizing a meeting, searching enterprise data, creating a document, checking permissions, invoking a plugin, and validating an answer are not a single model call. They are a chain.
That chain has to run somewhere. For most Microsoft customers, the answer is Azure, whether directly or indirectly. If agentic AI is to become a routine part of business software, Microsoft needs the cost per useful task to fall and the predictability per task to rise. Otherwise, customers will encounter the same old enterprise problem in a new costume: a promising platform that becomes expensive once it leaves the demo stage.
Murakkab also points toward a future in which IT admins manage AI workloads less by picking individual models and more by defining policy. A team might specify that a workflow must answer within a certain latency, remain within a budget, preserve a minimum accuracy, use only approved models, or prefer lower-energy execution when possible. The orchestration layer would then search for a configuration that satisfies those constraints.
That would be a natural extension of how administrators already think about cloud governance. Budgets, regions, identity, compliance tags, and service-level objectives are policy language. Agentic AI needs an equivalent, because “let the agent decide” is not a governance model.
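Policy of that kind maps naturally onto hard constraints plus a preference. The sketch below is hypothetical and treats the following as filters that no configuration may violate:

- the approved-model list
- the latency limit
- the budget
- the accuracy floor

Energy (or cost) is then minimized only among the candidates that survive. Names and numbers are invented, and the fields do not describe any real Azure policy schema.

```python
from dataclasses import dataclass
from operator import attrgetter
from typing import Optional

@dataclass(frozen=True)
class Candidate:
    """One possible configuration of a workflow (all values hypothetical)."""
    models: frozenset  # every model the configuration would invoke
    latency_s: float
    cost: float
    accuracy: float
    energy_wh: float

@dataclass(frozen=True)
class Policy:
    """Administrator-defined constraints; field names are illustrative, not an Azure schema."""
    approved_models: frozenset
    max_latency_s: float
    max_cost: float
    min_accuracy: float
    prefer_low_energy: bool = True

def complies(c: Candidate, p: Policy) -> bool:
    """Hard constraints: a candidate that breaks any one of them is never eligible."""
    return (
        c.models <= p.approved_models  # subset check: only approved models
        and c.latency_s <= p.max_latency_s
        and c.cost <= p.max_cost
        and c.accuracy >= p.min_accuracy
    )

def select(candidates: list[Candidate], policy: Policy) -> Optional[Candidate]:
    """Among compliant candidates, minimise energy (or cost), with the other as tiebreaker."""
    eligible = [c for c in candidates if complies(c, policy)]
    order = ("energy_wh", "cost") if policy.prefer_low_energy else ("cost", "energy_wh")
    return min(eligible, key=attrgetter(*order), default=None)

policy = Policy(frozenset({"llm-approved", "asr-small"}), max_latency_s=6.0, max_cost=0.20, min_accuracy=0.85)
candidates = [
    Candidate(frozenset({"llm-unapproved", "asr-small"}), 3.0, 0.05, 0.92, 8.0),  # best on paper, not approved
    Candidate(frozenset({"llm-approved", "asr-small"}), 5.0, 0.12, 0.88, 20.0),
    Candidate(frozenset({"llm-approved"}), 5.5, 0.15, 0.87, 14.0),
]
chosen = select(candidates, policy)  # the third candidate: compliant and lowest energy
```

The unapproved configuration is excluded even though it wins on every metric. Policy acts as a filter applied before optimization, not as a weight that a large enough saving can outvote.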

Benchmarks Are Promising, but Production Is Crueler Than a Paper

The reported Murakkab results are strong enough to command attention. Using roughly 35 percent of the compute, 27 percent of the energy, and less than 25 percent of the cost compared with baseline methods is the kind of improvement that changes a spreadsheet. Microsoft Research’s summary also frames the gains as reductions of up to 2.8 times in GPU usage, 3.7 times in energy consumption, and 4.3 times in cost while maintaining service-level objectives.
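The two ways of stating the results agree once converted. Using a share s of the baseline is the same as a reduction factor of 1/s. The check below uses only the figures quoted above; it is arithmetic on the reported numbers, not additional data from the paper.

```python
# Reduction factors from the Microsoft Research summary; share of baseline = 1 / factor.
factors = {"GPU usage": 2.8, "energy": 3.7, "cost": 4.3}
for resource, factor in factors.items():
    print(f"{resource}: {factor}x reduction -> {1 / factor:.0%} of baseline")
# GPU usage: 2.8x reduction -> 36% of baseline
# energy: 3.7x reduction -> 27% of baseline
# cost: 4.3x reduction -> 23% of baseline
```

Roughly 36, 27, and 23 percent line up with the “about 35 percent,” “27 percent,” and “less than 25 percent” figures. The factors are stated as “up to” values while the percentages are stated as approximate, so a one-point gap is expected.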
But systems papers live or die by how much of the benchmark survives contact with production. Real enterprise workflows are messy. They include private data, uneven request patterns, strict access controls, model version pinning, regional data residency, vendor-specific APIs, and human expectations that are harder to encode than latency targets.
There is also the problem of evaluation itself. Accuracy in a video Q&A workflow may be measurable on a benchmark, but enterprise usefulness often depends on context. A 2 percent accuracy drop may be acceptable for a low-stakes content search tool and unacceptable for a security incident response workflow. A cost-saving model substitution may look rational until it changes the tone, format, or edge-case behavior of a business-critical process.
Murakkab’s promise, then, is not that it can magically optimize every agent. It is that it can expose and navigate tradeoffs that are currently hidden. That alone would be progress. An organization that can choose between “fast and expensive,” “cheap and slightly less accurate,” and “low-energy with bounded latency” is in a better position than one that discovers the tradeoff only after the invoice arrives.
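Exposing those choices amounts to showing the Pareto frontier: the configurations that no other configuration beats on every axis at once. The sketch below works over hypothetical profiles of cost, error rate, energy, and latency, where lower is better on every axis. The labels and numbers are illustrative only.

```python
def dominates(a, b):
    """True if a is no worse than b on every axis and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(options):
    """Keep only the options that no other option dominates (lower is better everywhere)."""
    return {
        name: metrics
        for name, metrics in options.items()
        if not any(dominates(other, metrics) for o, other in options.items() if o != name)
    }

# Hypothetical profiles: (cost per task in $, error rate, energy per task in Wh, p95 latency in s).
OPTIONS = {
    "fast and expensive": (0.40, 0.08, 30.0, 2.0),
    "cheap and slightly less accurate": (0.06, 0.11, 12.0, 6.0),
    "low-energy with bounded latency": (0.10, 0.09, 7.0, 5.0),
    "wasteful": (0.45, 0.10, 35.0, 4.0),  # worse than "fast and expensive" on every axis
}
print(sorted(pareto_front(OPTIONS)))
# ['cheap and slightly less accurate', 'fast and expensive', 'low-energy with bounded latency']
```

Each configuration that survives is a legitimate choice, and picking among them is a business decision. The dominated one is simply waste.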
The more interesting production question is whether such systems become open and portable or cloud-specific and sticky. A declarative workflow abstraction could, in theory, help developers move across providers. In practice, the deepest optimization depends on knowing the provider’s hardware fleet, scheduling policies, model catalog, and pricing. That tilts the advantage toward hyperscalers.

The Green AI Debate Needs More Systems Work and Less Theater

AI energy discourse has a habit of splitting into two unsatisfying camps. One side treats every model query as an ecological scandal. The other side waves away energy concerns by invoking future efficiency gains and the economic value of automation. Both arguments miss the operational middle where most of the important engineering happens.
Murakkab lives in that middle. It does not claim that AI is free, and it does not ask users to stop building agentic systems. It says the industry should stop wasting resources through bad composition, poor visibility, and static configuration. That is a more practical position than moralizing about every prompt or pretending infrastructure costs will solve themselves.
The timing is important because agentic AI changes the unit of demand. The old question was often “How much energy does one model response use?” The emerging question is “How much energy does one completed task use?” A completed task might involve dozens of model and tool calls. It may also displace human labor, reduce search time, automate a support workflow, or generate new demand that did not previously exist.
That complexity makes simple comparisons dangerous. But it also makes orchestration efficiency essential. If the industry cannot measure and optimize the full task pipeline, it will undercount waste, overstate savings, and make poor capacity decisions.
In that sense, Murakkab is part of a broader maturation of AI infrastructure. The first phase of the generative AI boom was about model capability. The second was about product integration. The next phase is about unit economics: latency, reliability, utilization, energy, and cost per completed workflow. That phase is less glamorous, but it is where enterprise adoption either becomes durable or stalls.
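Switching the unit from one response to one completed task is largely an accounting choice, and a small sketch shows why it changes the answer. It assumes a hypothetical trace of every model and tool call a task made, with invented energy values in watt-hours. It computes three figures:

- The per-task figure sums all calls, including a failed attempt that was retried.
- The retry overhead isolates energy spent on discarded work.
- The fleet-level figure divides all energy spent by the tasks that actually completed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Call:
    """One model or tool invocation recorded while serving a task (hypothetical trace)."""
    component: str
    energy_wh: float
    succeeded: bool = True

def energy_per_task(calls):
    """Everything a single task consumed, including failed attempts that were retried."""
    return sum(c.energy_wh for c in calls)

def retry_overhead(calls):
    """Energy spent on calls whose results were thrown away."""
    return sum(c.energy_wh for c in calls if not c.succeeded)

def energy_per_completed_task(traces, completed):
    """Fleet view: all energy spent, divided by the tasks that actually finished."""
    total = sum(energy_per_task(t) for t in traces)
    finished = sum(completed)
    return total / finished if finished else float("inf")

trace = [
    Call("planner", 2.0),
    Call("retrieval", 0.5),
    Call("code_interpreter", 1.0, succeeded=False),  # failed attempt...
    Call("code_interpreter", 1.0),                   # ...and its retry
    Call("answer_model", 1.5),
    Call("verifier", 1.0),
]
print(energy_per_task(trace))  # 7.0, versus 1.5 for the final answer call alone
print(retry_overhead(trace))   # 1.0
print(energy_per_completed_task([trace, trace], [True, False]))  # 14.0: the failed task still counts
```

Measured per response, this task looks like a 1.5 Wh interaction. Measured per completed task, and with one failed task in the fleet, it costs 14.0 Wh.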

A Cloud Optimizer Cannot Replace Architectural Judgment

There is a temptation to imagine Murakkab as a cure for the complexity of agentic AI. It is better understood as a tool for managing complexity that should not have been pushed entirely onto application developers in the first place. The system can search configurations, profile alternatives, and reallocate resources, but it cannot decide every business constraint on behalf of the organization.
Developers and architects still need to define sensible workflows. They need to decide what the agent is allowed to do, what data it can access, where human approval is required, and how failures are handled. An optimizer can make a dangerous workflow cheaper. That is not the same as making it safe.
Security also deserves more attention than it often gets in agent performance discussions. Agentic workflows expand the attack surface because they connect models to tools, code, documents, APIs, and sometimes privileged enterprise systems. If a platform dynamically changes components or execution paths, defenders need to understand what changed and why. Optimization without observability will not fly in serious environments.
The same is true for compliance. If a workflow is subject to data residency or sector-specific rules, the orchestrator must respect those boundaries as first-class constraints, not afterthoughts. Saving energy by moving work to a different region or model endpoint may be unacceptable if it violates policy. The most useful version of Murakkab-like orchestration will be one that treats governance as part of the optimization problem.
That is where Microsoft has both an opportunity and a burden. Azure already sells enterprises on identity, compliance, policy, and security controls. If Microsoft eventually folds Murakkab-like ideas into Azure AI services, customers will expect those controls to extend cleanly into the agent orchestration layer.

The Real Breakthrough Is Making Waste Visible

Murakkab’s most important contribution may not be any single optimization technique. It is the insistence that agentic workflows should expose enough structure for the platform to reason about them. Opaque chains of model calls are convenient for developers but hostile to efficient infrastructure.
Once the workflow is visible, many familiar systems ideas become available. The platform can profile components. It can compare execution plans. It can exploit parallelism. It can assign smaller models to simpler subtasks. It can schedule around available hardware. It can adapt when a new model or accelerator changes the tradeoff curve.
That is not magic; it is cloud computing doing what cloud computing has always done when an abstraction becomes popular enough to industrialize. Virtual machines became autoscaling groups. Containers became orchestrated clusters. Databases became managed services. Agent scripts are now candidates for the same treatment.
The risk is that the abstraction arrives before the standards do. If every provider invents its own declarative agent workflow format, enterprises may face a new portability problem just as they are trying to reduce complexity. Developers could end up trading hard-coded model calls for hard-coded cloud orchestration semantics.
Still, the alternative is worse. Leaving agentic workflows as bespoke application logic guarantees inefficiency at scale. The industry can argue about the right abstraction, but it cannot avoid the need for one.

The Numbers That Should Survive the Hype Cycle

Murakkab is still research, not a generally available Azure product administrators can deploy tomorrow. The concrete lesson is not to wait for a branded service, but to start asking sharper questions about agentic AI architecture now. The impressive benchmark figures are useful because they quantify how much slack may exist in current approaches.
  • Murakkab was developed by researchers from MIT and Microsoft Azure to optimize agentic workflows across model choice, tool composition, hardware allocation, scheduling, energy use, and cost.
  • The system lets developers describe workflow intent at a high level rather than manually hard-coding every model, tool, dependency, and deployment decision.
  • Reported tests on workloads including video question answering and code generation used about 35 percent of the compute, about 27 percent of the energy, and less than 25 percent of the cost of traditional approaches.
  • The strongest practical idea is that cloud platforms need visibility into workflow structure before they can optimize agentic AI efficiently.
  • Enterprise adoption will depend on whether these optimizers can respect governance, security, compliance, reproducibility, and model-approval policies as strictly as they optimize latency and cost.
  • For Microsoft, the research aligns with a larger Azure and Copilot reality: agentic AI cannot scale economically if every workflow behaves like a hand-built bundle of hidden, overprovisioned calls.
Murakkab should be read as an early warning as much as an efficiency breakthrough: the agent era will not be won only by the company with the largest model or the newest GPU fleet, but by the platforms that can turn sprawling AI workflows into measurable, governable, and resource-aware systems. If Microsoft can carry that lesson from research into Azure’s production fabric, the next generation of Copilot-like services may become not just more capable, but less wasteful by design.

References

  1. Primary source: AZoCleantech
    Published: 2026-06-27
  2. Related coverage: news.mit.edu
  3. Official source: microsoft.com
  4. Related coverage: techradar.com
  5. Related coverage: the-agent-report.com
  6. Related coverage: dig.watch
  7. Related coverage: welcome.ai
  8. Related coverage: goharirfan.me