KAIST researchers said on July 5, 2026, that AI agents running complex tool-using tasks can consume up to 136.5 times more energy per question than conventional chatbot-style generative AI, based on an analysis of real-service workloads using large language models. The finding, reported by Maeil Business Newspaper and echoed in KAIST’s own release, turns the agent boom into an infrastructure story. The question is no longer just whether an AI system can reason, search, calculate, and act; it is whether the data center behind it can afford to let the machine keep thinking.
The Agent Boom Has Found Its Power Bill
The AI industry has spent the past two years selling agents as the next abstraction layer over software. A chatbot answers; an agent plans, invokes tools, writes code, searches the web, checks its own work, and tries again. That leap sounds like intelligence from the user’s side of the screen, but KAIST’s numbers show that it looks like a pileup of repeated model calls, idle GPUs, and elongated response times from the operator’s side.
According to KAIST’s School of Electrical Engineering, the team led by chair professor Yoo Min-soo analyzed computational cost, response latency, energy consumption, and data-center-scale power demand for AI agents. The study frames agents not as a magical upgrade to chatbots but as a new kind of workload that data centers must schedule, power, cool, and pay for. That distinction matters because the agent is not one inference; it is a loop.
The headline number is brutal. For an agent using a 70-billion-parameter large language model, a scale KAIST describes as comparable to today’s commercial AI services, the average energy use reached 348.41 watt-hours per question. Maeil Business Newspaper reported that this is up to 136.5 times the energy required by conventional generative AI answering a simple question.
That does not mean every agent request will burn a third of a kilowatt-hour, and it does not mean every chatbot query is clean by comparison. It means the industry’s preferred direction of travel — more autonomy, more tools, more retries, more “thinking” — carries a measurable and potentially enormous cost. The more the software behaves like a junior employee, the more the infrastructure behaves like a factory floor.
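The two headline figures can be put side by side with a little arithmetic. The sketch below assumes the "up to 136.5 times" multiplier is applied to the 348.41 watt-hour agent average, which the reporting does not spell out, so the chatbot figure it derives is an implied value rather than a number KAIST published.

```python
# Figures as reported by KAIST and Maeil Business Newspaper.
AGENT_WH_PER_QUESTION = 348.41  # average for the 70B-parameter agent
MAX_RATIO_VS_CHATBOT = 136.5    # "up to" multiplier versus a chatbot query

# Assumption: the multiplier is relative to the agent average above.
implied_chatbot_wh = AGENT_WH_PER_QUESTION / MAX_RATIO_VS_CHATBOT

print(f"Implied chatbot energy: {implied_chatbot_wh:.2f} Wh per question")  # 2.55
print(f"Agent energy: {AGENT_WH_PER_QUESTION / 1000:.3f} kWh per question")  # 0.348
```

Under that assumption a simple chatbot answer lands at roughly two and a half watt-hours, which is why the agent figure reads as a third of a kilowatt-hour by comparison.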
Intelligence Is Becoming a Scheduling Problem
The uncomfortable insight in KAIST’s work is that agentic AI wastes time in ways that ordinary benchmark charts tend to hide. The study found that agent response times can rise by as much as 153.7 times compared with existing AI methods, largely because agents repeatedly call the language model while stepping through a task. One AI agent examined by the researchers reportedly invoked the LLM an average of 71 times for a single question.
That is the opposite of the one-shot interaction many users imagine when they type into an AI box. The system may break a task into subtasks, call a model to plan, call a search tool, call the model again to interpret results, call a calculator or code executor, call the model again to revise the plan, and then call it yet again to format the answer. Each step may improve the final output, but each step also stretches the wall-clock time and complicates GPU utilization.
The GPU problem is especially revealing. KAIST found that while agents used external tools such as search and calculation, GPUs spent up to 54.5 percent of total execution time on standby without doing useful computation. In plain English, the expensive accelerator sits there waiting while the agent’s workflow wanders through external calls and orchestration overhead.
This is a different bottleneck from the one that defined the first wave of generative AI scaling. Training massive models was a capital-intensive exercise in assembling enough GPUs and feeding them enough data. Agentic inference is messier: it is interactive, bursty, serial, and dependent on other systems. The model is no longer the whole workload; it is one actor in a procedural drama.
That is why the agent era may be less about raw FLOPS and more about orchestration. The winning systems will not simply be the ones with the biggest models. They will be the ones that know when not to call them.
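To make the standby figure concrete, here is a minimal sketch of how a standby share could be computed from a step-by-step trace of one agent task. The trace, the `Step` type, and the durations are all hypothetical, chosen only to land near the reported 54.5 percent. The sketch also assumes the GPU serves this one request and does nothing else while a tool runs, which real serving stacks may partly avoid through batching.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str       # "model" (GPU busy) or "tool" (GPU waiting)
    seconds: float  # wall-clock duration of the step


def gpu_standby_share(steps):
    """Fraction of wall-clock time spent in tool steps, when the GPU idles."""
    total = sum(s.seconds for s in steps)
    waiting = sum(s.seconds for s in steps if s.kind == "tool")
    return waiting / total if total else 0.0


# Hypothetical trace: plan, search, interpret, calculate, format.
trace = [
    Step("model", 2.0),
    Step("tool", 3.0),
    Step("model", 1.5),
    Step("tool", 2.5),
    Step("model", 1.0),
]

share = gpu_standby_share(trace)
print(f"GPU standby share: {share:.1%}")  # 55.0%
```

Because each step waits for the previous one, shortening tool latency or overlapping tool calls with other requests' model work are the only ways to shrink that share.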
Bigger Models Are Not Always the Smartest Move
The agent pitch often assumes that more reasoning is inherently better. If the system can call a tool, ask itself follow-up questions, and verify its answer, surely it is closer to useful automation. KAIST’s analysis does not reject that premise, but it exposes the cost curve underneath it.
Accuracy gains from additional computation are not linear. Once a model is already capable, extra calls can deliver diminishing returns: a slightly better answer, a cleaner plan, a more confident verification step. The trouble is that electricity, latency, cooling, and infrastructure congestion do not care whether the marginal improvement is philosophically impressive.
This is where the industry’s benchmark culture becomes dangerous. A model that scores a few points higher on a complex reasoning test may look superior in a leaderboard table, but if the agent wrapped around it burns vastly more energy per solved task, the product calculus changes. For consumers, that may show up as slower responses or usage caps. For enterprises, it shows up as cloud bills, procurement delays, GPU scarcity, and sustainability reports that suddenly look harder to defend.
The KAIST team’s proposed answer is not to abandon agents. It is to make them more selective. Maeil Business Newspaper described the researchers’ suggestion of “calculated cognitive reasoning,” where an agent uses models of different sizes and weighs how much computation a task deserves before answering. A simple query should not automatically summon a large model into a multi-step loop.
This idea sounds obvious until you remember how much of the AI industry has been optimized around maximum capability rather than proportionality. If the same heavyweight agent is used to summarize a short email, book a meeting, debug a script, and produce a regulatory analysis, the system is not intelligent in any operational sense. It is merely powerful.
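What proportionality might look like in code is sketched below. This is not the KAIST team's method, which the reporting describes only in outline. It is a toy router under stated assumptions: the tier names, the call caps, and the `estimate_difficulty` heuristic are all hypothetical stand-ins for whatever a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_model_calls: int  # cap on loop iterations allowed at this tier


# Hypothetical tiers; the caps are illustrative, not measured values.
SMALL = Tier("small", 1)
MEDIUM = Tier("medium", 5)
LARGE = Tier("large", 30)


def estimate_difficulty(task: str, needs_tools: bool) -> int:
    """Toy heuristic: long prompts and tool use each add one point."""
    score = 0
    if len(task.split()) > 30:
        score += 1
    if needs_tools:
        score += 1
    return score  # 0, 1, or 2


def choose_tier(task: str, needs_tools: bool = False) -> Tier:
    """Pick the cheapest tier the difficulty estimate allows."""
    return (SMALL, MEDIUM, LARGE)[estimate_difficulty(task, needs_tools)]


print(choose_tier("Summarize this short email").name)               # small
print(choose_tier("Book a meeting for Friday", needs_tools=True).name)  # medium
```

The design point is that the decision happens before the first model call, so a short email summary never enters the multi-step loop at all.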
The Data Center Is Now Part of the Model
KAIST’s most provocative projection takes the argument from the query level to the grid level. The researchers estimated that if 13.7 billion daily requests (roughly the scale of global Google searches cited in the Korean reporting) were handled by AI agents, data center power demand could reach about 198.9 gigawatts. KAIST characterized that as beyond the several-gigawatt AI data center projects now being promoted by governments and companies, and roughly half of total average U.S. electricity consumption.
That projection is hypothetical, but its value is not in predicting a precise future. It clarifies the order of magnitude. If even a fraction of today’s search, office, coding, customer-service, and automation traffic migrates from simple lookups to tool-using agents, the infrastructure burden becomes a first-class constraint.
This is already visible in the market. AI companies talk about model releases, but they negotiate for power. Cloud providers talk about regions, but they fight for substations. Governments talk about AI sovereignty, but they increasingly mean land, transmission lines, cooling water, and energy policy. The software layer has become inseparable from the electrical layer.
For WindowsForum readers, this is not an abstract hyperscaler problem. Windows PCs are already being reimagined as AI endpoints, with local NPUs, cloud-connected copilots, enterprise agents, and developer tools that increasingly assume model access. The question for IT shops is not whether AI features will arrive; it is where the inference runs, how often it runs, and who pays when the agent decides it needs 70 model calls to answer a ticket.
The pressure will land unevenly. A consumer asking an AI assistant to plan a vacation may never see the full energy cost. A sysadmin deploying agentic support workflows across thousands of endpoints will see latency, throttling, authentication complexity, and vendor pricing. A cloud architect will see regional capacity limits and contract language. The same agent demo that looks slick on stage can become a systems problem at scale.
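The grid-level projection earlier in this section can be sanity-checked with back-of-the-envelope arithmetic. The sketch below multiplies the reported daily request count by the reported per-question average and spreads the result evenly over 24 hours. That even spread, and the absence of any cooling or facility overhead, are assumptions here. The reporting does not describe KAIST's exact method, although this simple calculation does land on the same figure.

```python
DAILY_REQUESTS = 13.7e9         # requests per day, as cited in the reporting
AGENT_WH_PER_QUESTION = 348.41  # average watt-hours per agent question
HOURS_PER_DAY = 24

# Energy per day, then average power if the load is spread evenly.
daily_energy_wh = DAILY_REQUESTS * AGENT_WH_PER_QUESTION
average_power_gw = daily_energy_wh / HOURS_PER_DAY / 1e9

print(f"Daily energy: {daily_energy_wh / 1e12:.2f} TWh")  # 4.77
print(f"Average power: {average_power_gw:.1f} GW")        # 198.9
```

Real traffic peaks during the day, so the capacity needed to serve it would sit above this average rather than at it.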
Idle GPUs Are a Symptom of a Young Stack
The most damning detail in the KAIST work may not be the 136.5-times energy figure. It may be the GPU standby figure. When up to 54.5 percent of execution time is spent waiting rather than calculating, the problem is not merely that agents are “expensive.” It is that the stack is immature.
Modern GPUs are astonishingly efficient at dense numerical work when they are fed properly. Agent workloads often fail to feed them properly because they introduce serial dependencies. The agent cannot call the next model step until a search result returns, a tool executes, a code cell finishes, or an intermediate answer is parsed. That creates bubbles in the pipeline.
In conventional computing, this kind of inefficiency invites a familiar set of engineering responses: batching, caching, speculative execution, smaller specialized models, asynchronous orchestration, and better scheduling. The agent world will need all of them, but it will also need a cultural shift. Developers must stop treating the large model call as free glue.
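Two of those responses, caching and asynchronous orchestration, fit in a few lines. The sketch below is illustrative only: `slow_tool` is a hypothetical stand-in for a search or calculator call, and the in-memory dictionary stands in for whatever cache a production system would use.

```python
import asyncio

call_count = 0  # how many times the underlying tool actually ran
_cache = {}     # query -> result; a stand-in for a real cache


async def slow_tool(query: str) -> str:
    """Hypothetical stand-in for a search or calculator call."""
    global call_count
    call_count += 1
    await asyncio.sleep(0.01)  # simulated tool latency
    return f"result for {query}"


async def cached_tool(query: str) -> str:
    """Return a cached result when one exists, otherwise call the tool."""
    if query not in _cache:
        _cache[query] = await slow_tool(query)
    return _cache[query]


async def run_step(queries):
    # Independent lookups run concurrently instead of one after another.
    # Duplicates inside a single batch are not deduplicated in this sketch.
    return await asyncio.gather(*(cached_tool(q) for q in queries))


first = asyncio.run(run_step(["gpu prices", "grid capacity"]))
second = asyncio.run(run_step(["gpu prices", "grid capacity"]))
print(call_count)  # 2: the second batch is served entirely from the cache
```

Neither trick makes the model smarter. Both shorten the time the accelerator spends waiting on work that has already been done or could be done in parallel.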
The irony is that agents are marketed as automation for human inefficiency, yet many current implementations appear to automate computational inefficiency. They replace a user’s sequence of clicks with a machine’s sequence of expensive deliberations. The user’s time may be saved, but the data center absorbs the mess.
This is not necessarily a permanent indictment. Early web applications were inefficient. Early mobile apps were battery hogs. Early virtualization deployments wasted resources before the tooling matured. But those earlier transitions became sustainable only after resource constraints became product requirements. AI agents are approaching that moment faster than their promoters would like.
Microsoft’s AI Ambitions Run Into the Same Physics
Microsoft is not the subject of the KAIST study, but Windows and Azure sit squarely in the blast radius of its conclusions. The company has spent heavily to embed Copilot across Windows, Microsoft 365, GitHub, Security, Power Platform, and Azure. Its strategic direction is unmistakable: AI should move from a chat box into the workflow.
That is exactly where agents become attractive. A Windows or Microsoft 365 agent that can inspect documents, query mail, update calendars, generate scripts, summarize Teams meetings, file tickets, and trigger workflows is more useful than a chatbot that merely explains how to do those things. It is also closer to the workload KAIST measured: repeated model calls, external tools, permissions, and orchestration.
For enterprise IT, this raises a different set of questions than the usual privacy and governance checklist. How many agentic actions will a tenant generate per day? How much of the computation happens in Microsoft’s cloud, on local hardware, or in a hybrid path? Will vendors expose enough telemetry for administrators to distinguish a cheap answer from an expensive task chain?
Windows administrators are used to thinking in terms of CPU, memory, disk, network, and endpoint battery. AI adds a hidden dimension: inference budget. A policy that allows an agent to “work until complete” may sound user-friendly, but in a large organization it becomes a resource allocation rule. Without controls, the most enthusiastic automation users may become the most expensive ones.
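An inference budget can be as simple as a counter that the agent loop must charge before each model call. The sketch below is a hypothetical illustration, not a feature of any Microsoft product: `InferenceBudget`, `run_agent`, and the `is_done` callback are invented names, and the model call itself is left as a comment.

```python
class BudgetExceeded(Exception):
    """Raised when a task asks for more model calls than policy allows."""


class InferenceBudget:
    def __init__(self, max_model_calls: int):
        self.max_model_calls = max_model_calls
        self.used = 0

    def charge(self) -> None:
        """Reserve one model call, or raise if the cap is already reached."""
        if self.used >= self.max_model_calls:
            raise BudgetExceeded(f"cap of {self.max_model_calls} calls reached")
        self.used += 1


def run_agent(task: str, budget: InferenceBudget, is_done) -> dict:
    """Loop until is_done(steps) is true or the budget runs out."""
    steps = 0
    while not is_done(steps):
        try:
            budget.charge()
        except BudgetExceeded:
            # Hand off to a human or a cheaper path instead of looping on.
            return {"status": "escalate", "model_calls": budget.used}
        steps += 1  # a real agent would call the model here
    return {"status": "complete", "model_calls": budget.used}


print(run_agent("reset a password", InferenceBudget(10), lambda n: n >= 3))
print(run_agent("open-ended research", InferenceBudget(10), lambda n: n >= 71))
```

The first task finishes in three calls. The second, which would have needed 71, stops at the cap of ten and escalates, turning "work until complete" into an explicit policy choice.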
There is also a security wrinkle. Agents that use tools need permissions, and permissions shape the search space. A tightly constrained agent may be cheaper and safer because it has fewer places to go and fewer actions to evaluate. A broadly empowered agent may be more capable, but it may also spend more computation deciding what to do, checking what happened, and recovering from mistakes.
The Green AI Debate Just Got Less Abstract
For years, arguments about AI energy use have often collapsed into two unhelpful camps. One side claims AI is an ecological disaster in waiting; the other insists that efficiency improvements and useful applications will justify the load. KAIST’s study is useful because it moves the debate from vibes to workload mechanics.
A simple chatbot query and an agentic task are not the same unit of consumption. Treating them as interchangeable “AI requests” hides too much. An agent that calls a large model dozens of times is closer to a mini workflow than a search query, and its energy profile should be judged accordingly.
That does not automatically make agents wasteful. If an agent prevents a human from spending an hour on a task, catches a security misconfiguration, or automates a costly business process, the energy tradeoff may be reasonable. The correct comparison is not always AI versus nothing; sometimes it is AI versus meetings, rework, travel, downtime, or human labor performed with less precision.
But the burden of proof should shift. Vendors should not be able to sell agentic systems as pure productivity magic while burying the infrastructure cost in cloud opacity. If agents become a default interface for computing, users and organizations will need meaningful visibility into latency, cost, and energy intensity.
The industry has already learned to expose token counts, rate limits, and context windows to developers. The next layer may be energy-aware APIs and orchestration dashboards that show how many model calls a task required, which model tiers were used, how long accelerators waited, and whether the answer was worth the loop. That would turn efficiency from a public-relations slogan into an engineering metric.
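A per-task record of that kind does not need to be elaborate. The sketch below shows one possible shape, assuming an orchestrator that can report each model call and each tool wait. `TaskTelemetry` and its field names are hypothetical and do not correspond to any vendor's API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TaskTelemetry:
    calls_by_tier: Counter = field(default_factory=Counter)
    gpu_seconds: float = 0.0   # time the accelerator spent computing
    wait_seconds: float = 0.0  # time it spent waiting on external tools

    def record_model_call(self, tier: str, seconds: float) -> None:
        self.calls_by_tier[tier] += 1
        self.gpu_seconds += seconds

    def record_tool_wait(self, seconds: float) -> None:
        self.wait_seconds += seconds

    def summary(self) -> dict:
        """Flatten the record into something a dashboard could display."""
        return {
            "model_calls": sum(self.calls_by_tier.values()),
            "calls_by_tier": dict(self.calls_by_tier),
            "gpu_seconds": self.gpu_seconds,
            "wait_seconds": self.wait_seconds,
        }


t = TaskTelemetry()
t.record_model_call("small", 0.5)
t.record_tool_wait(3.0)
t.record_model_call("large", 2.0)
print(t.summary())
```

With a record like this attached to every task, an administrator can separate a one-call answer from a long task chain without guessing from the bill.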
The Next AI Race Is for Restraint
The most interesting phrase in the KAIST reporting is not “136.5 times.” It is “co-design.” Professor Yoo argued that as agents become common, AI models, data center infrastructure, and power infrastructure will need to be optimized together. That is a much broader claim than “make chips faster.”
Co-design means the model architecture cannot be separated from the scheduler, the data center cannot be separated from the grid, and the product experience cannot be separated from the cost of inference. A system that chooses a smaller model for a simple task is not merely cheaper; it is better designed. A system that avoids unnecessary tool calls is not less intelligent; it is more disciplined.
This will be hard for an industry addicted to maximal demos. Agent showcases thrive on spectacular chains of action: open the browser, search the web, write code, run it, fix the error, prepare a report, send an email. The more steps, the more magical it looks. KAIST’s work reminds us that every step has a meter attached.
The likely future is not one agent to rule them all. It is a hierarchy of models and policies. Small local models will handle routine classification and drafting. Medium models will manage structured workflows. Large frontier models will be reserved for ambiguous, high-value, or high-risk tasks. External tools will be called when they are needed, not because the agent framework makes it easy.
For Windows users, that future may feel less glamorous than the current marketing. The best AI feature may be the one that quietly decides not to invoke the cloud. The best enterprise agent may be the one that solves 80 percent of tickets with a small model and escalates only the hard cases. The best developer assistant may be the one that knows when a static analyzer is cheaper and more reliable than another round of synthetic reasoning.
The Numbers Turn AI Policy Into Infrastructure Policy
National AI strategies often speak in the language of talent, chips, models, and regulation. KAIST’s projection forces a more physical vocabulary. Power generation, grid interconnection, land use, water, cooling, and regional capacity are now part of the AI stack.
This is especially important for countries trying to build sovereign AI capability. Buying GPUs is difficult; powering them continuously is harder. A state-backed AI data center may look impressive on a press release, but if agent workloads multiply demand faster than efficiency improves, the bottleneck moves from semiconductor supply to electrical planning.
Companies face the same constraint in miniature. A CIO deciding whether to deploy agents across customer support, software development, security operations, finance, and HR is making an infrastructure decision even if the vendor sells it as a subscription feature. The cost will appear somewhere: in per-seat pricing, usage-based billing, throttled performance, regional availability, or reduced margins hidden inside a cloud bundle.
This may also change procurement language. Enterprises should ask vendors not only which model powers an agent, but how many model calls typical tasks require, what model tiers are invoked, whether smaller models are used by default, and what telemetry administrators can audit. “AI-ready” infrastructure should mean more than access to accelerators; it should mean policies for using them rationally.
The organizations that get this right will treat agent deployment like any other production system. They will profile workloads, set budgets, measure outcomes, and kill automations that do not earn their keep. The ones that get it wrong will discover that a thousand tiny conveniences can add up to one large power bill.
The Hippo in the Server Room Has a Few Clear Lessons
KAIST’s analysis should not be read as an obituary for AI agents. It should be read as the first serious invoice. The technology can still be useful, but the era of pretending that every extra reasoning step is an unpriced improvement should end.
- AI agents are not just chatbots with better manners; they are multi-step workloads that can call large models dozens of times for a single user request.
- KAIST found that a 70-billion-parameter agent consumed an average of 348.41 watt-hours per question, up to 136.5 times more than a conventional generative AI query.
- The same study found response-time increases of up to 153.7 times and GPU standby time of up to 54.5 percent during external tool use.
- The most promising fix is not simply more data center capacity, but smarter routing among small, medium, and large models based on task difficulty.
- Enterprises deploying agents should demand telemetry for model calls, latency, cost, and workload behavior before treating agentic AI as ordinary software.
- The long-term AI race will reward systems that can reason effectively while refusing unnecessary computation.