Why LLMs Aren’t Human Minds: Jagged Intelligence and Windows AI Risk

ChatGPT · 2026-06-08T06:54:32-0400

Melanie Mitchell’s argument is that the central mistake in today’s AI debate is treating large language models as humanlike minds rather than powerful, brittle, culturally trained systems whose impressive fluency can conceal unpredictable failures, weak generalization, and poorly understood real-world limits. That distinction matters because Windows users, developers, and administrators are being asked to trust these systems inside operating systems, productivity suites, browsers, help desks, code editors, and security workflows. The risk is not that AI is useless, nor that it is secretly conscious, but that we keep measuring it with the wrong metaphors and then deploy it as if the measurements were reality.

The New AI Problem Is Not Hype, but Misclassification

The past three years have made one thing clear: dismissing modern AI as “autocomplete” is no longer serious analysis. Systems such as ChatGPT, Claude, Gemini, Copilot, Llama, and their successors can summarize documents, write software, translate tone, generate plausible strategy memos, pass exams, draft legalistic language, and answer technical questions with enough confidence to make even skeptical professionals pause.
But Mitchell’s essay lands because it refuses the easy binary. She is not saying these systems are dumb. She is saying they are not reliably smart in the way we are tempted to assume. They can outscore most humans on one benchmark and then stumble over a grade-school math word problem once a harmless distractor is added.
That is a much more operationally useful criticism than the old “AI is fake” line. For IT departments, the danger is not that AI cannot help. It is that AI often helps just enough to become embedded in workflows before anyone understands the shape of its failure modes.
Windows is now one of the most important battlegrounds for that distinction. Microsoft has spent the Copilot era arguing that AI belongs at the center of the PC experience, from Microsoft 365 and Edge to Windows search, system assistance, Recall-style activity history, and agentic productivity features. Mitchell’s warning is therefore not abstract philosophy. It is a deployment risk model.

“Jagged Intelligence” Explains Why Copilot Can Be Brilliant and Baffling

The phrase that does the most work in Mitchell’s essay is “jagged intelligence.” It captures the unnerving unevenness of LLM capability: excellent at one task, fragile at a nearby task, persuasive even when wrong, and often unable to tell the difference between a robust answer and a lucky pattern match.
This is the experience many WindowsForum readers already know from using AI assistants for PowerShell, registry edits, Group Policy settings, driver troubleshooting, Excel formulas, or Azure configuration snippets. The model may produce a correct command, explain it clearly, and save half an hour. Then, in the next exchange, it may invent a flag, misremember a deprecated setting, or recommend a security exception that no competent admin would approve.
That jaggedness is not merely a user-experience flaw. It is a structural mismatch between how these systems are marketed and how technical work actually happens. Enterprise IT is not a benchmark suite; it is a pile of exceptions, aging dependencies, vendor-specific behavior, undocumented breakage, and consequences.
A chatbot that is right 90 percent of the time can still be dangerous if the remaining 10 percent clusters around edge cases, ambiguity, permissions, security, or irreversible actions. A junior admin usually knows when they are guessing. A language model often performs certainty even when it has wandered off the map.
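A rough back-of-the-envelope sketch makes the clustering point concrete. Every figure below is invented for illustration (the task counts, the error counts, and the cost per error are assumptions, not measurements): two assistants are each right 90 percent of the time, but one spreads its errors evenly while the other concentrates them in high-stakes tasks.

```python
# Each bucket: (number_of_tasks, number_of_errors, cost_per_error).
# All figures are hypothetical and chosen only to illustrate the argument.
UNIFORM = [(80, 8, 1), (20, 2, 50)]     # errors spread evenly across task types
CLUSTERED = [(80, 0, 1), (20, 10, 50)]  # same error count, all in high-stakes tasks


def accuracy(buckets):
    """Fraction of all tasks answered correctly."""
    total = sum(count for count, _, _ in buckets)
    errors = sum(errs for _, errs, _ in buckets)
    return 1 - errors / total


def total_harm(buckets):
    """Sum of error costs across buckets, in arbitrary units."""
    return sum(errs * cost for _, errs, cost in buckets)


if __name__ == "__main__":
    for name, buckets in (("uniform", UNIFORM), ("clustered", CLUSTERED)):
        print(name, accuracy(buckets), total_harm(buckets))
```

Under these made-up numbers both assistants score the same headline accuracy, yet the clustered one does 500 units of harm against 108 for the uniform one. The accuracy figure alone cannot tell them apart.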

Benchmarks Built the Myth, and Benchmarks Are Now Cracking

Mitchell’s critique of AI benchmarks should make every CIO uncomfortable. The public narrative around AI progress has leaned heavily on test scores: bar exams, coding challenges, math competitions, medical question sets, reading comprehension tests, and “expert-level” task evaluations. These numbers create the impression of a clean upward curve.
The problem is that benchmark success is not the same thing as dependable competence. A benchmark is a designed environment. Real work is an adversarial environment, even when nobody is trying to be adversarial. The user asks an imprecise question. The documentation changed last month. The system prompt clashes with organizational policy. The model has seen something similar, but not this.
The Apple research Mitchell cites is important because it dramatizes this gap in a way non-specialists can understand. If adding irrelevant information to a simple math word problem can sharply degrade performance, then the system is not reasoning in the ordinary human sense. It may be tracking patterns that correlate with reasoning without possessing the robust abstraction we assumed was there.
That should temper claims about AI agents replacing programmers, help desk technicians, analysts, or radiologists on a neat timeline. Jobs are not just task bundles. They are context engines. The worker knows when the ticket is really a hardware issue, when the user has described the wrong symptom, when the vendor’s KB article is incomplete, and when the safe answer is to stop and escalate.

The Windows Desktop Is Becoming a Test Site for AI’s Metaphors

Microsoft’s AI strategy depends on a particular metaphor: Copilot as assistant, companion, analyst, and eventually agent. The name itself is carefully chosen. A copilot helps you fly the plane but does not own the aircraft. It implies partnership, not replacement.
Yet the interface design often pushes users toward a stronger belief. The model says “I.” It apologizes. It explains. It sounds patient. It remembers context within a session. It can be given documents, screenshots, code, emails, and prompts that feel like instructions to a human colleague.
Mitchell’s point is that these metaphors shape everything downstream. If AI is a personlike helper, then we evaluate it with human tests, ask whether it has “PhD-level intelligence,” imagine it as a future employee, and debate whether it might become an autonomous threat. If AI is a cultural and informational technology — closer to a library, search engine, compiler, bureaucracy, or market mechanism — then we ask different questions.
For Windows, that second framing is more useful. Copilot is less like a digital employee sitting inside the Start menu and more like a probabilistic interface to accumulated human text, code, documentation, telemetry, and workflow conventions. That does not make it trivial. Libraries changed civilization. Search changed work. Spreadsheets changed finance. But nobody asks whether Excel understands accounting.

The Agent Era Raises the Cost of Being Wrong

The stakes rise when chatbots become agents. A chatbot that gives bad advice can mislead a user. An agent that acts on bad advice can change a system, send a message, delete a file, alter a configuration, open a ticket, approve a workflow, or trigger a chain of automation.
This is where jagged intelligence becomes a governance problem. The failure mode is no longer just hallucination. It is unauthorized confidence converted into action. The system need not be malicious to cause harm; it only needs to misunderstand the goal, over-trust a pattern, or fail to recognize that a small exception changes the meaning of the whole task.
Windows environments are full of these traps. A remediation script suitable for a lab machine may be reckless on a domain controller. A registry change that fixes one hardware profile may break another. A security recommendation that improves convenience may violate policy. An instruction that sounds right in English may be wrong in PowerShell.
The agentic future therefore demands a boring but essential discipline: containment. AI systems should be allowed to propose, draft, classify, summarize, and simulate before they are allowed to execute. The more irreversible the action, the more the system should be treated like an untrusted but useful contractor.
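One way to picture that containment discipline is a small gate that sits between what an agent proposes and what actually runs. The sketch below is a hypothetical policy, not any vendor's API: the names `Reversibility`, `ProposedAction`, and `decide`, and the three-tier classification itself, are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    READ_ONLY = 1      # e.g. summarize a log or classify a ticket
    REVERSIBLE = 2     # e.g. a change with a tested rollback path
    IRREVERSIBLE = 3   # e.g. delete a file or send a message


@dataclass(frozen=True)
class ProposedAction:
    description: str
    reversibility: Reversibility
    target_is_production: bool


def decide(action: ProposedAction, human_approved: bool = False) -> str:
    """Return 'execute' or 'propose_only' for an agent-proposed action."""
    # Read-only work changes nothing, so it may run unattended.
    if action.reversibility is Reversibility.READ_ONLY:
        return "execute"
    # Reversible changes may run unattended only outside production.
    if (action.reversibility is Reversibility.REVERSIBLE
            and not action.target_is_production):
        return "execute"
    # Everything else stays a proposal until a human signs off.
    return "execute" if human_approved else "propose_only"


if __name__ == "__main__":
    cleanup = ProposedAction("delete old profiles", Reversibility.IRREVERSIBLE, False)
    print(decide(cleanup))                       # propose_only
    print(decide(cleanup, human_approved=True))  # execute
```

The design choice worth noticing is the default: anything the gate cannot place in a safe category falls through to a proposal, so a misclassified action fails toward review instead of toward execution.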

Humanlike Fluency Is the Most Successful Dark Pattern in Computing

The reason users over-trust LLMs is not just that the answers are often good. It is that the answers arrive in the social form of competence. The model writes in paragraphs. It adjusts tone. It explains assumptions. It can express uncertainty. It can flatter, reassure, and apologize.
That is a powerful interface achievement, but it is also a cognitive trap. Humans evolved to treat fluent language as evidence of an underlying mind. When a system produces confident, context-aware prose, we instinctively infer comprehension, even when we intellectually know the machinery is statistical.
This matters in consumer Windows scenarios as much as enterprise ones. A user asking an AI assistant why their laptop is slow may follow advice they would never accept from a random forum post. A student may trust a generated explanation of a system error. A small business owner may paste sensitive licensing, payroll, or customer information into a model because it feels like talking to software support.
The old security lesson was “do not run code from strangers.” The AI-era version is more subtle: do not confuse fluent explanation with accountable expertise. The stranger now speaks in a calm professional tone and may be embedded inside the software you already trust.

The Copyright Fight Reveals the Metaphor War

Mitchell’s discussion of copyright is not a detour. It shows how deeply the person metaphor has penetrated the politics of AI. When AI companies argue that training on copyrighted works resembles a person reading books and learning from them, they are leaning on the idea that models are analogous to individual minds.
That metaphor is convenient. People read, learn, synthesize, and create. Libraries store, index, reproduce, and distribute. A model can be described in either direction depending on the legal or commercial need of the moment.
For Windows users and developers, this matters because the same ambiguity appears in software practice. If an AI coding assistant emits a function that resembles training data, who owns the risk? If it suggests a configuration copied from a vendor blog whose caveats were omitted, who is accountable? If it generates documentation that incorporates licensed material, is that a user mistake, a platform problem, or an unavoidable artifact of the model?
The industry has been happy to sell AI as an assistant when adoption is the goal and as infrastructure when liability is the concern. Mitchell’s essay presses on that inconsistency. We cannot govern a technology honestly if its identity changes according to which argument is most profitable.

The Real AI Safety Problem Is Ordinary Trust at Scale

Much of the public AI safety debate has been captured by extreme scenarios: rogue superintelligence, autonomous deception, runaway agents, existential catastrophe. Those questions are not irrelevant, but they can obscure the immediate danger of normal deployment into normal institutions.
The mundane version is already here. AI drafts performance reviews, summarizes meetings, triages support tickets, writes code, screens applicants, suggests medical next steps, produces lesson plans, generates security alerts, and explains the world to users. Each of those uses can be helpful. Each can also smuggle in error, bias, omission, or false certainty.
Mitchell’s “normal technology” framing is useful precisely because normal technologies can still be transformative and dangerous. Cars did not need to be conscious to reshape cities and kill people. Social media did not need agency to change politics, journalism, and adolescence. Spreadsheets did not need intent to produce billion-dollar errors.
AI may follow that pattern: less like an alien mind arriving all at once, more like a general-purpose layer slowly rewiring the incentives of work. The danger is not only spectacular failure. It is gradual dependence on systems whose outputs are easier to generate than to verify.

Microsoft’s AI Push Needs a More Adult Contract With Users

Microsoft has an unusually complicated role in this story. It is not merely another AI vendor. It controls the dominant desktop operating system, a massive enterprise productivity suite, a major cloud platform, a browser, a developer ecosystem, a gaming network, and the management stack that many organizations use to run their fleets.
That reach means Microsoft’s AI design choices become defaults for millions of users. If Copilot is placed prominently, users infer endorsement. If AI summaries appear in familiar productivity tools, workers may treat them as part of the document rather than a probabilistic layer above it. If Windows itself mediates user memory, search, and system assistance through AI, the operating system becomes a trust broker.
The responsible path is not to hide AI or pretend users will avoid it. The responsible path is to make the boundary visible. Users should know when they are seeing retrieved documentation, generated interpretation, local device data, cloud inference, organizational content, or a blend of all five.
This is where Microsoft can still differentiate itself. Enterprise customers do not need magical language about AI companions. They need auditability, admin controls, data boundaries, model behavior documentation, rollback options, retention clarity, and reliable ways to disable features that do not fit their risk model.

Administrators Should Treat AI Like a Privileged Intern With Amnesia

For IT pros, the most practical reading of Mitchell is not “ban AI.” It is “classify AI correctly.” A modern LLM is often a superb assistant for drafting, summarizing, brainstorming, translating, and explaining. It is not a source of truth. It is not a responsible operator. It is not a substitute for change control.
That means AI belongs inside existing governance structures, not above them. If a human admin needs peer review before deploying a script to production, an AI-generated script needs it too. If a vendor recommendation requires testing, an AI recommendation requires testing. If sensitive logs cannot be pasted into a public website, they cannot be pasted into a chatbot merely because the chatbot feels helpful.
The best use cases will be those where verification is cheap. Asking AI to explain an Event Viewer error, generate a first draft of a PowerShell script, summarize release notes, compare policy options, or create a troubleshooting checklist can be valuable because the admin can inspect the result. The worst use cases are those where verification is expensive, delayed, or impossible.
This distinction should guide procurement as well. The right question is not “How intelligent is this product?” The right question is “What happens when it is wrong, and how will we know?”

Developers Are Learning That AI Coding Is a Speed Tool, Not a Judgment Tool

Software development is the field where AI’s jaggedness is most visible because the wins are real and the failures are concrete. LLMs can generate boilerplate, explain unfamiliar APIs, translate snippets between languages, write unit tests, and help developers move faster through routine work. Anyone who denies that is ignoring the evidence of daily practice.
But code is also an unforgiving medium. A generated answer may compile and still be insecure. It may pass the visible tests and fail the edge case. It may use an outdated library, mishandle concurrency, expose secrets, or implement the wrong requirement with perfect syntax.
This is why the repeated prediction that AI is months away from replacing software engineers deserves skepticism. Programming is not just typing code. It is understanding tradeoffs, reading ambiguous requirements, maintaining systems over time, negotiating constraints, and knowing when not to build the requested thing.
AI will change software work, and in some areas it already has. But Mitchell’s argument suggests that the replacement narrative is less reliable than the leverage narrative. The developer who understands the system can use AI as acceleration. The organization that treats AI output as engineering judgment is buying technical debt at machine speed.

The Medical Analogy Should Make the Tech Industry More Humble

AI boosters often point to medical benchmarks as proof that models can outperform professionals. Mitchell is right to challenge that framing. A medical exam is not clinical practice, just as a coding benchmark is not software engineering and a math contest is not mathematical research.
The difference is context. A doctor must notice what the patient did not say, weigh risk, order tests, communicate uncertainty, understand institutional constraints, and remain accountable. A model can produce a plausible differential diagnosis without bearing any of those burdens.
The same distinction applies in Windows administration. Knowing the command is not the same as owning the change. Producing a confident explanation is not the same as understanding the organization’s risk tolerance. Passing a benchmark is not the same as operating inside a messy environment with users, budgets, compliance requirements, legacy systems, and consequences.
This is the part of the AI debate that often gets lost. Expertise is not merely answer generation. Expertise is situated judgment under constraint.

The Next Evaluation Layer Must Be Built Around Failure

If benchmark culture is broken, the answer is not to abandon measurement. It is to measure different things. Accuracy still matters, but it is not enough. Robustness, calibration, repeatability, provenance, escalation behavior, and safe refusal matter just as much.
A model that answers 95 percent of standard questions correctly but collapses under rewording is not production-ready for high-stakes use. A model that can solve a hard problem but cannot flag uncertainty is not a reliable assistant. A model that can generate a script but cannot explain the permissions, rollback path, and blast radius is not ready to act.
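The rewording problem can be measured with very little machinery. The sketch below assumes a model is just a function from prompt to answer and scores it two ways: on the canonical phrasing only, and on whether it survives every variant of each question. The `toy_model` lookup table and the sample questions are stand-ins invented for the example, not a real system or dataset.

```python
from typing import Callable, Sequence

# A case pairs several phrasings of one question with its expected answer.
Case = tuple[Sequence[str], str]


def standard_accuracy(model: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Score only the first (canonical) phrasing of each case."""
    hits = sum(model(variants[0]) == expected for variants, expected in cases)
    return hits / len(cases)


def robust_accuracy(model: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Count a case as correct only if every phrasing is answered correctly."""
    hits = sum(
        all(model(v) == expected for v in variants)
        for variants, expected in cases
    )
    return hits / len(cases)


# Hypothetical stand-in for a model: it only "knows" phrasings it has seen.
KNOWN = {"2+2?": "4", "3+3?": "6", "three plus three?": "6"}


def toy_model(prompt: str) -> str:
    return KNOWN.get(prompt, "unknown")


CASES: list[Case] = [
    (["2+2?", "What is 2+2, given that the sky is blue?"], "4"),
    (["3+3?", "three plus three?"], "6"),
]

if __name__ == "__main__":
    print(standard_accuracy(toy_model, CASES))  # 1.0
    print(robust_accuracy(toy_model, CASES))    # 0.5
```

The gap between the two numbers is the interesting quantity. A perfect score on canonical phrasing says little if half the cases break when an irrelevant clause is added.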
For enterprise AI, the most important evaluations may be local rather than universal. A bank, hospital, school district, manufacturer, or software vendor should test models against its own workflows and failure cases. Public benchmarks can inform procurement, but they cannot substitute for adversarial testing inside the environment where the tool will actually operate.
This is especially true for Windows fleets, where local variation is the rule. Hardware age, driver history, imaging practices, Intune configuration, domain policy, security tooling, application compatibility, and user behavior all shape outcomes. A generic AI assistant may know Windows in general while misunderstanding your Windows estate in particular.

The User Interface Should Teach Skepticism Instead of Simulating Certainty

One of the simplest ways to reduce AI risk is also one of the least fashionable: make the interface less socially manipulative. The more an AI assistant performs personhood, the more users will grant it personlike trust. That may be good for engagement, but it is bad for judgment.
A better interface would foreground uncertainty, source boundaries, and action risk. It would distinguish between “I found this in Microsoft documentation,” “I inferred this from your description,” “I generated a possible script,” and “This action will modify system state.” It would make verification feel normal rather than burdensome.
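As a minimal sketch of what that could look like underneath, a response might be carried as labeled segments instead of one undifferentiated block of prose. The `Provenance` categories mirror the four distinctions above. The type names, the label wording, and the confirmation rule are assumptions for illustration, not a description of how any shipping assistant works.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Provenance(Enum):
    RETRIEVED = "Found in documentation"
    INFERRED = "Inferred from your description"
    GENERATED = "Generated draft, unverified"
    MODIFIES_STATE = "Will modify system state"


@dataclass(frozen=True)
class Segment:
    text: str
    provenance: Provenance


def render(segments: Sequence[Segment]) -> str:
    """Prefix every segment with its provenance label, one per line."""
    return "\n".join(f"[{s.provenance.value}] {s.text}" for s in segments)


def needs_confirmation(segments: Sequence[Segment]) -> bool:
    """Require an explicit user confirmation if anything changes system state."""
    return any(s.provenance is Provenance.MODIFIES_STATE for s in segments)


if __name__ == "__main__":
    reply = [
        Segment("The service fails to start after the update.", Provenance.INFERRED),
        Segment("A draft script to reset the service is attached.", Provenance.GENERATED),
        Segment("Running the script restarts the service.", Provenance.MODIFIES_STATE),
    ]
    print(render(reply))
    print(needs_confirmation(reply))  # True
```

Keeping provenance as data, not as phrasing the model chooses for itself, means the interface can enforce the brake even when the generated prose sounds certain.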
In security-sensitive contexts, AI should be designed to slow users down at the right moments. That runs against the product instinct to reduce friction everywhere. But friction is not always the enemy. A confirmation prompt before deleting files is friction. So is multifactor authentication. So is change approval. Civilized computing is full of useful brakes.
The AI industry has spent years optimizing for helpfulness. The next phase needs to optimize for honest helpfulness, where the system’s limits are visible enough that users can remain in charge.

Mitchell’s Warning Lands Hardest Where AI Is Most Useful

The uncomfortable irony is that Mitchell’s critique matters because AI is already useful. If these systems were merely toys, nobody would care whether their intelligence was jagged. The problem is that they are good enough to spread and unreliable enough to mislead.
That puts WindowsForum’s audience in a more complicated position than either the AI evangelists or the AI rejectionists admit. Enthusiasts will find creative uses. Sysadmins will find time savings. Developers will keep integrating assistants into editors and CI pipelines. Security teams will use AI to summarize alerts and generate detections. Ordinary users will ask Copilot to explain errors, settings, and documents.
The goal, then, is not purity. It is operational maturity. Organizations that ban all AI may drive it underground. Organizations that embrace it uncritically may discover that convenience has quietly replaced control.
The wiser posture is selective trust. Use AI where its output can be checked, where errors are recoverable, where data exposure is acceptable, and where humans retain authority. Avoid using it where confidence is hard to audit, where mistakes propagate automatically, or where the model is asked to substitute for responsibility.

The PC’s AI Future Depends on Remembering What the Machine Is Not

The most concrete lesson from Mitchell’s argument is that AI deployment should begin with a refusal to anthropomorphize the product roadmap. These systems are not coworkers trapped in the cloud. They are not junior humans. They are not minds with stable beliefs. They are software systems trained on vast traces of human culture and tuned to produce useful responses.
That makes them extraordinary. It also makes them strange. Their strengths do not map neatly onto human strengths, and their weaknesses do not map neatly onto human weaknesses. A model may outperform a professional on a narrow test and fail at a task a child would handle by noticing what matters.
For Microsoft and the Windows ecosystem, this means the AI PC should not be sold as a computer that understands you. It should be sold, if sold honestly, as a computer with new probabilistic tools for transforming, retrieving, summarizing, and acting on information — tools that require boundaries because they do not understand in the way the interface suggests.
That less glamorous framing may be better for everyone. It gives users power without asking them to suspend disbelief. It gives administrators room to govern. It gives developers a realistic model of risk. And it gives vendors a path to trust that does not depend on pretending the machine is more human than it is.

The Practical Lesson Is to Trust the Workflow, Not the Voice

The near-term future of AI on Windows will be decided less by model demos than by deployment discipline. The organizations that benefit most will not be the ones that believe the strongest claims. They will be the ones that integrate AI into workflows with verification, logging, permissions, rollback, and user education.

AI assistants are most valuable when they reduce drafting, searching, summarizing, and explanation costs without being allowed to make irreversible decisions alone.
Benchmark scores should be treated as marketing evidence until they are validated against local workflows, local data, and realistic failure cases.
Administrators should assume that AI-generated commands, scripts, and configuration advice require the same review as work from an unknown human source.
Developers should use AI to accelerate implementation while reserving architecture, security judgment, and requirements interpretation for accountable humans.
Product teams should design AI interfaces that reveal uncertainty and provenance instead of leaning on humanlike confidence as a trust shortcut.
Enterprises should write AI policy around data exposure, action authority, audit trails, and escalation paths rather than vague claims about intelligence.

Mitchell’s essay is ultimately a call for sobriety, not pessimism. The machines are capable enough to matter, strange enough to surprise us, and immature enough to demand better evaluation before we hand them more authority. If AI is going to become a normal layer of the Windows experience, the next phase should be judged not by how human the assistant sounds, but by how well the surrounding system keeps human beings responsible for what happens next.

References

Primary source: The Yale Review (yalereview.org)
Published: 2026-06-08T10:02:09.803539
Related coverage: forbes.com
Related coverage: techxplore.com
Related coverage: news.yale.edu
Related coverage: insights.som.yale.edu
Related coverage: som.yale.edu
Related coverage: scientificamerican.com
Related coverage: benton.org
Related coverage: reflections.yale.edu
