Age of Empires II Goat Computer Paper Challenges Human-Like AI Claims

Adrian de Wynter’s new paper, “If LLMs Have Human-Like Attributes, Then So Does Age of Empires II,” published in late May 2026, uses Microsoft’s classic real-time strategy game to challenge how AI researchers attribute human qualities to large language models. The joke is the delivery mechanism, not the argument. By building working computation inside Age of Empires II: Definitive Edition, de Wynter turns a beloved strategy sandbox into a critique of a much larger scientific habit: confusing behavior, interface, and metaphor for evidence of mind.
The result is one of those rare AI papers that works because it is funny and because the joke gets sharper the longer you sit with it. If a neural network implemented with goats, palisade walls, terrain, and scenario triggers can satisfy the same style of reasoning used to describe LLMs as anxious, moral, self-aware, or socially intelligent, then perhaps the problem is not that Age of Empires II has hidden personhood. Perhaps the problem is that parts of AI research have become too comfortable treating anthropomorphic language as a finding rather than a hypothesis.

Image: Strategy game scene showing goat-powered “packet delivery” with visible neural-network and NAND gate UI overlays.

The Goat Computer Is a Gag With Teeth

The core demonstration sounds like a late-night forum challenge: build a working neural network inside Age of Empires II. De Wynter does this by using the game’s scenario editor to create logic gates, including NAND gates, and then composing those primitive operations into a simple trainable system. The details matter because NAND gates are not decorative; NAND is functionally complete, meaning every other Boolean operation (NOT, AND, OR, XOR) can be built from NAND gates alone.
In plain English, if you can build reliable NAND gates and connect them at sufficient scale, you can in principle build any digital circuit, and therefore arbitrary computation. That does not mean a medieval goat economy is about to host the next frontier model. It means the underlying substrate can be made to perform the formal operations required for computation, however painfully and absurdly.
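To make the universality claim concrete, here is a minimal Python sketch. It is not the paper's in-game construction: it assumes an idealized two-input `nand` function and derives NOT, AND, OR, XOR, and a one-bit full adder from that single primitive. All function names are illustrative.

```python
def nand(a: int, b: int) -> int:
    """The only primitive: output is 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


def not_(a: int) -> int:
    # Feeding the same signal to both inputs inverts it.
    return nand(a, a)


def and_(a: int, b: int) -> int:
    # AND is NAND followed by an inverter.
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    # De Morgan: a OR b == NOT(NOT a AND NOT b).
    return nand(not_(a), not_(b))


def xor_(a: int, b: int) -> int:
    # Standard four-gate NAND construction of XOR.
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (sum bit, carry bit) for two input bits."""
    return xor_(a, b), and_(a, b)


def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Return (sum bit, carry-out bit) for three input bits."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, or_(c1, c2)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", full_adder(a, b, 0))
```

Chaining full adders gives multi-bit arithmetic, which is the sense in which "enough NAND gates" scales up to general digital logic, whatever the gates are physically made of.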
The paper’s best move is that it refuses to leave the joke at “look, a video game can compute.” Many systems can compute, and enthusiasts have proven similar points in Minecraft, cellular automata, spreadsheets, and other unlikely places. What makes the Age of Empires II version more pointed is that de Wynter connects the computational proof to how claims about LLMs are framed.
Once a system is computationally universal, the boundary between “this substrate can implement intelligence-like behavior” and “this substrate therefore has human-like attributes” becomes slippery. That slipperiness is exactly the target. De Wynter is not saying your scout cavalry has moral agency or that the sheep near your town center are secretly doing phenomenology. He is saying that if a method would force you to accept those conclusions in a game engine, then the method may be smuggling in the answer before the experiment begins.
The goats are funny because they make the smuggling visible. When an LLM produces fluent prose, users and researchers can forget they are interacting with a statistical system mediated through an interface designed to be socially legible. When a goat moves down a rail in Age of Empires II, nobody is tempted to say it is anxious. The absurdity strips away the glamour.

AI’s Anthropomorphism Problem Is Not Just Marketing

The public debate about AI anthropomorphism often focuses on consumer products. Chatbots say “I,” apologize, flatter, remember your preferences, and increasingly present themselves as companions, tutors, therapists, advisers, and workplace deputies. The design incentives are obvious: a system that feels personable is easier to use, easier to sell, and easier to trust.
But de Wynter’s paper points to a more uncomfortable problem inside the research culture itself. If studies begin with the premise that an LLM has something like empathy, anxiety, theory of mind, moral reasoning, or self-awareness, the experiment can become a measuring device for the researcher’s assumptions. A benchmark built to detect a human trait may instead detect a model’s ability to imitate the linguistic surface of that trait.
That distinction is not pedantry. It is the difference between saying “this system can produce text that humans interpret as empathetic” and saying “this system has empathy.” The first is an empirical claim about observable behavior. The second is a much stronger claim about internal properties, and it demands a much heavier evidentiary burden.
The paper reportedly reviews hundreds of recent AI papers and finds a striking pattern: when researchers explicitly set out to test whether LLMs have human-like properties, many conclude that they do. That does not automatically make the work invalid. It does, however, raise the familiar scientific alarm bell of confirmation bias, especially in a field where benchmarks can be fragile, prompts can steer outcomes, and the object under study is optimized to produce plausible human language.
The irony is that AI researchers know this problem well in other contexts. They warn users not to confuse fluency with truth, confidence with accuracy, or chain-of-thought text with actual reasoning. Yet the temptation returns when the outputs look psychologically rich. A model that says the right things in the right tone can make even technical observers reach for human vocabulary before they have earned it.

The Paper Attacks a Premise, Not the Existence of Machine Intelligence

The easiest misreading of de Wynter’s argument is that it is a dunk on AI itself. It is not. The paper does not prove that LLMs are useless, unintelligent, or incapable of supporting systems that display genuinely interesting emergent behavior. Nor does it settle old philosophical fights about consciousness, computation, or whether minds can be implemented in non-biological substrates.
Its target is narrower and more damaging: the habit of treating human-like attributes as substrate-independent when convenient, then retreating to human intuition when the same logic leads to absurd places. If a test says an LLM has anxiety because it produces certain responses under certain conditions, what prevents an equivalent computational process in a different substrate from receiving the same label? If the answer is “because this one looks like a chatbot and that one looks like goats in a video game,” then the criterion is not scientific enough.
That is why Age of Empires II is such a clever choice. It is not a purpose-built AI laboratory. It is a 1999 real-time strategy game, modernized in Definitive Edition form, beloved for build orders, villager pathing, monk conversions, and competitive play. Its interface carries none of the cultural cues we associate with artificial minds.
By forcing computation through that interface, de Wynter separates the formal system from the social theater around it. The same kind of claim that sounds plausible when made about a conversational model begins to sound ridiculous when made about a map editor contraption. The paper’s wager is that the ridiculous version reveals something about the plausible one.
This does not mean every anthropomorphic term should be banned. Scientists use metaphors all the time. We say systems “learn,” “remember,” “attend,” and “prefer,” even when those words do not map cleanly onto human cognition. The issue is whether the metaphor is being used as shorthand for a defined mechanism or quietly promoted into evidence of a mental state.

The Null Assumption Is Boring, Which Is Why It Matters

De Wynter’s practical recommendation is essentially methodological discipline: begin with a null assumption rather than a romantic one. Do not start from “LLMs are human-like and now we will measure how.” Start from “we do not know whether this attribute applies, and our test must be capable of disconfirming it.” That sounds obvious because it is the skeleton of good empirical work.
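One simple way to make a null assumption explicit is to state the chance baseline up front and ask how surprising the observed score would be under it. The sketch below is an illustration, not the paper's procedure: it assumes a multiple-choice benchmark where guessing succeeds with probability `chance`, and the function names and the 0.05 threshold are hypothetical choices.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, chance: float) -> float:
    """One-sided exact test: P(X >= successes) for X ~ Binomial(trials, chance)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 0.0 <= chance <= 1.0:
        raise ValueError("chance must be a probability")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def rejects_null(successes: int, trials: int, chance: float,
                 alpha: float = 0.05) -> bool:
    """True when the score is unlikely under pure guessing at level alpha."""
    return binomial_p_value(successes, trials, chance) < alpha


if __name__ == "__main__":
    # Hypothetical result: 26 of 100 four-option items answered "correctly".
    print(rejects_null(26, 100, chance=0.25))
```

Rejecting this null only shows above-chance behavior on the benchmark. It says nothing by itself about whether the behavior reflects the human attribute the benchmark is named after.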
The problem is that frontier AI has made obvious discipline harder to maintain. The systems are compelling. They write, summarize, argue, translate, code, role-play, and adapt to context in ways that feel socially meaningful. The more human the interaction feels, the more effort it takes to keep the hypothesis separate from the experience.
A null assumption does not require hostility toward AI. It simply requires that researchers define what would count as evidence against their preferred interpretation. If a model’s “moral reasoning” disappears when the prompt is reframed, when answer choices are shuffled, when the benchmark leaks into training data, or when the task is moved outside familiar linguistic patterns, then perhaps the measured thing was not morality.
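The answer-shuffling check mentioned above can be sketched in a few lines of Python. Everything here is a toy under stated assumptions: `Model` is a hypothetical callable that returns the index of its chosen option, and the two stand-in models are invented to show the contrast, not drawn from any real system or from the paper.

```python
from itertools import permutations
from typing import Callable, Sequence

# Hypothetical interface: a model takes a question and ordered options
# and returns the index of the option it selects.
Model = Callable[[str, Sequence[str]], int]


def permutation_accuracy(model: Model, question: str,
                         choices: Sequence[str], correct: str) -> float:
    """Fraction of all option orderings in which the model picks `correct`."""
    orderings = list(permutations(choices))
    hits = sum(
        1 for order in orderings if order[model(question, order)] == correct
    )
    return hits / len(orderings)


def position_biased_model(question: str, choices: Sequence[str]) -> int:
    """Toy stand-in: always picks the first option and ignores content."""
    return 0


def content_model(question: str, choices: Sequence[str]) -> int:
    """Toy stand-in: picks the option that mentions apologizing."""
    for index, choice in enumerate(choices):
        if "apologize" in choice:
            return index
    return 0


if __name__ == "__main__":
    question = "You broke a friend's mug. What should you do?"
    options = ["apologize and replace it", "deny it",
               "blame the cat", "say nothing"]
    for model in (position_biased_model, content_model):
        score = permutation_accuracy(model, question, options, options[0])
        print(model.__name__, score)
```

A model that looks "moral" in the original ordering but drops to the guessing rate once options are shuffled was tracking position, not content. Passing the shuffle is still only a necessary condition, since the second toy model passes it by keyword matching.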
The same applies to claims of anxiety, empathy, self-awareness, and agency. A model can produce anxious language without being anxious. It can generate compassionate advice without compassion. It can discuss itself without possessing a self. These are not anti-AI slogans; they are distinctions any serious account of artificial cognition has to preserve.
The challenge for the field is that weaker claims are less exciting. “This model reliably simulates a subset of human-labeled empathetic responses under benchmark conditions” is not as headline-friendly as “AI shows empathy.” But the boring sentence is closer to science, and the exciting one is closer to marketing.

Microsoft’s Own Research Culture Is Complicating Microsoft’s AI Story

One of the most interesting aspects of the paper is its authorial context. De Wynter is associated with Microsoft and the University of York, and Microsoft is not a neutral bystander in the AI boom. The company has invested heavily in OpenAI, built Copilot into Windows and Microsoft 365, and pushed generative AI across developer tools, search, productivity software, and enterprise services.
That makes the paper more credible, not less. It is not written from the cheap seats by someone with no stake in whether AI systems matter. It comes from inside the broader ecosystem that benefits when AI appears capable, agentic, and indispensable. A Microsoft-affiliated scientist using Age of Empires II to puncture sloppy anthropomorphism is exactly the kind of internal tension mature technology companies need.
It also reflects a wider split in the AI conversation. On one side, product teams are incentivized to make systems feel more human and more autonomous. On the other, safety researchers, evaluators, and many working scientists keep warning that social fluency is not the same thing as grounded understanding. The resulting message to users can be incoherent: trust the assistant enough to put it in your workflow, but do not trust it so much that you forget it is not a person.
Windows users have already seen the practical version of this tension. Copilot is not sold merely as a text box. It is framed as a helper that can understand intent, summarize activity, draft messages, surface settings, and eventually act across applications. The more deeply these systems integrate into the operating system, the more important it becomes to describe their abilities precisely.
If Microsoft wants AI in the shell, the browser, Office documents, developer environments, and security tooling, it cannot afford a research culture that casually blurs simulation and possession. Enterprise buyers will not manage risk around vibes. They need to know what a system does, where it fails, how it is evaluated, and whether claims about its “reasoning” or “understanding” are operationally meaningful.

The Old RTS Becomes a Better Mirror Than the Chatbot

There is a reason this paper traveled farther than a normal methodology critique. Age of Empires II has cultural weight. It is not just a substrate; it is a shared memory for a generation of PC players who learned hotkeys, economy management, and historical campaigns long before “AI alignment” became a boardroom phrase.
That nostalgia makes the paper approachable, but it also makes its argument harder to dismiss. A NAND gate made of game objects is concrete. A perceptron built in the scenario editor is something readers can picture. “Goats as signal carriers” does more explanatory work than another abstract paragraph about computational theory.
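For readers who prefer code to goats, the same picture can be drawn in Python. This is a minimal sketch and not the paper's in-game build: it assumes a single threshold unit with two inputs, trained by the classic perceptron learning rule on the NAND truth table. The names `NAND_TABLE`, `predict`, and `train_perceptron` are illustrative.

```python
from typing import List, Sequence, Tuple

# Training data: every input pair for NAND with its target output.
NAND_TABLE: List[Tuple[Tuple[int, int], int]] = [
    ((0, 0), 1),
    ((0, 1), 1),
    ((1, 0), 1),
    ((1, 1), 0),
]


def predict(weights: Sequence[int], bias: int, x: Sequence[int]) -> int:
    """Threshold unit: fire (1) when the weighted sum plus bias is positive."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(data, max_epochs: int = 100) -> Tuple[List[int], int]:
    """Perceptron learning rule with learning rate 1 and zero initial weights."""
    weights, bias = [0, 0], 0
    for _ in range(max_epochs):
        errors = 0
        for x, target in data:
            delta = target - predict(weights, bias, x)
            if delta != 0:
                errors += 1
                # Nudge weights and bias toward the target on each mistake.
                weights = [w + delta * xi for w, xi in zip(weights, x)]
                bias += delta
        if errors == 0:  # a full pass with no mistakes means we are done
            break
    return weights, bias


if __name__ == "__main__":
    weights, bias = train_perceptron(NAND_TABLE)
    for x, target in NAND_TABLE:
        print(x, "->", predict(weights, bias, x), "target", target)
```

Because NAND is linearly separable, the rule is guaranteed to converge, so a single trained unit can itself act as the universal gate discussed earlier.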
The humor also lowers the temperature of a debate that has become weirdly moralized. AI skeptics sometimes overstate their case, acting as if every claim of machine capability is a scam. AI boosters sometimes do the opposite, treating skepticism as denialism in the face of inevitable intelligence. De Wynter’s paper sidesteps that trench warfare by creating a third object: a ridiculous but functioning computational system that forces both sides to sharpen their language.
For skeptics, the lesson is not “LLMs are just goats.” That is a satisfying meme but a poor analysis. Modern LLMs are vastly more capable, useful, and economically consequential than a handmade Age of Empires II circuit. Their failures are serious precisely because their successes are real enough to put them into workflows.
For boosters, the lesson is more painful. If your evidence for a human-like attribute survives only when the system speaks in polished natural language, you may be measuring the persuasive power of the interface. The goat computer asks whether the same conclusion would hold when the human cues are removed.

The Real Risk Is Policy Built on Personhood Theater

The stakes are not confined to academic word choice. Public policy, platform rules, product safety, labor decisions, and user behavior are increasingly shaped by assumptions about what AI systems are. If the systems are framed as quasi-persons, people will treat them as companions, colleagues, witnesses, advisers, or authorities in ways that may exceed their actual reliability.
That matters for consumer safety. People already use chatbots for emotional support, medical triage, legal guidance, workplace conflict, romantic advice, and decisions involving vulnerable family members. Even when products include disclaimers, the conversational form invites a kind of trust that a search result or spreadsheet does not.
It matters for enterprise governance as well. A company that believes an AI assistant “understands” policy may delegate review tasks that require accountability. A security team that believes a model “reasons” about threats may overvalue its explanations. A manager who believes an agent “knows” a workflow may miss the brittle chain of permissions, prompts, retrieval steps, and API calls underneath.
The anthropomorphic frame also creates strange ethical distractions. If the public debate becomes fixated on whether current chatbots deserve empathy, it can crowd out more immediate human concerns: data extraction, labor displacement, surveillance, dependency, error propagation, and the concentration of infrastructure power. The illusion of a mind can pull attention away from the institutions deploying the system.
De Wynter’s paper does not solve these problems, but it offers a useful diagnostic. Before asking whether an AI system has a human-like property, ask whether the same test would force you to attribute that property to a bizarre computational implementation with none of the social cues. If it would, the test is probably measuring behavior under a definition so broad that it has lost the human meaning it was supposed to capture.

The Goats Leave AI Researchers With Less Room to Hide

The cleanest takeaway from the Age of Empires II experiment is not that machine intelligence is impossible, nor that all human-like AI claims are nonsense. It is that extraordinary labels need stronger methods than a fluent transcript and a suggestive benchmark. The paper’s comedy works because it exposes how much of the debate depends on which interface happens to be in front of us.
  • A working computational system inside Age of Empires II can make formal arguments about substrate-independence feel less abstract and more uncomfortable.
  • Claims that LLMs possess empathy, anxiety, morality, or self-awareness should distinguish observable imitation from internal attributes.
  • AI evaluations should begin from a null assumption and define what evidence would count against the claim being tested.
  • De Wynter’s Microsoft affiliation makes the critique more significant because it comes from within a company aggressively commercializing generative AI.
  • For Windows users and IT administrators, the practical issue is not whether AI sounds human, but whether its capabilities are reliable, bounded, auditable, and honestly described.

A Better AI Debate Starts by Retiring the Magic Words

The strange beauty of de Wynter’s paper is that it turns a 27-year-old strategy game into a philosophical solvent. The goats do not answer whether machines can think, whether consciousness is computable, or whether future AI systems may deserve moral consideration. They do something more immediately useful: they make sloppy claims look sloppy.
That is badly needed in 2026. The AI industry is still racing to make systems more agentic, more intimate, more embedded, and more difficult to separate from ordinary software. Microsoft, Google, OpenAI, Anthropic, Meta, and countless smaller companies all have incentives to make these systems feel less like tools and more like collaborators. Some of that will be useful. Some of it will be manipulative. Much of it will be hard to govern unless the vocabulary improves.
The vocabulary starts with restraint. Say a model generates empathetic text, not that it has empathy. Say it follows patterns associated with moral reasoning, not that it has morals. Say it maintains a self-referential dialogue, not that it has a self. These distinctions may sound fussy until a product team, regulator, school district, or hospital builds policy on the fuzzier version.
Age of Empires II will not become the next AI platform, and the goats will not replace GPUs. But as a piece of scientific satire, de Wynter’s experiment lands because it makes the invisible premise visible. If the same argument that makes a chatbot sound human also makes a medieval RTS map sound human, the problem is not with the goats. It is with the argument.
The next phase of AI will need better benchmarks, clearer claims, and less willingness to let the user interface do the metaphysical work. The systems are already powerful enough to matter without pretending they are people. If a handful of goats wandering through a custom scenario can remind the field of that, then Age of Empires II has contributed something more valuable than another castle drop: it has forced AI research to look at its own reflection before declaring that reflection alive.

References

  1. Primary source: Windows Central
    Published: 2026-06-20T13:55:08.430867
  2. Related coverage: researchgate.net
  3. Related coverage: themoonlight.io
  4. Related coverage: aitoolly.com
  5. Related coverage: themodelwire.com
  6. Related coverage: korben.info
  7. Related coverage: izmirakademi.org
  8. Related coverage: itbooks.ir
 
