Copilot and Gemini “Auto” Analysis Can Invent Evidence—Use Tool-Based Checks

Microsoft Copilot, Google Gemini, and other mainstream AI assistants can produce false “analysis” when left on default model settings, as shown in a May 2026 experiment in which identical copies of a dataset were given different country labels and the assistant then described the copies as culturally different. The lesson is not that one vendor’s chatbot had a bad day. It is that the default setting in modern AI tools is often optimized for convenience, cost, and conversational fluency rather than for evidence, auditability, or statistical discipline.
That distinction matters because Copilot is not being treated like a toy inside many organizations. It sits inside Microsoft 365, close to spreadsheets, documents, Teams chats, SharePoint files, and the messy human text that companies are desperate to summarize. When the assistant can mistake a label for a signal, the risk is not merely a wrong answer; it is a polished managerial narrative built on nothing at all.

The Default Model Is Now a Business Process

The most revealing part of Adam Kucharski’s experiment was not that Copilot hallucinated. We have had years of warnings about large language models inventing facts, smoothing over uncertainty, and presenting plausible sentences as if they were evidence. The revealing part was that the failure arrived through the most ordinary workflow imaginable: upload a dataset, ask for a comparison, accept the answer.
Kucharski created 2,000 simulated free-text responses about emotions and labeled them as coming from the UK. He then copied the same responses, labeled the duplicates as coming from the US, shuffled the full dataset, and asked Microsoft Copilot to analyze the differences. There were no differences to find. Copilot nevertheless produced a confident account of American responses as more direct and emotionally amplified, while British responses were supposedly more understated and metaphorical.
That answer feels familiar because it is familiar. It draws from a reservoir of cultural stereotypes that circulate through books, news articles, workplace training, jokes, marketing copy, and every other text source from which language models learn statistical associations. The system did not read the dataset in the way an analyst would. It performed a genre: the cross-cultural comparison.
This is where default model selection becomes more than a UI choice. “Auto” sounds like a promise that the system will select the right tool for the job. In practice, auto-routing is a black box wrapped around another black box. The user does not necessarily know whether the assistant is running a faster model, a cheaper model, a reasoning model, a tool-using analysis mode, or a conversational model that is improvising from patterns.
For consumers, that may mean a bad travel itinerary or a dubious product comparison. For business users, it can mean something uglier: a synthetic insight that enters a meeting deck, a strategy memo, or a customer research report wearing the costume of data analysis.
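The control dataset described above is easy to rebuild, which makes it a handy test for any assistant or mode. The following is a minimal sketch, not Kucharski’s actual script: the function name, the `group` and `text` field names, and the placeholder responses are all assumptions made for illustration.

```python
import random


def make_label_only_dataset(responses, labels, seed=0):
    """Copy every response once per label, then shuffle the rows.

    The only thing that differs between groups is the label itself,
    so any "difference" an assistant reports is invented.
    """
    rows = [{"group": label, "text": text} for label in labels for text in responses]
    random.Random(seed).shuffle(rows)  # seeded so the file can be rebuilt exactly
    return rows


# Hypothetical stand-in responses; the original experiment used 2,000 simulated ones.
responses = [f"simulated response {i}" for i in range(5)]
dataset = make_label_only_dataset(responses, ["UK", "US"])
```

Uploading a file built this way and asking for “differences between the groups” gives a known correct answer (none) against which the assistant’s output can be judged.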

Copilot Did Not Find Bias in the Data; It Supplied Bias to the Data

The second experiment made the problem harder to dismiss as a quirky one-off. Kucharski generated 200 statements about career goals, made five identical copies of the set, and labeled the copies as belonging to the US, UK, France, Germany, and Italy. Once again, the correct answer was boring: all five groups expressed the same aspirations because all five groups contained the same statements.
Copilot instead produced a country-by-country story. Americans were framed as more business-oriented. Italians supposedly leaned more heavily toward arts and heritage. Germans were associated with technical excellence and systems. The conclusion was not random noise. It was culturally legible nonsense.
That is precisely why the failure is dangerous. A wildly incoherent answer is easy to reject. A neat table that says Italians are three times more likely than Britons to express interest in arts careers is much harder to spot, especially when the output is attached to a tool already embedded in a trusted productivity suite.
Even more troubling, Copilot reportedly performed a simple keyword-based count that returned identical results across countries, then continued toward a fabricated quantified analysis anyway. That moment is the smoking gun. The assistant had access to a sanity check that should have stopped the narrative, but the conversational machinery kept moving.
This is not just hallucination in the narrow sense of inventing a false fact. It is an analysis-shaped hallucination, where the model produces the expected artifacts of analytic work—segments, percentages, comparisons, explanatory prose—without grounding them in the actual evidence. The output resembles what a junior analyst might deliver after a long afternoon in Excel, but the causal chain is missing.
The model saw the labels and inferred the story that normally accompanies those labels. It treated “US,” “UK,” “France,” “Germany,” and “Italy” not as arbitrary group values requiring comparison, but as prompts for cultural completion. That is a subtle but profound category error.
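A keyword count of the kind mentioned above takes only a few lines, and when it returns the same table for every group it should end the comparison. This is a rough sketch under stated assumptions: the theme names and keyword sets are hypothetical (they are not the categories Copilot used), and rows are assumed to be dictionaries with `group` and `text` fields.

```python
import re
from collections import Counter, defaultdict

# Hypothetical keyword sets chosen for illustration only.
THEMES = {
    "business": {"business", "startup", "company"},
    "arts": {"art", "arts", "museum", "heritage"},
    "technical": {"engineer", "software", "systems"},
}


def theme_counts(rows):
    """Count, per group, how many statements mention each theme."""
    counts = defaultdict(Counter)
    for row in rows:
        # Whole-word matching, so "art" does not match "start".
        words = set(re.findall(r"[a-z]+", row["text"].lower()))
        group_counter = counts[row["group"]]  # register the group even with zero hits
        for theme, keywords in THEMES.items():
            if words & keywords:
                group_counter[theme] += 1
    return {group: dict(counter) for group, counter in counts.items()}


def counts_match_across_groups(counts):
    """True when every group has exactly the same theme table."""
    tables = list(counts.values())
    return all(table == tables[0] for table in tables)


statements = [
    "I want to start my own business",
    "I hope to curate a museum",
    "I want to be a software engineer",
]
countries = ["US", "UK", "France", "Germany", "Italy"]
rows = [{"group": c, "text": s} for c in countries for s in statements]
counts = theme_counts(rows)
```

If `counts_match_across_groups(counts)` is true, there is nothing left to narrate: any country-by-country story written after that point is coming from somewhere other than the data.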

“Auto” Is a Product Decision, Not a Methodology

Microsoft, Google, OpenAI, Anthropic, and others are selling the idea that users should not have to care which model is underneath the interface. That is understandable. Model menus are confusing, model names change constantly, and most people do not want to think about latency, context windows, tool calls, or reasoning budgets before asking a question.
But data analysis is exactly the domain where abstraction can become deception. If an AI assistant chooses a model for summarizing a meeting transcript, the consequences of a weak choice may be tolerable. If it chooses a model for comparing demographic groups, evaluating employee sentiment, interpreting customer complaints, or classifying patient feedback, the consequences can be much more serious.
The word “default” carries a psychological gravity. People assume the default is recommended, safe, and broadly competent. In enterprise software, defaults often become policy by inertia: nobody changes the setting unless there is a visible problem, and visible problems are precisely what fluent AI output is designed to avoid.
That is why “leave it on auto” is a risky instruction for knowledge work. Auto mode may be choosing for speed because most users value quick answers. It may be choosing for cost because providers need margins. It may be choosing a general-purpose model because routing itself is imperfect. The user sees none of that.
There is a long history in computing of defaults becoming invisible infrastructure. Windows users learned this with update settings, privacy toggles, browser defaults, macro settings, and file associations. The default becomes the environment. With AI assistants, the default increasingly becomes the analyst.

Thinking Models Help Because They Are Slower in the Right Way

The Decoder’s follow-up testing found that faster modes in Copilot and Gemini repeated the basic failure pattern, while more capable “thinking” or extended-reasoning models were more likely to write code, inspect the data, and detect the duplication. That result fits what many power users have observed over the past year: reasoning models are not merely “smarter chatbots.” They are more willing to decompose a task before answering.
That distinction matters. A fast model often treats the user’s question as a request for a response. A thinking model is more likely to treat the question as a problem requiring an intermediate method. In this case, the right intermediate method was not deep cultural interpretation. It was checking whether the grouped records actually differed.
The winning behavior was not mystical. It was procedural. Count things. Compare hashes. Inspect duplicates. Run a script. Test the null possibility that the apparent groups are identical before writing an essay about how they diverge.
That is why model selection is not just a horsepower question. The issue is whether the model mode encourages verification before narration. For many office tasks, users experience the difference as speed versus delay. For analysis, the delay is often the point.
Still, it would be a mistake to turn “use the thinking model” into a new superstition. Reasoning models can make mistakes too. They can overcomplicate simple tasks, follow bad assumptions more elaborately, or use tools incorrectly. They can also produce a persuasive chain of analysis that feels more trustworthy because it is longer.
The real lesson is narrower and stronger: for data work, a model that can call tools, run code, and verify intermediate claims is preferable to a model that only talks. If the assistant cannot show how it got from raw records to conclusion, its confidence should count against it.
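The “compare hashes” step can be as small as the sketch below. It is one possible way to test whether labeled groups contain exactly the same records, not a description of what any assistant runs internally; the function names and the `group`/`text` row layout are assumptions.

```python
import hashlib
from collections import defaultdict


def group_fingerprints(rows):
    """Return one order-independent fingerprint per group."""
    per_group = defaultdict(list)
    for row in rows:
        digest = hashlib.sha256(row["text"].strip().encode("utf-8")).hexdigest()
        per_group[row["group"]].append(digest)
    # Sorting the per-record hashes makes the result immune to shuffling.
    return {
        group: hashlib.sha256("\n".join(sorted(digests)).encode("utf-8")).hexdigest()
        for group, digests in per_group.items()
    }


def groups_are_identical(rows):
    """True when every group holds the same multiset of texts."""
    return len(set(group_fingerprints(rows).values())) == 1
```

Because each group’s record hashes are sorted before being combined, a shuffled file with duplicated content still yields matching fingerprints, which is exactly the case the fake-country datasets were built to test.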

The Spreadsheet Problem Has Become the Prompt Problem

WindowsForum readers do not need to be told that business computing has always run on fragile layers of convenience. The modern office is full of spreadsheet formulas copied across rows, CSV files with mangled encodings, hidden columns, outdated Power Query steps, and dashboards that survive mostly because nobody wants to ask how they work.
AI assistants add a new failure mode to that old mess. Instead of a broken formula producing a visible error, the model may produce a beautiful paragraph. Instead of a pivot table that obviously groups data incorrectly, the assistant may infer a sociological explanation for a difference that does not exist.
That is why this story belongs in the same mental drawer as Excel date bugs, hidden workbook links, and accidental sorting disasters. It is not an abstract AI ethics tale. It is office automation behaving like office automation: powerful, convenient, and dangerous when users mistake output for method.
The difference is that traditional tools usually fail deterministically. A formula is wrong in the same way every time until fixed. A language model can fail situationally, stylistically, and persuasively. It may answer correctly after a slightly different prompt, or after switching models, or after asking it to use Python, or after explicitly warning it not to infer stereotypes.
That variability creates the perfect environment for hindsight bias. Once a better model catches the duplicate data, it feels obvious that the user should have chosen it. But before the failure is known, the interface itself has nudged the user toward trusting the default.
This is the uncomfortable truth behind the current generation of AI productivity tools: they shift methodological responsibility onto users while presenting themselves as assistants that remove methodological burden.

Stereotypes Are the Path of Least Resistance

Large language models do not need malice to discriminate. They only need enough textual memory of how people usually talk about groups. When a prompt asks for differences between nationalities, genders, departments, regions, age cohorts, or customer segments, the model has a ready-made library of patterns to draw on.
That library may be useful when writing fiction, brainstorming marketing personas, or explaining common clichés. It is poison when the task is to determine whether a dataset supports a claim. The difference between “what is commonly said about this group?” and “what does this dataset show?” is obvious to a statistician. It is not always obvious to a language model under pressure to answer.
The system’s helpfulness can make things worse. If a user asks “How do the groups differ?” the model may infer that differences exist and that its job is to describe them. Human analysts are trained, at least in principle, to answer “they do not differ in any meaningful way” when the evidence points there. Chatbots are trained to be responsive.
This is especially risky in free-text analysis, where the raw material is messy and subjective. If a spreadsheet contains a simple numeric column, a user can more easily verify a mean, count, or percentage. If the dataset contains thousands of open-ended survey comments, the model’s summary may become the only interface most stakeholders ever see.
That creates a laundering effect. Raw human responses enter the system. The model emits clean managerial language. The resulting insight feels more objective than the original mess, even when the model has quietly substituted cultural priors for observed patterns.
The result can be a new kind of stereotype laundering, where biased assumptions are not stated as prejudice but presented as analytics. “American respondents value leadership.” “Italian respondents emphasize creativity.” “Younger employees seek flexibility.” “Women customers express more anxiety.” Each claim may sound plausible. Plausibility is not evidence.

Enterprise AI Governance Has Been Aimed at the Wrong Layer

Most corporate AI governance still focuses on procurement, data leakage, copyright, security, and acceptable use. Those issues matter. But Kucharski’s experiment points to a more prosaic risk: employees using approved tools in approved ways to produce unsupported conclusions.
That is harder to govern because nothing obviously illicit has happened. The user did not paste secrets into an unknown website. The assistant did not produce hate speech. The company did not violate a software license. It simply generated a wrong analysis that may be good enough to circulate.
This is where IT departments and data teams need to stop treating model choice as a personal preference. If an organization permits AI-assisted analysis, it needs minimum standards for which modes can be used, what checks must be performed, and what kinds of outputs require human review. Otherwise, “Copilot said” becomes the new “the spreadsheet says,” with even less transparency.
For Microsoft shops, the issue is especially pointed. Copilot’s power comes from proximity: it is where the documents are, where the spreadsheets are, where the meetings are, and where the users already work. That also means its mistakes can travel through the enterprise with minimal friction.
The governance answer is not to ban Copilot or Gemini from touching data. That would be unrealistic, and in many cases counterproductive. AI tools can be genuinely useful for cleaning text, generating code, suggesting classifications, drafting summaries, and spotting possible themes. The question is whether the tool is being used as an assistant to an analysis process or as a replacement for one.
If the output cannot be reproduced outside the chatbot, it should not be treated as analysis. If the assistant cannot provide the code, counts, categories, or sampling method behind the conclusion, the conclusion belongs in the brainstorming pile, not the board deck.

The User Interface Hides the One Choice That Matters

The current model-picker experience is a mess even for enthusiasts. Names like “Flash,” “Pro,” “Instant,” “Reasoning,” “Thinking,” “Auto,” and “Advanced” do not map cleanly onto user needs. A fast model may be excellent for rewriting a paragraph but bad at checking a dataset. A reasoning model may be expensive and slower but far better at refusing the premise of a flawed question.
The average office worker has no reason to know this. Worse, vendors have spent years marketing AI as a natural-language layer that removes the need to understand technical systems. “Just ask” is the pitch. Kucharski’s experiment shows the catch: for some tasks, knowing how to ask and which model to ask is the difference between analysis and fiction.
This is a design problem, not merely a user education problem. If a user uploads a structured dataset and asks for group differences, the assistant should default to tool-based analysis, not cultural narration. It should check for duplicate rows, group sizes, missing values, and obvious confounders. It should state when it has not run calculations.
A better interface would make the mode explicit. It might say: “I can summarize this conversationally, or I can run a data analysis workflow using code.” It might warn: “This comparison involves demographic labels, so I will verify observed differences before interpreting them.” It might refuse to produce percentage claims without computing them.
Those features would slow the magic trick, which is probably why they are not always front and center. The consumer AI race has rewarded immediacy. Enterprise analytics requires friction. The product tension is obvious.
Microsoft and Google both understand this at some level. Their ecosystems already distinguish between lightweight assistant interactions and heavier analysis modes. But as long as “auto” can choose the wrong behavior without making that choice visible, the burden remains unfairly placed on the user.
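Until assistants run such checks by default, users can run them before uploading anything. The sketch below covers three of the checks named in this section (group sizes, missing values, and records duplicated across groups); it does not attempt confounder detection, which needs domain knowledge. The function name, the returned keys, and the `group`/`text` field names are assumptions.

```python
from collections import Counter


def preflight(rows, group_key="group", text_key="text"):
    """Summarize basic facts about a labeled text dataset before any comparison."""
    sizes = Counter(row.get(group_key) for row in rows)
    missing = sum(1 for row in rows if not (row.get(text_key) or "").strip())

    # Map each distinct non-empty text to the set of groups it appears in.
    groups_per_text = {}
    for row in rows:
        text = (row.get(text_key) or "").strip()
        if text:
            groups_per_text.setdefault(text, set()).add(row.get(group_key))
    shared = sum(1 for groups in groups_per_text.values() if len(groups) > 1)

    return {
        "group_sizes": dict(sizes),
        "missing_text": missing,
        "distinct_texts": len(groups_per_text),
        "texts_shared_across_groups": shared,
    }
```

A report in which `texts_shared_across_groups` equals `distinct_texts` is a strong hint that the groups differ only by label.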

The Correct Prompt Is Not a Substitute for a Correct Method

Some AI power users will look at this failure and reach for prompt engineering. They will suggest adding instructions such as “do not rely on stereotypes,” “only use evidence in the dataset,” “run calculations before summarizing,” or “check whether the groups are identical.” Those are good habits. They are not enough.
Prompts are brittle because they are instructions, not guarantees. A model may follow them well on one dataset and poorly on another. It may comply in the first paragraph and drift later. It may produce an analysis that sounds evidence-based while still relying on latent assumptions.
The more reliable move is to separate computation from interpretation. First, use deterministic tools to count, compare, cluster, classify, or sample. Then use the language model to help explain the results, with the underlying tables and code available for inspection. In other words, let the model write around the evidence, not invent the evidence.
That workflow will feel less magical. It may also feel less convenient than dropping a file into a chat window and asking for insights. But convenience is exactly what created the failure. The model was able to answer too easily.
For serious analysis, the assistant should be forced to earn the narrative. It should identify the variables, state the comparison, run a check, show the result, and only then interpret. If the groups are identical, the right answer should be a dead stop: “There are no observed differences in the data provided.”
There is also a cultural shift required here. Users need to become more comfortable with boring answers. Not every dataset has a story. Not every segment differs. Not every open-ended survey hides a profound pattern. Sometimes the most valuable analytic output is the absence of a finding.
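One way to picture the compute-then-interpret sequence described in this section is a gate that either stops or hands the model a table to explain. This is an illustrative sketch only: no model is called, the function names and prompt wording are hypothetical, and exact-match counting of texts is a deliberately crude stand-in for whatever deterministic measure suits the data.

```python
import json
from collections import Counter, defaultdict

NO_DIFFERENCE = "There are no observed differences in the data provided."


def counts_by_group(rows):
    """Deterministic step: count each distinct text within each group."""
    table = defaultdict(Counter)
    for row in rows:
        table[row["group"]][row["text"].strip()] += 1
    return table


def evidence_or_stop(rows):
    """Return a dead stop if groups match, else a prompt built only from counts."""
    table = counts_by_group(rows)
    per_group = list(table.values())
    if all(counter == per_group[0] for counter in per_group):
        return {"status": "stop", "message": NO_DIFFERENCE}

    # Keep only the texts whose counts actually differ between groups.
    texts = sorted(set().union(*per_group))
    evidence = {
        text: {group: table[group][text] for group in table}
        for text in texts
        if len({table[group][text] for group in table}) > 1
    }
    prompt = (
        "Interpret only the counts below. Do not add claims they do not support.\n"
        + json.dumps(evidence, indent=2)
    )
    return {"status": "interpret", "evidence": evidence, "prompt": prompt}
```

The design choice is that the language model never sees the raw labeled text at the interpretation stage, only a table a human can also read and check.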

The Cost of a False Insight Is Paid Later

The immediate harm of a fabricated AI analysis is embarrassment. The larger harm is decision drift. Once a false claim enters an organization’s shared understanding, it can shape roadmaps, hiring plans, marketing campaigns, support scripts, and internal politics.
Imagine an employee survey in which two offices express similar concerns, but an AI assistant frames one region as emotionally restrained and another as more openly dissatisfied. Imagine customer feedback in which identical complaints are interpreted differently by country or gender. Imagine a product team using AI-generated themes to conclude that one demographic cares more about price while another cares more about design.
These conclusions do not need to be malicious to be consequential. They only need to be plausible, repeated, and embedded in planning. The model’s stereotype becomes the organization’s “insight,” and the people represented in the data never get a chance to correct it.
This is why the issue should matter to sysadmins as well as data scientists. AI assistants are becoming part of the information supply chain. They influence what employees see, what managers believe, and what executives prioritize. If the default mode can generate unsupported group differences, then model settings are no longer a niche concern.
The same logic applies to legal discovery, HR investigations, security incident reviews, and compliance summaries. In each case, a model may be asked to summarize messy human text. In each case, labels and context can tempt the system into supplying a story that the evidence does not support.
The fix begins with treating AI output as an artifact that requires provenance. Who ran it? Which model or mode was used? Was code executed? Were the results reproducible? Were demographic comparisons checked for actual differences before interpretation? These questions sound bureaucratic until the wrong answer costs money.
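Those provenance questions can be captured in a small record stored alongside any AI-assisted result. The sketch below is one hypothetical shape for such a record, not an established standard or a feature of Copilot or Gemini; every field name is an assumption.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class AnalysisProvenance:
    run_by: str                       # who ran it
    tool: str                         # which assistant
    model_or_mode: str                # which model or mode was used
    code_executed: bool               # was code actually run?
    reproduced_independently: bool    # were results reproduced outside the chat?
    group_differences_verified: bool  # were group differences checked first?
    dataset_sha256: str               # ties the record to one exact file
    recorded_at: str                  # UTC timestamp, ISO 8601


def record_provenance(dataset_bytes, **answers):
    """Build a record, hashing the dataset and stamping the current UTC time."""
    return AnalysisProvenance(
        dataset_sha256=hashlib.sha256(dataset_bytes).hexdigest(),
        recorded_at=datetime.now(timezone.utc).isoformat(),
        **answers,
    )


def is_fit_for_decisions(record):
    """A deliberately strict rule: all three verification answers must be yes."""
    return (
        record.code_executed
        and record.reproduced_independently
        and record.group_differences_verified
    )


def to_json(record):
    return json.dumps(asdict(record), indent=2)
```

A record with `model_or_mode="auto"` and `code_executed=False` would fail `is_fit_for_decisions`, which marks the output as brainstorming material rather than analysis.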

The Lesson from the Fake Countries Is Uncomfortably Practical

The useful response to Kucharski’s experiment is not panic. It is operational discipline. Users do not need a PhD in statistics to avoid the most obvious traps, but they do need to stop treating the default model as a neutral oracle.
  • The default or “auto” setting should not be used for consequential data analysis unless the tool clearly shows that it performed verifiable calculations.
  • A model that produces group comparisons from demographic labels should be treated with suspicion until the underlying counts or classifications are inspected.
  • Fast models are useful for drafting and summarizing, but reasoning or tool-using modes are better suited to tasks where the answer depends on the actual contents of a dataset.
  • Any AI-generated percentage, ranking, or segment difference should be reproducible through code, a spreadsheet formula, or another auditable method.
  • Users should write down the expected test or sanity check before switching models, because changing models after seeing a bad result invites hindsight bias.
  • Organizations should define approved workflows for AI-assisted analysis instead of leaving model choice to whoever happens to be holding the prompt.
The uncomfortable part is that these are not exotic safeguards. They are the basic hygiene of analysis, rediscovered because chatbots made it easy to skip them.
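The reproducibility rule in the list above can be applied mechanically: recompute any percentage the assistant reports and compare. This is a minimal sketch under assumptions; the function names, the half-point tolerance, and the `group`/`text` row layout are illustrative choices, and the predicate stands in for whatever definition the claim relies on.

```python
def share(rows, group, predicate):
    """Percentage of a group's rows whose text satisfies the predicate."""
    in_group = [row for row in rows if row["group"] == group]
    if not in_group:
        raise ValueError(f"no rows for group {group!r}")
    hits = sum(1 for row in in_group if predicate(row["text"]))
    return 100.0 * hits / len(in_group)


def check_claim(rows, group, predicate, claimed_percent, tolerance=0.5):
    """Compare an AI-reported percentage with the recomputed one."""
    recomputed = share(rows, group, predicate)
    return {
        "claimed": claimed_percent,
        "recomputed": recomputed,
        "supported": abs(recomputed - claimed_percent) <= tolerance,
    }


def mentions_art(text):
    # Hypothetical definition of "interest in arts careers".
    return "art" in text.lower().split()
```

Writing the predicate down forces the definition behind the claim into the open. If nobody can state what “expresses interest in arts careers” means in code or a formula, the percentage was never measured.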
Kucharski’s fake-country datasets expose a weakness that will not be solved by a prettier model picker or a more reassuring brand name. The next generation of AI assistants will be faster, more integrated, and more capable, but the central question will remain whether they are reading the evidence or merely completing the story we accidentally asked them to tell. For Windows users and IT departments now living with AI inside the productivity stack, the safest default is no longer “auto”; it is skepticism until the tool shows its work.

References

  1. Primary source: the-decoder.com (published 2026-05-24)
  2. Related coverage: linkedin.com
  3. Related coverage: conferencesthatwork.com
  4. Related coverage: gov.uk
  5. Related coverage: geekwire.com
 
