A sweeping, journalist‑led audit coordinated by the European Broadcasting Union (EBU) and run operationally by the BBC has found that leading AI chatbots routinely misrepresent news: in the study’s sample, 45% of AI-generated answers contained at least one significant issue, with pervasive sourcing and factual errors that cross languages and markets.
Background / Overview
Public‑facing generative AI assistants are no longer curiosities — they are a first stop for many people searching for news. That shift prompted public broadcasters across Europe and beyond to ask a practical question: when journalists submit real newsroom queries to popular assistants, do those systems return accurate, well‑sourced, and context‑aware answers?
The resulting project, published at an EBU event and widely covered by international outlets, pooled editorial teams from 22 public‑service media organizations across 18 countries. Reviewers tested more than 3,000 replies from four widely used assistants — OpenAI’s ChatGPT, Microsoft’s Copilot, Google’s Gemini, and Perplexity AI — scoring each reply for accuracy, sourcing/provenance, contextualisation, and the separation of fact from opinion.
This audit scaled and extended an earlier BBC experiment from February 2025 that had already flagged high error rates when assistants summarised BBC articles. The newer, multilingual investigation repeats that editorial‑first approach at larger scale and with broader geographic coverage.
What the study measured — method and scope
The review deliberately adopted newsroom standards rather than automated “truth‑bench” metrics. Key elements:
- Professional journalists and subject‑matter editors reviewed each response.
- The test set covered news‑focused, time‑sensitive queries designed to expose temporal staleness and attribution failures.
- Responses were assessed in 14 languages, ensuring the problem was not limited to English.
- Scoring categories included:
- Factual accuracy (dates, names, figures)
- Sourcing and provenance (does the assistant cite correct, verifiable sources?)
- Context and nuance (is the framing preserved, or decontextualised?)
- Fact vs opinion (is editorialising clearly labelled?)
Headline findings
The audit returned several stark, operational numbers that product teams, IT managers, and news consumers should treat as red flags:
- 45% of AI answers contained at least one significant issue — enough to materially mislead a reader.
- 31% of responses had serious sourcing failures: missing, incorrect, or misleading attributions.
- 20% of responses included major factual inaccuracies, including temporal errors and fabricated details.
- Problems persisted across platforms and languages; no major assistant was immune.
Notable examples and concrete errors
The auditors documented concrete, high‑impact mistakes that illustrate the failure modes:
- An assistant named the wrong head of government in a tested scenario — Olaf Scholz was still listed as German Chancellor even after he had been replaced in the auditors’ timeframe. Another case misidentified NATO leadership, naming Jens Stoltenberg after Mark Rutte had taken over. These are not trivial copy edits; they are current‑affairs errors that change the basic facts of governance.
- The BBC’s earlier February 2025 audit found instances where assistants altered or fabricated quotes and introduced incorrect dates and figures when summarising BBC reporting.
Who performed worst — model‑level variation
The study found measurable variation across assistants, particularly on source attribution:
- Google’s Gemini showed the highest rate of sourcing problems in the sample: auditors flagged around 72% of Gemini responses for significant sourcing defects in the Reuters‑reported breakdown of results. That elevated sourcing failure rate drove much of Gemini’s poorer overall performance in the dataset.
- Other assistants also failed on various fronts, but none escaped having a meaningful share of problematic outputs.
Why the assistants fail: the technical anatomy of news errors
The audit’s editorial findings map neatly onto known technical failure modes inside retrieval‑augmented language systems:
- Noisy retrieval: web‑grounded assistants fetch documents from the open web where stale, low‑quality, or manipulative pages exist. If retrieval returns weak evidence, the synthesised answer is likely to be flawed (a minimal staleness check is sketched after this list).
- Probabilistic generation and hallucination: large language models are statistical sequence predictors, not verifiers. In the absence of tight provenance signals, they may invent plausible‑sounding facts or compress nuance into incorrect declaratives.
- Post‑hoc or “reconstructed” citations: some products generate a fluent answer first and attach citations later — a mismatch that auditors repeatedly flagged as misleading when the cited page did not support the claim.
- Optimisation trade‑offs: vendors tune for helpfulness and fewer refusals; that reduces cautious “I don’t know” answers but increases confidently stated errors.
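None of the audited products exposes its internals, but the temporal errors above are easy to reason about with a toy model. The sketch below assumes a hypothetical pipeline that surfaces retrieved documents with publication dates; `RetrievedDoc`, `KNOWN_CHANGES` and `evidence_is_stale` are illustrative names, not any vendor’s API. It simply flags answers whose entire evidence set predates a known change, the pattern behind the Scholz and Stoltenberg mistakes.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class RetrievedDoc:
    url: str
    published: date  # publication date taken from the page or the search index
    text: str

# Illustrative only: dates on which a previously correct fact changed.
# In practice this would come from an editorial knowledge base, not a hard-coded dict.
KNOWN_CHANGES = {
    "german_chancellor": date(2025, 5, 6),
    "nato_secretary_general": date(2024, 10, 1),
}

def evidence_is_stale(topic: str, docs: list[RetrievedDoc]) -> bool:
    """Return True when every retrieved document predates the last known
    change for the topic, i.e. the evidence cannot reflect the current fact."""
    change = KNOWN_CHANGES.get(topic)
    if change is None or not docs:
        return False  # nothing recorded to compare against
    return all(doc.published < change for doc in docs)

docs = [
    RetrievedDoc("https://example.org/a", date(2024, 11, 2), "..."),
    RetrievedDoc("https://example.org/b", date(2025, 1, 15), "..."),
]
if evidence_is_stale("german_chancellor", docs):
    print("All evidence predates the leadership change: hedge or refuse.")
```

A production system would obviously need reliable publication timestamps and a maintained list of fact changes, which is precisely the provenance infrastructure the audit found missing.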
Why this matters for Windows users, IT teams and publishers
The WindowsForum readership spans desktop users, IT pros, and newsroom technologists. Translated into operational terms, the audit’s findings mean:
- For desktop and enterprise deployments that integrate assistants (for example, Microsoft Copilot features in Windows and Microsoft 365), misstated news or stale guidance can propagate inside corporate comms or incident reports if outputs aren’t human‑verified.
- For users who rely on assistants for fast situational awareness (e.g., crisis updates, legal changes, or security advisories), an apparent headline with wrong facts or absent sourcing is particularly dangerous.
- For publishers and rights holders, sourcing errors that misattribute or distort original reporting create reputational and legal risk; the study expressly warned that AI misattribution jeopardises public trust in established news brands.
Policy reaction, campaigns and calls for regulation
The EBU and participating broadcasters framed the findings as a systemic problem requiring regulatory and product‑level fixes. Key policy developments and industry responses included:
- The EBU urged EU and national regulators to enforce existing laws on information integrity, digital services and media pluralism, citing the urgency of independent monitoring.
- Broadcasters launched a collaborative campaign called “Facts In: Facts Out”, calling on AI companies to ensure that when trusted news is used as an input, assistants return accurate, attributable facts — captured in the slogan: if facts go in, facts must come out.
- Public statements from EBU leadership underscored civic risk: “When people don’t know what to trust, they end up trusting nothing at all,” said Jean Philip De Tender, highlighting the potential chilling effect on democratic participation.
Vendor responses and the limits of product fixes
Major AI vendors have acknowledged hallucination and sourcing challenges in public communications, and product teams have rolled out partial mitigations: in‑line citations, controls for web crawling, and confidence indicators. But the audit demonstrates three limits:
- Partial mitigations without provenance guarantees still produce misleading outputs when retrieval is noisy.
- Citation hygiene — listing a link — does not guarantee the link supports the claim if retrieval and composition are mismatched. Auditors repeatedly flagged “ceremonial citations” (a simple claim‑support check is sketched after this list).
- Product incentives (maximise helpfulness and completion) can be at odds with cautious refusal behaviour; fixing this is both a technical and policy challenge.
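To make the “ceremonial citation” problem concrete, here is a deliberately naive sketch of a support check: it asks whether the names, dates and figures a claim depends on actually appear in the cited passage. The heuristic and function names are ours, not part of any product; real systems would use entailment models rather than token overlap, but the failure mode it catches is the one auditors flagged by hand.

```python
import re

def key_tokens(text: str) -> set[str]:
    """Extract capitalised words and numbers: a crude proxy for the
    names, dates and figures a claim actually depends on."""
    return set(re.findall(r"\b(?:[A-Z][a-zA-Z]+|\d[\d.,%]*)\b", text))

def citation_supports(claim: str, cited_snippet: str, threshold: float = 0.6) -> bool:
    """Flag 'ceremonial' citations: if too few of the claim's key tokens
    appear in the cited text, the link probably does not support the claim."""
    claim_keys = key_tokens(claim)
    if not claim_keys:
        return True  # nothing checkable; fall back to human review
    overlap = claim_keys & key_tokens(cited_snippet)
    return len(overlap) / len(claim_keys) >= threshold

claim = "Mark Rutte became NATO Secretary General in October 2024."
snippet = "Mark Rutte took over as NATO Secretary General on 1 October 2024."
print(citation_supports(claim, snippet))                              # True: the snippet backs the claim
print(citation_supports(claim, "NATO held a summit in Washington."))  # False: the link is ceremonial
```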
Practical recommendations for technical teams and power users
The study is a wake‑up call. The following is a practical checklist for Windows users, IT administrators, and newsroom technologists who depend on assistants.
- For individual users:
- Always check the assistant’s sources (where provided) before using the information in a decision.
- Treat assistant output as a first draft: verify dates, names, and figures against primary sources.
- Prefer publisher direct or official site links for time‑sensitive matters (health, legal, security).
- For IT and operations teams:
- Implement human review gates for any assistant outputs used in corporate communications or incident response.
- Log assistant responses and provenance for auditability (a minimal logging sketch follows this checklist).
- Configure assistant modes that prioritise caution over verbosity for news queries — e.g., “verified‑news mode.”
- Enforce internal policies that require explicit source confirmation for any external claims generated by assistants.
- For publishers and newsroom leaders:
- Adopt machine‑readable reuse controls and clear licensing terms so publisher‑verified content can be used as canonical inputs.
- Participate in independent audits and publish transparency reports on how content is indexed by AI crawlers.
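For the logging and review‑gate items above, a minimal audit wrapper could look like the sketch below. It assumes a generic assistant client that returns an answer plus a list of source URLs; `ask_assistant`, the JSONL path and the record fields are placeholder choices, not tied to any specific product API.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable

AUDIT_LOG = Path("assistant_audit.jsonl")

def logged_query(ask_assistant: Callable[[str], dict], prompt: str, reviewer: str) -> dict:
    """Send a prompt through any assistant client and append the full
    exchange (prompt, answer, cited sources, reviewer) to a JSONL audit log.
    `ask_assistant` is assumed to return {"answer": str, "sources": [str, ...]}."""
    response = ask_assistant(prompt)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "answer": response.get("answer", ""),
        "sources": response.get("sources", []),
        "human_reviewer": reviewer,  # review gate: who must sign off
        "verified": False,           # flipped only after human verification
    }
    with AUDIT_LOG.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

# Usage with a stand-in client (replace the lambda with your real assistant call):
record = logged_query(lambda p: {"answer": "...", "sources": []},
                      "Summarise today's security advisories", reviewer="it-ops")
```

Keeping the raw exchange plus a `verified` flag gives auditors and compliance teams something concrete to review, which is the point of the checklist.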
Editorial analysis — strengths, weaknesses, and systemic risks
Strengths of the EBU/BBC approach:
- The journalist‑first methodology is the right tool for evaluating news integrity: expert reviewers, real newsroom queries, and multilingual scope give operationally relevant diagnostics.
Weaknesses and systemic risks the audit exposes:
- Current generation assistants still conflate retrieval confidence with factuality, producing fluent but unsupported claims — a classic confidence‑without‑evidence failure mode.
- Citation reconstruction and the absence of strict provenance linking mean a link can masquerade as verification without supporting the actual claim.
- Optimization for helpfulness creates product incentives that discourage cautious silence; without policy or design corrections, error rates will persist as assistant adoption grows.
- The combination of increasing assistant adoption among younger users and high per‑query error rates means that individual errors scale into major civic misperceptions. The Reuters Institute usage figures (7% overall, 15% of under‑25s) make this especially important for election cycles, public health messages, and civic processes.
What independent auditing and regulation should require
To move from patchwork fixes to durable safety, the report and commentators converge on several policy levers:
- Mandatory provenance reporting: assistants should return machine‑readable source metadata and timestamps for every claim (one possible data shape is sketched after this list).
- Regular independent audits against newsroom standards, with multilingual sampling and public summaries of error rates.
- Product design rules that prioritise refusal or hedged answers where provenance is weak, rather than confident but unsupported claims.
- Government enforcement of existing laws on platform accountability, information integrity and media pluralism — a step the EBU signalled publicly.
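There is no agreed standard yet for what “machine‑readable source metadata” should look like. The sketch below shows one plausible per‑claim shape, with field names chosen for illustration rather than drawn from any existing specification.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class ClaimProvenance:
    """Machine-readable provenance for a single claim in an assistant answer."""
    claim: str              # the sentence or fact as stated
    source_url: str         # where the supporting evidence was retrieved
    source_published: str   # ISO 8601 timestamp of the source
    retrieved_at: str       # when the assistant fetched it
    evidence_snippet: str   # the exact passage the claim relies on
    confidence: float       # system confidence in [0, 1]

@dataclass
class AuditedAnswer:
    answer: str
    claims: list[ClaimProvenance] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)

example = AuditedAnswer(
    answer="Friedrich Merz is the German Chancellor.",
    claims=[ClaimProvenance(
        claim="Friedrich Merz is the German Chancellor.",
        source_url="https://example.org/politics/chancellor",
        source_published="2025-05-06T12:00:00Z",
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        evidence_snippet="Friedrich Merz was sworn in as Chancellor on 6 May 2025.",
        confidence=0.92,
    )],
)
print(example.to_json())
```

Whatever the eventual standard, per‑claim timestamps and evidence snippets are what independent auditors would need to verify answers at scale.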
A pragmatic path forward for product teams
For engineers building assistant features into operating systems, browsers or corporate tooling, the audit suggests a pragmatic engineering checklist:
- Strengthen retrieval filters with trust signals (publisher authority, recency, canonical sources).
- Attach evidence snippets — the exact sentence or paragraph the assistant used — rather than a standalone link.
- Provide confidence scores and explicit provenance for every claim.
- Offer a conservative “verified news” mode that refuses to answer when provenance is insufficient (a minimal refusal gate is sketched after this list).
- Expose an easy feedback channel so publishers can flag and request corrections or de‑indexing of harmful misattributions.
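A “verified news” refusal gate might look like the following sketch. The trusted‑domain set, the two‑day freshness window and the source dict shape are placeholder assumptions, not recommendations from the study; the point is only that refusal is a design decision that can be implemented as a gate in front of the drafted answer.

```python
from datetime import datetime, timedelta, timezone

# Placeholder trust list and thresholds; in production these would come
# from editorial policy, not a hard-coded set.
TRUSTED_DOMAINS = {"bbc.co.uk", "reuters.com", "ebu.ch"}
MAX_AGE = timedelta(days=2)
REFUSAL = ("I can't verify this from a trusted, recent source. "
           "Please check the publisher directly.")

def verified_news_answer(draft: str, sources: list[dict]) -> str:
    """Return the drafted answer only if at least one cited source is both
    trusted and recent; otherwise refuse rather than guess.
    Each source dict is assumed to carry 'domain' and 'published' (aware datetime)."""
    now = datetime.now(timezone.utc)
    for s in sources:
        fresh = now - s["published"] <= MAX_AGE
        trusted = s["domain"] in TRUSTED_DOMAINS
        if fresh and trusted:
            return draft
    return REFUSAL

sources = [{"domain": "example-blog.net",
            "published": datetime(2024, 3, 1, tzinfo=timezone.utc)}]
print(verified_news_answer("The new policy takes effect tomorrow.", sources))
# Prints the refusal: the only source is neither trusted nor recent.
```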
Conclusion — how to use AI for news responsibly
The EBU/BBC audit is an operational alarm bell: AI assistants can speed discovery and help users process the flood of information, but they are not yet reliable stand‑alone news sources. The combination of high adoption among younger users and systemic sourcing and factual errors creates a material risk to public trust and civic discourse.
For Windows users, IT teams, and editors, the immediate approach is pragmatic and defensive: use AI assistants for leads, aggregation and ideation — but require human verification and provenance checks before acceptance. At the product and policy level, independent auditing, provenance standards, and refusal heuristics are the essential next steps to prevent AI from becoming an amplifying layer for misinformation.
The technology’s promise remains real: assistants can unlock accessibility, speed and new discovery models. Realising that promise responsibly requires engineers, publishers, and regulators to act in concert so that facts in truly mean facts out.
Source: Emegypt Major study reveals AI chatbots struggle with delivering accurate news