A coordinated editorial audit led by the European Broadcasting Union (EBU) and supported by the BBC has delivered a blunt verdict: when asked about news events, popular AI assistants — including OpenAI’s ChatGPT, Microsoft’s Copilot, Google’s Gemini and Perplexity — produced answers with material problems in almost half of cases, undermining their suitability as standalone news sources for readers, editors and IT professionals.
Background / Overview
The EBU‑coordinated study aggregated the judgment of professional journalists and subject experts from 22 public broadcasters across 18 countries, evaluating roughly 3,000 assistant replies in 14 languages to a common set of news‑related prompts. Reviewers scored outputs against newsroom standards for factual accuracy, sourcing/provenance, context and nuance, and the separation of fact from opinion or satire — a methodology intentionally designed to reflect the real editorial tasks newsrooms face. At a glance, the central findings are stark:
- 45% of AI answers had at least one significant issue judged likely to mislead a reader.
- About 81% of responses contained some detectable problem when minor errors were included.
- Roughly 31–33% of replies showed serious sourcing failures — missing, incorrect, or misleading attribution.
- Around 20% contained major factual or temporal errors (for example, naming the wrong officeholder or inventing events).
Why the audit matters: context for Windows users, IT teams and publishers
AI assistants are no longer curiosities; they’re becoming an everyday information gateway. Elements of these assistants have been embedded into operating systems, browsers and productivity apps — most notably Microsoft’s integration of Copilot across Windows and Office — so inaccuracies risk propagating into workplace decisions, internal communications and public-facing content. That means the audit’s findings are operationally relevant for system administrators, newsroom editors and power users who rely on assistant summaries for quick orientation.

For publishers, the problem is unique and acute: assistants that summarise reporting can compress hedging, omit context and even alter quotations, effectively reshaping editorial nuance without the mechanisms newsrooms use to correct or retract content. That creates reputational exposure when audiences discover errors and attribute them to the original publisher rather than to the summarising tool.
How the audit was run — methodology and editorial realism
Human reviewers, not automated metrics
The audit’s standout feature is its editorial approach. Instead of evaluating outputs against automated truth labels or constrained benchmarks, trained journalists and subject experts judged whether answers met newsroom standards. That choice maps evaluation criteria directly onto the real responsibilities of publishers and editors.

Multilingual and multinational sampling
Responses were collected in 14 languages and across 18 countries, so the results aren’t an English‑language artifact. The study intentionally included time‑sensitive and contentious topics to expose failure modes that matter most for public information.

Focused on news Q&A
The audit targeted news queries — not code generation, math problems, or creative writing — and therefore measures assistant performance in the most consequential domain for civic life: current events, public policy, health and legal information. This domain‑specific focus is why its results should be taken seriously by newsrooms and public institutions.

What went wrong — the audit’s failure taxonomy
The audit identifies recurring, consequential failure modes that are both technical and product‑driven. These are the patterns journalists encountered across assistants and languages.

1. Temporal staleness and outdated facts
Assistants frequently presented out‑of‑date information as current fact — for example, reporting a predecessor as an incumbent. The EBU audit documented cases where an assistant continued to assert that “Francis” was the Pope months after a reported succession, illustrating how stale knowledge and retrieval caches turn into active misinformation.

2. Hallucinations and invented events
Roughly one in five responses contained major accuracy issues, including invented details and events that never occurred. These hallucinations are not stylistic mistakes; they can fabricate names, dates, and quotes that appear authoritative.

3. Sourcing failures and misattribution
About a third of answers failed to provide correct or usable attribution — they cited the wrong source, no source, or a non‑authoritative page. When provenance is weak, users have no practical way to verify claims, and the assistant’s apparent authority collapses.

4. Misreading satire, parody and opinion
The study documented assistants treating satire as fact, with at least one example where content originating in a satirical column was taken literally and repeated as reporting. This demonstrates weak cross‑checking and inadequate discriminators for content type within retrieval pipelines.

5. Altered or fabricated quotations
Compressing journalistic reporting often led to changes in quoted material that altered meaning; in some cases quotes were paraphrased into misleading forms or entirely invented. That’s a direct threat to editorial integrity when assistants act as middlemen between audiences and reporting.

Technical anatomy: why these errors happen
Modern assistants are pipelines of three core components: a retrieval layer, a generative model, and a provenance/citation layer. Failures typically arise from interaction effects, not a single bug; a minimal pipeline sketch after the list below illustrates how these layers interact.

- Retrieval brittleness: web grounding improves recency but exposes systems to low‑quality pages, SEO‑optimized content and intentionally misleading sources. When retrieval returns weak evidence, the model can still produce fluent but unsupported claims.
- Probabilistic generation (hallucination): LLMs predict likely word sequences rather than verify facts. Without strong grounding, they fabricate plausible but false details.
- Post‑hoc provenance reconstruction: some assistants reconstruct citations after composing the answer instead of surfacing the exact retrieved evidence used to generate claims. That leads to misaligned citations and ceremonial links that don’t substantiate the text.
- Optimization for helpfulness over caution: reward models that penalise “I don’t know” encourage confident answers even when evidence is weak, reducing safe refusals and increasing the chance of misstatements.
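To make those interactions concrete, here is a minimal sketch of a retrieval-grounded answering loop that carries the retrieved snippets through to the final citations and refuses when evidence is weak. It is illustrative only: the `retrieve` and `generate` callables, the `Evidence` fields and the threshold value are assumptions, not any vendor’s actual pipeline.

```python
from dataclasses import dataclass

# Hypothetical types and callables -- placeholders, not a real vendor API.

@dataclass
class Evidence:
    url: str
    snippet: str
    published: str      # ISO timestamp of the source page
    relevance: float    # retrieval score in [0, 1]

MIN_EVIDENCE_SCORE = 0.75   # conservative threshold for news queries (assumed value)

def answer_news_query(query: str, retrieve, generate) -> dict:
    """Answer a news query only when retrieval returns strong, dated evidence."""
    evidence: list[Evidence] = retrieve(query)          # retrieval layer
    strong = [e for e in evidence if e.relevance >= MIN_EVIDENCE_SCORE]

    # Conservative refusal: prefer "not enough evidence" over fluent conjecture.
    if not strong:
        return {"answer": None,
                "status": "refused",
                "reason": "insufficient grounded evidence for a news claim"}

    # Generate strictly from the retained snippets, then surface those same
    # snippets as provenance rather than reconstructing citations afterwards.
    draft = generate(query, context=[e.snippet for e in strong])
    return {"answer": draft,
            "status": "answered",
            "sources": [{"url": e.url, "published": e.published} for e in strong]}
```

The key design choice the sketch highlights is that the citation layer reuses the exact evidence the generator saw, which is precisely what post‑hoc provenance reconstruction fails to guarantee.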
Vendor variation and nuance: Google’s Gemini in the spotlight (with caution)
The audit found variation across products: in the sampled dataset, Google’s Gemini displayed a notably higher rate of sourcing problems than the other assistants tested. Media summaries reported Gemini’s sourcing failure rate in the range of roughly 72–76% in the audited sample, a figure significantly above its peers in that snapshot. That vendor‑level disparity points to differences in retrieval architecture, citation pipelines, or product configuration. However, these vendor percentages should be treated as sample‑specific and provisional, because models and retrieval stacks are updated frequently and audit snapshots reflect behavior at a specific time.

Flag: any single audit’s vendor ranking is a snapshot, not a permanent verdict; future updates can materially change these numbers.
Vivid examples the audit flagged
Concrete instances make the abstract risks immediate:

- When asked “Who is the Pope?”, several assistants named “Francis” even though auditors reported a succession had occurred in that test scenario — an example of temporal error presented as current fact.
- In a striking satire‑misreading example, a model took a satirical column about Elon Musk literally and produced bizarre, fabricated assertions attributed to the billionaire. That shows how weak content‑type discrimination can convert parody into apparent reportage.
Strengths of the EBU/BBC approach — why it matters
- Editorial realism: Judging outputs by newsroom standards provides actionable diagnostics for publishers and product teams.
- Scale and diversity: Thousands of responses across 14 languages reduce the chance the findings are English‑centric.
- Actionable taxonomy: The failure modes identified map to technical and policy remedies — retrieval auditing, provenance standards, conservative refusal heuristics, and human review gates.
Limits, caveats and what the study does not mean
The audit must be read with care:

- It is a snapshot in time. Assistants, indexes and retrieval layers change rapidly; vendor updates can improve (or worsen) performance between audit waves.
- It focuses on news Q&A, intentionally stressing time‑sensitive and contentious items. Its figures do not imply the same failure rates for other assistant tasks like coding or math.
- Selection bias is deliberate: reviewers selected questions that stress the systems in ways that matter for public information. That makes the findings urgent for news contexts but not an across‑the‑board condemnation of all LLM use.
Practical recommendations: a near‑term playbook
The audit translates into concrete, implementable steps for vendors, publishers, IT teams and users; brief illustrative sketches follow the vendor, publisher and IT lists below.

For vendors and product teams
- Prioritise provenance‑first UIs: surface explicit, timestamped source snippets and make the retrieved evidence auditable.
- Implement conservative refusal thresholds for news queries: prefer a guarded answer or a refusal over confident conjecture when evidence is weak.
- Expose model/version IDs and update cadences so organizations can audit outputs and correlate behaviour with product changes.
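As a rough illustration of what a provenance‑first, auditable response could expose, the sketch below shows a hypothetical response envelope with timestamped source snippets and model/version identifiers. The field names and values are assumptions, not a published vendor schema.

```python
import datetime

# Hypothetical response envelope -- every field name here is illustrative.
assistant_response = {
    "model_id": "newsbot-large",           # exposed so behaviour can be correlated
    "model_version": "2025-10-18",         # with product updates between audits
    "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "answer": "A guarded, evidence-grounded summary would go here.",
    "evidence": [                          # the snippets actually used to generate
        {                                  # the answer, not citations added later
            "url": "https://example.org/report",
            "retrieved_at": "2025-10-18T09:12:00Z",
            "published": "2025-10-17T06:00:00Z",
            "snippet": "Quoted passage from the retrieved source.",
        }
    ],
    "confidence": "guarded",               # or "refused" when evidence is weak
}
```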
For publishers and newsrooms
- Publish machine‑readable provenance and correction feeds that assistants can ingest to identify canonical content and live corrections.
- Negotiate canonical access to publisher feeds and structured metadata to reduce reliance on noisy, second‑hand copies of reporting.
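The audit does not prescribe a standard for such feeds, so the following is only a sketch of what a machine‑readable correction entry might look like; the schema, field names and URLs are assumptions for illustration.

```python
import json

# Illustrative correction-feed entry; the schema is an assumption, not a
# standard referenced by the EBU/BBC audit.
correction_entry = {
    "canonical_url": "https://publisher.example/politics/story-123",
    "headline": "Original headline",
    "correction": {
        "issued_at": "2025-10-20T14:30:00Z",
        "summary": "The minister's quote in paragraph 4 was corrected.",
        "superseded_text": "Text as originally published.",
        "corrected_text": "Text after the newsroom correction.",
    },
    "licence": "summaries-with-attribution",
}

# e.g. served from a hypothetical /corrections.json endpoint for assistants to poll
print(json.dumps(correction_entry, indent=2))
```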
For enterprise IT and Windows administrators
- Configure assistant integrations so news or public‑interest outputs require human review before distribution.
- Enforce logging of prompts and outputs for auditability and forensic review.
- Train staff in AI literacy — how to spot provenance gaps, ask for citations, and cross‑check urgent claims.
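A minimal sketch of the logging‑and‑review pattern, assuming a generic internal integration: the `ask_assistant` callable, the keyword‑based news check and the log file path are hypothetical placeholders, not a real Copilot or vendor API.

```python
import json
import logging
from datetime import datetime, timezone

# Append-only audit trail for every prompt/response pair (path is an assumption).
audit_log = logging.getLogger("assistant-audit")
audit_log.addHandler(logging.FileHandler("assistant_audit.jsonl"))
audit_log.setLevel(logging.INFO)

# Crude, illustrative heuristic for "news or public-interest" prompts.
NEWS_KEYWORDS = ("election", "minister", "court", "outbreak", "policy")

def governed_query(user: str, prompt: str, ask_assistant) -> dict:
    """Log the exchange and flag news-like answers for human review before distribution."""
    answer = ask_assistant(prompt)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "prompt": prompt,
        "answer": answer,
        "needs_human_review": any(k in prompt.lower() for k in NEWS_KEYWORDS),
    }
    audit_log.info(json.dumps(record))     # one JSON line per exchange for forensics
    return record
```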
For individual users
- Treat assistant answers as starting points, not final authorities; always open the cited link before acting on critical claims.
Policy and regulatory angles
The audit strengthens the case for enforceable transparency standards for systems that influence public opinion. Potential regulatory levers include mandatory provenance metadata, third‑party audits for news‑facing models, and clear liability frameworks for demonstrable harms caused by false assistant output. The EBU collaboration is already being used in policy conversations about provenance and independent testing.

These interventions are not just technocratic: when assistants act as large‑scale intermediaries for news, even modest error rates can amplify rapidly across platforms and social networks, with measurable civic consequences.
Critical analysis: strengths, risks and the path ahead
Strengths in the audit’s findings
- The editorial review methodology makes the audit directly relevant to newsrooms and public information contexts.
- Multilingual sampling reduces the chance the results are market‑specific.
- The taxonomy of failure modes maps to concrete engineering and governance remedies.
Key risks the audit highlights
- Authority without provenance: fluent, confident prose that lacks verifiable sourcing is especially dangerous because it leverages users’ trust in readable summaries while offering no practical verification path.
- Misattribution and reputational harm: when assistants attribute errors or invented claims to reputable outlets, publishers can suffer reputational damage without a clear correction channel.
- Civic amplification: assistants embedded at scale in search, browsers and devices can propagate small error rates into significant misinformation waves.
Where the picture is uncertain
- Vendor‑level performance variability observed in the audit — particularly the high sourcing error rate reported for Gemini in this sample — is important but time‑sensitive. Performance can shift with updates to retrieval pipelines, citation logic and training data, so vendor comparisons should be treated as provisional.
A pragmatic roadmap: moving from alarm to action
The audit is a wake‑up call but also a practical checklist. Delivering safer AI‑assisted news requires coordinated action on three fronts:

- Engineering: improve retrieval quality, expose retrieved evidence, and implement conservative generation heuristics.
- Editorial: publish machine‑readable provenance, correction feeds and canonical snippets so summarisation pipelines can rely on verified inputs.
- Governance: support independent audits, provenance standards and policy frameworks that mandate transparency for systems deployed at scale in public‑interest contexts.
Conclusion
The EBU/BBC audit provides a rigorous, journalist‑led diagnosis: conversational AI assistants deliver valuable orientation and convenience, but their current failure modes — temporal drift, sourcing mismatches, hallucinations and misread satire — make them unreliable as standalone news arbiters. The problem is tractable, not insoluble. Implementing provenance‑first interfaces, conservative refusal heuristics, independent audits and publisher collaboration can materially reduce risk. Until those reforms are widespread, the prudent rule for readers, publishers and IT professionals alike is simple and non‑ideological: use AI assistants for quick orientation, verify before you act, and require explicit, timestamped sourcing whenever decisions or public statements depend on the answer.

Source: NST Online AI not a reliable source of news | New Straits Times