The latest consumer-facing audits and public‑service studies paint a stark picture: mainstream AI assistants routinely make factual errors, misattribute sources, and present confident but unreliable guidance — problems that matter now that these systems are embedded in browsers, search results, and productivity features used by millions.
Background / Overview
AI-driven assistants — generative language models combined with retrieval layers — have moved from novelty toys into everyday tools. They now appear inside search engines (AI overviews), desktop assistants (Microsoft Copilot), and as standalone chat interfaces (ChatGPT, Gemini, Perplexity). That rapid adoption has prompted two complementary lines of public research in 2025: a consumer‑facing assessment by Which? that tested practical everyday queries, and a large journalist‑led audit coordinated by the European Broadcasting Union (EBU) and led by the BBC that stress‑tested assistants on news and current‑affairs questions. Both efforts reach the same basic conclusion: convenience has outpaced reliability.
The EBU/BBC project — titled News Integrity in AI Assistants — is one of the most methodologically rigorous audits to date. It asked PSM (public service media) journalists across 18 countries to submit real newsroom questions and to blind‑review more than 3,000 responses from ChatGPT, Microsoft Copilot, Google Gemini and Perplexity. Reviewers scored replies against newsroom standards: factual accuracy, sourcing/provenance, context and nuance, separation of fact from opinion, and quotation fidelity. The headline finding: about 45% of answers contained at least one significant issue.
Separately, Which? sampled consumer behaviour and evaluated six public assistants on 40 everyday consumer scenarios (finance, legal, health/diet, consumer rights and travel). Which? found widespread repeated factual errors, incomplete or overconfident advice, reliance on weak web sources, and guidance that pushed users toward paid services rather than free, reliable resources — a concerning pattern when users say they increasingly trust AI over traditional search. These consumer findings were reported by mainstream outlets summarizing Which?’s work.
What the major audits actually measured
The EBU/BBC newsroom audit — scope and methodology
- Geographic and linguistic breadth: 22 public broadcasters, 18 countries, 14 languages; more than 3,000 AI replies evaluated by trained journalists.
- Editorial standards: responses were judged using newsroom criteria rather than simple automated truth metrics — a decisive design choice that exposes editorial risks (misquotation, loss of nuance, mixing opinion with fact).
- Task selection: time‑sensitive, contested, or civically important news questions chosen to expose temporal staleness and provenance failures.
Key findings from the blind review:
- 45% of examined answers had at least one significant issue.
- 31% showed serious sourcing problems (missing, misleading, or incorrect attribution).
- 20% contained major accuracy issues (hallucinations or stale facts).
- Vendor variance: Google Gemini performed worst on sourcing in this sample (76% of Gemini’s replies flagged for significant issues in the study).
The Which? consumer test — practical queries and trust signals
Which? evaluated consumer‑oriented scenarios (40 common questions) and surveyed 4,189 UK adults about their AI use. The consumer report — as summarized in multiple press pieces — found:
- Widespread repeated factual errors across consumer assistants.
- Many answers used weak sources (old forum threads, low‑quality pages).
- A majority of respondents either trusted AI outputs to a significant degree or preferred AI results over standard web searches.
Why these assistants fail: the technical anatomy of the errors
Three recurring failure modes appear across the audits and technical reviews:
- Temporal staleness (outdated knowledge)
Many LLMs retain knowledge only up to their training/data cutoff unless they are properly connected to fresh retrieval layers or live APIs. Even with web access, stale caches, weak retrieval heuristics, or poor source prioritization can leave outputs out of date. The EBU audit documents examples where assistants named replaced officeholders or described events that did not occur.
- Hallucinations and invention (confabulation)
Generative models produce fluent text by predicting token sequences — they are not verifiers. When evidence is thin, the model may fabricate specifics (dates, quotes, legal details) that sound right. Journalists in the audits found invented quotes, misdated facts, and fabricated URLs. Independent technical literature also documents hallucination as a systemic issue for probabilistic LLMs.
- Sourcing and provenance failures
Even when a tool cites a source, that citation may be wrong, incomplete, or point to secondary/syndicated content instead of the primary reporting. The audits found missing or misleading attributions in roughly one‑third of responses — a classic editorial failure that undermines traceability and trust.
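These failure modes are tractable at the engineering level, which is why auditors keep pointing to grounding and verification layers. As a rough illustration (not any vendor's actual pipeline), the sketch below shows a minimal retrieve-then-verify gate in Python: a draft answer passes only if every extracted claim can be matched to a fresh, cited source snippet, and otherwise the system defers. The data structures, the freshness window, and the naive substring check are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class SourceSnippet:
    url: str                 # where the evidence came from (ideally the primary source)
    text: str                # retrieved passage used as evidence
    retrieved_at: datetime   # when the passage was fetched

@dataclass
class Claim:
    statement: str           # one factual assertion extracted from the draft answer
    evidence: list[SourceSnippet]

MAX_AGE = timedelta(days=7)  # illustrative freshness bar for time-sensitive topics

def supported(claim: Claim) -> bool:
    """Naive check: the claim must overlap at least one fresh snippet.

    A real verifier would use entailment/fact-checking models, not substring
    overlap; this only illustrates the retrieve-then-verify control flow.
    """
    now = datetime.now(timezone.utc)
    for snip in claim.evidence:
        fresh = (now - snip.retrieved_at) <= MAX_AGE
        overlaps = claim.statement.lower() in snip.text.lower()
        if fresh and overlaps:
            return True
    return False

def gate_answer(claims: list[Claim]) -> str:
    """Defer instead of answering when any claim lacks fresh, cited evidence."""
    unsupported = [c.statement for c in claims if not supported(c)]
    if unsupported:
        return ("Deferred: could not verify against current sources: "
                + "; ".join(unsupported))
    citations = {s.url for c in claims for s in c.evidence}
    return "Answer passes the grounding gate. Sources: " + ", ".join(sorted(citations))

# Tiny usage example with a placeholder source.
snippet = SourceSnippet(
    url="https://example.org/primary-report",  # placeholder URL
    text="The 2025 audit reviewed more than 3,000 responses.",
    retrieved_at=datetime.now(timezone.utc),
)
print(gate_answer([Claim("more than 3,000 responses", [snippet])]))
```

The point is the control flow rather than the matching logic: a production verifier would swap in a proper entailment or fact-checking model, but the deferral path is precisely what the audits found missing when evidence was thin or stale.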
Concrete examples and real risks
The journalists’ audits collected vivid exemplars that are not merely technical curiosities but can produce public‑harm outcomes:
- An assistant stated incorrect public‑health guidance about vaping and its role in smoking cessation, inverting the actual NHS position — a potentially harmful misrepresentation for health decisions.
- Instances where assistants named a replaced or deceased leader (for example, reporting “Pope Francis” as the current pontiff months after auditors had logged the succession) illustrate temporal staleness presented as current fact.
- Consumer scenarios flagged by Which? included incorrect tax allowances, faulty travel refund advice and financial guidance that left users open to expensive third‑party services. Those kinds of errors can cause measurable financial or legal harm for people who accept the assistant’s confident answer without verification.
Vendor responses and product context
Public statements from major vendors emphasize improvements and caution:
- Google highlights built‑in reminders in Gemini and recommendations to consult professionals for sensitive topics.
- Microsoft points to Copilot’s linked citations and encourages user verification while noting that Copilot synthesizes multiple web sources into a single answer.
- OpenAI stresses use of browsing tools and search features for source visibility and notes ongoing accuracy improvements in newer models.
What this means for Windows users, IT teams and enterprise deployments
AI assistants are no longer optional add‑ons to the desktop experience: Microsoft’s Copilot is available inside Windows and Microsoft 365 workflows, and many Windows users will encounter assistant‑generated summaries in Edge and other integrated surfaces. That raises three operational implications for Windows fans, administrators, and security teams:
- Trust calibration is required. Treat assistant outputs as drafts or starting points, not final answers for legal, financial or medical decisions. The EBU and Which? findings underline the need for human verification in high‑stakes contexts.
- Policy and configuration matter. Organizations should set explicit policies on when AI‑summarized content can be used (for example: never for regulatory compliance, legal advice, or clinical decisions), and control Copilot or browser AI features through group policy and product configuration where possible (a starting‑point sketch follows this list). Vendor controls and enterprise admin tooling should be reviewed and hardened before broad rollout.
- Training and workflows should assume verification. Desktop productivity gains from Copilot are real, but they must be balanced with verification workflows — add checklist items to standard operating procedures (SOPs) requiring a traceable source and human sign‑off for sensitive outputs.
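For administrators who want a concrete starting point on the configuration side, the sketch below uses Python's standard winreg module to read and set the per-user "Turn off Windows Copilot" policy value. The registry path and value name mirror the documented Group Policy setting, but treat both as assumptions to verify against current Microsoft documentation: Copilot's admin controls have changed across Windows releases, and Microsoft 365 Copilot is governed separately through tenant-level admin tooling.

```python
# Minimal sketch (Windows only): apply the per-user "Turn off Windows Copilot" policy.
# The registry path and value name mirror the Group Policy setting under
# User Configuration > Administrative Templates > Windows Components > Windows Copilot,
# but verify both against current Microsoft documentation before relying on them.
import winreg

POLICY_KEY = r"Software\Policies\Microsoft\Windows\WindowsCopilot"  # assumed path
POLICY_VALUE = "TurnOffWindowsCopilot"                               # assumed value name

def set_copilot_policy(disabled: bool = True) -> None:
    """Write the policy DWORD for the current user (1 = Copilot off, 0 = on)."""
    with winreg.CreateKeyEx(winreg.HKEY_CURRENT_USER, POLICY_KEY, 0,
                            winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, POLICY_VALUE, 0, winreg.REG_DWORD, 1 if disabled else 0)

def get_copilot_policy() -> int | None:
    """Return the current policy value, or None if the policy is not configured."""
    try:
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, POLICY_KEY) as key:
            value, _type = winreg.QueryValueEx(key, POLICY_VALUE)
            return value
    except FileNotFoundError:
        return None

if __name__ == "__main__":
    set_copilot_policy(disabled=True)
    print("Copilot policy value:", get_copilot_policy())
```

In managed environments the same setting is better deployed through Group Policy or Intune rather than per-machine scripts; the snippet is only meant to show what "configure conservatively" can look like in practice.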
Strengths worth preserving
While the audits highlight serious risks, the tools also deliver measurable benefits that explain their rapid uptake:
- Speed and productivity: AI assistants can draft, summarize, and synthesize information quickly — a net gain for many routine tasks. This is why users are adopting them as a first stop.
- Accessibility and discovery: For many users, an AI overview condenses complex information and helps navigate unfamiliar domains — provided the information is accurate and sources are visible.
- Feedback loops: Public audits and toolkits (for example, the EBU’s News Integrity in AI Assistants Toolkit) give vendors and journalists a structured way to test and improve systems and provide practical guidance for product teams. That collaborative model is a genuine industry good.
Critical analysis — strengths, weaknesses and the governance gap
Strengths in the research approach
- The EBU/BBC audit uses editorial standards and human expert review rather than narrow automated metrics. That makes the results operationally meaningful for newsrooms, enterprises, and public policy.
- The Which? consumer test privileges practical consumer use cases (finance, travel, legal), which are the very scenarios where errors have immediate consequences. Media coverage cross‑confirms those user‑facing failure modes.
Systemic weaknesses the audits expose
- Answer‑first product incentives. Many assistants favor producing a single confident answer that minimises friction; that design can bias systems toward plausible rather than verifiable outputs. The audits show that product UX choices (simplicity) can directly amplify misinformation risk.
- Opaque provenance. Citing is not the same as correct citation. The audit found many cases where links or attributions were missing, wrong or misleading — a core transparency failure.
- Regulatory and accountability gaps. Auditors call for vendor transparency (regular publishing of error rates by language and market) and stronger enforcement of information integrity rules; regulators have only started to respond in a patchwork way.
Governance recommendations (high level)
- Vendors should publish regular, machine‑readable metrics on accuracy and provenance by language/market and make independent audits routine (one possible report shape is sketched after this list).
- Platforms need to prioritize refusal or deferral for sensitive queries (health, legal, financial) unless evidence quality meets a strict bar.
- Policymakers should require traceable citations and provenance guarantees for answer‑first interfaces used in news and civic contexts.
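To make "machine-readable metrics" concrete, here is one possible shape for such a transparency report, broken down by language and market. The field names, JSON layout, and sample figures are invented for illustration; no vendor or regulator has standardized this format.

```python
# Illustrative only: one possible shape for a machine-readable accuracy/provenance
# report, segmented by language and market as the auditors recommend.
# All field names and numbers below are placeholders.
import json
from dataclasses import dataclass, asdict

@dataclass
class SegmentMetrics:
    language: str
    market: str
    sample_size: int                  # number of audited responses in this segment
    significant_issue_rate: float     # share with at least one significant issue
    sourcing_issue_rate: float        # share with missing/misleading/wrong attribution
    major_accuracy_issue_rate: float  # share with hallucinated or stale facts

report = {
    "assistant": "example-assistant",
    "model_version": "2025-10",
    "audit_period": "2025-Q3",
    "methodology_url": "https://example.org/audit-methodology",  # placeholder
    "segments": [
        asdict(SegmentMetrics("en", "UK", 500, 0.45, 0.31, 0.20)),  # placeholder figures
        asdict(SegmentMetrics("de", "DE", 420, 0.41, 0.28, 0.17)),  # placeholder figures
    ],
}

print(json.dumps(report, indent=2))
```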
Practical guidance: how Windows users and admins should respond now
Short checklist for users and IT teams to reduce harm and preserve productivity:
- Verify, don’t assume: always check AI‑generated facts and quotations against primary sources before acting on them.
- Configure conservatively: for enterprise Windows deployments, review Copilot and Edge AI settings; disable or restrict assistant features for regulated workflows.
- Preserve provenance: require that assistant outputs used for reporting or decision‑making include explicit, verifiable citations and timestamps.
- Train staff: include a mandatory verification step in SOPs for workflows that touch legal, financial, or clinical decisions.
- Report and escalate: if an assistant repeatedly misattributes or fabricates facts, report the failure to your vendor contact and log the incident for vendor auditing (a minimal record format is sketched below).
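The "preserve provenance" and "report and escalate" items can share a single lightweight artifact: a structured record of what the assistant said, which sources it cited, who verified it, and when. The sketch below is one minimal way to capture that; the field names and the JSON Lines log format are illustrative choices, not a standard.

```python
# Minimal sketch of a provenance / incident record for assistant outputs.
# Field names and the JSON Lines log format are illustrative choices, not a standard.
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AssistantOutputRecord:
    assistant: str            # e.g. "Copilot" or "Gemini", as configured locally
    prompt: str               # the question that was asked
    answer_excerpt: str       # the part of the answer actually relied upon
    citations: list[str]      # URLs the assistant provided (may be empty)
    verified_by: str          # human who checked the claim against primary sources
    verification_outcome: str # "confirmed", "corrected", or "fabricated"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def log_record(record: AssistantOutputRecord, path: str = "assistant_audit.jsonl") -> None:
    """Append one record per line so the log stays trivially machine-readable."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

# Example: logging a sourcing failure for later escalation to the vendor.
log_record(AssistantOutputRecord(
    assistant="example-assistant",
    prompt="What is the current UK tax-free allowance?",   # example query
    answer_excerpt="The allowance is ... (stated without a source)",
    citations=[],
    verified_by="j.smith",
    verification_outcome="corrected",
))
```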
Limitations, caveats and unverifiable claims
- The audits and press reporting are consistent on the big picture: AI assistants still get facts wrong often enough to be concerning. EBU/BBC figures (45% significant issues; 31% sourcing failures; 20% major accuracy problems) are corroborated by multiple major outlets, and the EBU has published the underlying toolkit and report to support reproducibility.
- The Which? consumer data (40 questions, 4,189 respondents) is widely reported in press summaries, but the original Which? release should be consulted for the scoring rubric and item‑level judgements. Press coverage is consistent, yet — as with any secondary reporting — the primary document remains the authoritative source for detailed methodology; treat the consumer‑test specifics as credible, but check Which?’s full report before relying on granular numbers in procurement or policy contexts.
- Vendor product changes happen rapidly. An assistant’s factuality profile can change with a single model update or retrieval reconfiguration, so audits are snapshots. Continuous monitoring and frequent independent testing are therefore essential.
The road ahead — engineering, editorial and regulatory solutions
Fixing these systemic problems will require coordinated action across three domains:
- Engineering: improve retrieval quality, strengthen grounding mechanisms, adopt verification loops (retrieve‑then‑verify), and enforce conservative refusal for sensitive queries (a toy refusal gate is sketched after this list). Research into factuality detection and retrieval‑enhanced generation needs production investment.
- Editorial/product design: shift incentives away from a single “definitive” answer and toward transparent multi‑source overviews that surface uncertainty and provenance by default. Toolkits like the EBU’s offer concrete rubrics for what a good news answer looks like.
- Governance: require transparency reporting, independent auditing, and targeted regulation in public‑interest cases (news, health, legal). The scale and civic importance of these systems makes self‑regulation insufficient.
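As a toy illustration of "conservative refusal for sensitive queries", the sketch below routes health, legal and financial questions to a deferral response unless a hypothetical evidence-quality score clears a strict threshold. The keyword routing and the evidence_score placeholder stand in for real classifiers and calibrated evidence scoring; only the control flow is meant seriously.

```python
# Toy illustration of conservative refusal/deferral for sensitive queries.
# The keyword lists and evidence_score() are stand-ins for real classifiers
# and calibrated evidence scoring; only the control flow is the point.
SENSITIVE_KEYWORDS = {
    "health": ["dose", "symptom", "vaping", "medication", "diagnosis"],
    "legal": ["contract", "liability", "visa", "refund rights", "tenancy"],
    "financial": ["tax allowance", "isa", "pension", "investment", "loan"],
}
EVIDENCE_THRESHOLD = 0.8  # illustrative bar; stricter for sensitive domains

def sensitive_domain(query: str) -> str | None:
    """Return the sensitive domain a query touches, if any (naive keyword match)."""
    q = query.lower()
    for domain, keywords in SENSITIVE_KEYWORDS.items():
        if any(k in q for k in keywords):
            return domain
    return None

def evidence_score(query: str) -> float:
    """Placeholder: a real system would score retrieved sources for quality and freshness."""
    return 0.5  # pretend the retrieved evidence is mediocre

def answer_or_defer(query: str) -> str:
    domain = sensitive_domain(query)
    if domain and evidence_score(query) < EVIDENCE_THRESHOLD:
        return (f"This looks like a {domain} question. I can't verify the answer against "
                f"high-quality sources right now; please consult the relevant official "
                f"guidance or a qualified professional.")
    return "Proceed to normal grounded answering."

print(answer_or_defer("What is my tax allowance this year?"))
```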
Conclusion
The headline is uncomfortable but inescapable: AI assistants are useful and powerful, but they are still fallible in ways that can matter for money, health, law and public information. Public‑service audits from the BBC and EBU, plus consumer testing reported by Which?, converge on the same diagnosis — frequent errors, sourcing failures and confident misstatements are not rare edge cases but recurring failure modes. That diagnosis does not argue for abandoning generative AI; it argues for disciplined, engineering‑led improvement, stricter product design choices that prioritize provenance and refusal, and strong human workflows that treat AI outputs as provisional. For Windows users and IT professionals, the practical playbook is simple: keep using AI tools for productivity gains, but build verification into every step of any decision that matters. The technology is valuable — but until provenance and factuality are fixed at scale, trust must be earned, not assumed.
Source: AOL.com AI tools are making ‘repeated factual errors’, major new research warns