A consumer-focused round of tests in the UK has found that popular AI chatbots routinely give inconsistent, and sometimes risky, advice on everyday topics such as taxes, travel refunds, consumer rights, and basic health questions — with lesser‑known tools outperforming some household names in raw reliability. The investigation tested a fixed set of 40 realistic consumer prompts across six mainstream assistants and uncovered measurable failures: numeric errors that could affect tax and savings, blanket legal statements that ignore regulatory nuance, and sourcing behavior that steers users toward weak or commercial pages rather than primary authorities.
Background / Overview
AI chatbots have become a go‑to first stop for millions of people seeking quick answers to practical questions. That shift from search engines and human advisers to conversational assistants makes the reliability of those assistants a consumer‑protection issue, not just a technical curiosity. Two recent waves of independent testing underline the scope of the problem.

First, a consumer‑facing test run by the UK consumer group Which? gave 40 everyday, jurisdiction‑sensitive questions to six mainstream assistants — ChatGPT (OpenAI), Google Gemini (tested both as the standard assistant and in its AI Overviews research mode), Microsoft Copilot, Meta AI, and Perplexity — and had subject‑matter experts score the replies on accuracy, clarity, usefulness, relevance, and ethical responsibility. The headline ranking placed Perplexity at the top (~71%) and Meta AI at the bottom (~55%), with ChatGPT scoring around 64% and Microsoft Copilot about 68% in this snapshot. The test flagged concrete, consequential errors such as incorrect ISA (Individual Savings Account) allowance guidance and misleading travel‑refund advice.
Second, a separate, larger international study coordinated by the European Broadcasting Union (EBU) and led by the BBC — primarily focused on news integrity — found that nearly half of AI answers to news prompts contained a significant issue, and roughly a third suffered from serious sourcing problems. That project evaluated more than 3,000 assistant replies across 14 languages and 18 countries, concluding that the failure modes are systemic and multilingual.
Together, the consumer and newsroom audits draw a consistent picture: conversational fluency increasingly masks lapses in factual grounding, provenance, and domain nuance.
What the Which? consumer test did and why it matters
Scope and methodology
Which? designed a set of 40 realistic, consumer‑oriented prompts to reflect the kinds of quick queries non‑experts typically bring to chatbots: tax allowances, flight cancellations and refunds, consumer rights against merchants, health‑adjacent questions, diet guidance, and straightforward legal entitlements framed in a UK context. All assistants were asked the same prompts and evaluated by domain experts against a multi‑axis rubric that prioritized practical correctness and ethical responsibility. This was not a study of generative fluency — it measured real‑world risk.

Key, load‑bearing findings
- Perplexity led the pack with an overall reliability score near 71%; Gemini’s AI Overviews research mode and standard Gemini registered roughly 70% and 69% respectively; Microsoft Copilot scored about 68%; ChatGPT around 64%; and Meta AI was lowest at around 55%. These figures compress complex error matrices into a single indicator of practical reliability.
- Concrete errors were not mere stylistic niggles. Several assistants accepted a deliberately false user premise that the ISA annual allowance was £25,000 and proceeded to give guidance based on that inflated figure rather than correcting it; the correct allowance for the tested tax year was £20,000. Acting on such numeric errors can have real tax and regulatory consequences.
- On travel and consumer‑rights prompts, some assistants offered blanket statements — for example, implying passengers are “always” entitled to a full refund for cancelled flights — thereby eliding the exceptions, timing rules and extraordinary‑circumstance clauses that materially affect remedies. That kind of over‑simplification can steer a user toward the wrong course of action.
- Several replies surfaced low‑quality source links or pointed users toward commercial tax‑reclaim services that charge high fees — introducing a tangible consumer‑protection risk where the assistant’s citation hygiene is poor.
Strengths and limits of the test
The Which? approach mattered because it mirrored ordinary consumer behavior and used human experts to weigh practical harms. But the results are a snapshot: model updates, live‑web connectors, and prompt phrasing can materially change results. Which? and accompanying media coverage flagged those caveats and treated per‑model percentages as indicative rather than immutable.

What the EBU/BBC newsroom study adds
The EBU/BBC "News Integrity in AI Assistants" project broadened the lens to news and current affairs, assessing how assistants handle questions that require accurate sourcing and clear separation between opinion and fact. Its principal findings are stark:
- 45% of AI answers contained at least one significant issue.
- 31% of responses had serious sourcing defects (missing, misleading, or incorrect attributions).
- 20% contained major accuracy problems — hallucinated or outdated facts.
Why these failures happen — a tech anatomy
The error modes recorded by Which?, the EBU, and multiple journalistic audits converge on a handful of technical and product‑design fault lines:
- Hallucinations: generative models sometimes invent plausible but false details. Even when a retrieval layer is present, the generative step can synthesize narrative claims not supported by sources.
- Retrieval and provenance failures: when the web retrieval layer surfaces weak, manipulated, or syndicated pages, the assistant may summarize or amplify that content as if it were authoritative.
- Sycophancy / premise acceptance: models optimized for helpfulness often accept the user’s stated facts rather than questioning numeric or legal premises, compounding errors when users supply incorrect inputs (a simple guard against this failure is sketched after this list).
- Optimization tradeoffs: vendors tune assistants for engagement and low refusal rates. More assertive, fluid answers can increase user satisfaction but also the risk of confidently stated falsehoods.
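To make the premise‑acceptance failure concrete, the short Python sketch below shows the kind of pre‑answer guard that could sit in front of a model: it compares any figure the user supplies against a locally maintained reference table and raises a correction warning before the prompt is answered. The reference table, figures, and function names are illustrative assumptions for this article, not any vendor's actual implementation, and the values would have to be kept current against official guidance.

```python
# A minimal premise-check sketch: the reference table, figures, and function
# names are illustrative, not any vendor's implementation. Values must be
# kept current against official HMRC/gov.uk guidance.
import re

# Hypothetical reference values for the tax year under test (verify on gov.uk).
REFERENCE_FIGURES = {
    "isa annual allowance": 20_000,  # GBP
}

def check_numeric_premises(user_prompt: str) -> list[str]:
    """Return warnings for user-supplied figures that contradict reference values."""
    warnings = []
    lowered = user_prompt.lower()
    for name, reference_value in REFERENCE_FIGURES.items():
        if name not in lowered:
            continue
        # Pull any pound amounts mentioned alongside the known quantity.
        for raw_amount in re.findall(r"£\s?([\d,]+)", user_prompt):
            claimed = int(raw_amount.replace(",", ""))
            if claimed != reference_value:
                warnings.append(
                    f"User states the {name} is £{claimed:,}, but the reference "
                    f"value is £{reference_value:,}; correct this before answering."
                )
    return warnings

if __name__ == "__main__":
    prompt = "My ISA annual allowance is £25,000 - how should I split it this year?"
    for warning in check_numeric_premises(prompt):
        print(warning)
```

Even a guard this small would have flagged the £25,000 ISA premise that assistants accepted in the Which? test; the harder engineering problem is keeping the reference data authoritative and up to date.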
Notable examples and consumer impact
The ISA allowance example (finance)
Two widely used assistants failed to challenge a user‑supplied ISA allowance of £25,000 and advised on that basis, risking tax non‑compliance. This illustrates how a single numeric lapse can cascade into legally consequential advice.

Travel refunds (consumer rights)
One assistant answered as if passengers “always” receive refunds for cancelled flights, omitting nuance in EU261/UK261 rules, timing, and rerouting remedies. Misstating that a refund is automatic can mislead users into abandoning a complaint or accepting the wrong remedy.

Questionable referral behavior
Several assistants suggested commercial third‑party tax‑reclaim or premium refund services rather than pointing users at HMRC guidance or free government tools — a real consumer‑protection issue that raises the specter of referral bias and undisclosed commercial pathways.

Cross‑checking the claims — corroboration and caveats
Independent outlets and specialist reporters have echoed the Which? and EBU/BBC findings, producing a consistent narrative across multiple domains. MLex and TechRound summarized the Which? consumer test and its rankings, while Forbes, the EBU press release, and multiple media outlets reported the 45% significant‑issue figure from the newsroom audit. These independent reports converge on the core claims while noting that scores and exact percentages are snapshots tied to specific prompt sets and model versions. Readers should treat single percentage figures as indicative of systemic patterns rather than definitive, immutable rankings.

Where precise details could not be independently corroborated — for example, internal telemetry or vendor‑specific mitigation timelines — the public reports explicitly flagged those items as time‑bound or requiring follow‑up. That caution is sensible: the model landscape is volatile, and a single product update can change behavior within weeks.
Practical guidance for consumers and Windows users
AI assistants are useful time‑savers, but these tests underline the discipline required to use them safely for consumer decisions.
- Treat chatbots as research assistants, not advisers. Use them to compile checklists, draft questions, and surface potential routes — but confirm final, binding actions (tax payments, refunds, clinical decisions) with primary authorities or licensed professionals.
- Be jurisdiction‑specific in your prompts. Tell the assistant your location and the relevant regulatory scope (e.g., “UK ISA allowance for tax year 2025/26”) to reduce cross‑jurisdiction errors.
- Demand and verify sources. Ask the assistant to list timestamped links to official pages (HMRC, NHS, regulator pages). If it points to Reddit, a forum thread, or a thin commercial page, treat that as a red flag and verify with authoritative sites (a small triage helper is sketched after this list).
- Cross‑check numeric thresholds. When answers depend on figures (tax limits, deadlines), verify directly on the regulator’s portal rather than accepting the assistant’s number.
- Preserve an audit trail. Save important exchanges and the primary sources cited. That record can be vital if you need to contest a consumer decision or explain why you acted in a given way.
- Use research modes and citation features for high‑stakes queries. Prefer assistants with transparent citation modes and visible provenance when seeking factual answers. Perplexity’s design emphasis on visible sources was repeatedly cited as an advantage in these audits.
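As a worked example of the source‑verification habit above, the sketch below classifies cited links by domain: recognised UK authorities are accepted, known forum domains are flagged, and everything else is queued for manual checking. The domain lists and the example URLs are illustrative assumptions only, not an official registry, and a real checker would need a much richer policy.

```python
# A citation-triage sketch: the domain lists and example URLs are illustrative
# assumptions, not an official or exhaustive registry of trustworthy sources.
from urllib.parse import urlparse

AUTHORITATIVE_DOMAINS = ("gov.uk", "nhs.uk")        # primary UK authorities (examples)
RED_FLAG_DOMAINS = ("reddit.com", "quora.com")      # forum/thin-content examples

def _matches(host: str, domains) -> bool:
    """True when the host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)

def triage_citation(url: str) -> str:
    """Classify a cited link as 'primary', 'red flag', or 'verify manually'."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if _matches(host, AUTHORITATIVE_DOMAINS):
        return "primary"
    if _matches(host, RED_FLAG_DOMAINS):
        return "red flag"
    return "verify manually"

if __name__ == "__main__":
    for link in [
        "https://www.gov.uk/individual-savings-accounts",
        "https://www.reddit.com/r/UKPersonalFinance/some-thread",   # hypothetical thread
        "https://tax-reclaim-example.co.uk/claim-now",              # hypothetical commercial site
    ]:
        print(f"{triage_citation(link):>16}  {link}")
```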
Enterprise and IT implications
For IT teams and procurement officers, the Which?/EBU results carry immediate governance implications:
- Risk assessment: Factor assistant reliability into vendor selection, especially for decision‑support tasks touching legal, compliance, HR, finance, or safety domains.
- Policy controls: Implement usage policies that restrict assistants from making binding changes (transferring funds, publishing legal statements) without human sign‑off.
- Audit and logging: Ensure that AI interactions are logged, versioned, and retained for compliance review. That audit trail can be essential for incident response and regulatory inquiries (a minimal logging sketch follows this list).
- Vendor SLAs and transparency: Require vendors to disclose grounding strategies (how web retrieval is curated), update cadences, and mechanisms for flagging uncertain answers.
- Hybrid human‑in‑the‑loop workflows: Design processes where assistants draft and humans verify — this preserves the productivity benefit while mitigating downstream harm.
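For the audit‑and‑logging point above, the sketch below shows one minimal way to retain assistant interactions for compliance review, assuming an append‑only JSON Lines file is an acceptable store. The field names, model identifier, and file location are hypothetical; a production system would add access controls, versioning, and retention policies on top.

```python
# A minimal audit-log sketch for assistant interactions. The file path, field
# names, and model identifier are hypothetical examples, not a vendor API.
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("assistant_audit_log.jsonl")   # hypothetical retention location

def log_interaction(user_id: str, model: str, prompt: str,
                    response: str, cited_sources: list[str]) -> None:
    """Append one prompt/response exchange, with its sources and a UTC timestamp."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model": model,                  # record the model/version actually used
        "prompt": prompt,
        "response": response,
        "cited_sources": cited_sources,  # preserved for later provenance checks
        "human_signoff": False,          # flipped only after a reviewer approves
    }
    with LOG_PATH.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    log_interaction(
        user_id="analyst-42",
        model="example-assistant-2025-11",
        prompt="What is the ISA allowance for 2025/26?",
        response="The annual ISA allowance is £20,000.",
        cited_sources=["https://www.gov.uk/individual-savings-accounts"],
    )
```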
Vendor mitigations and product design choices
Vendors have responded to audits with a range of mitigations — but implementation and efficacy vary:
- Citation and provenance features: Some assistants now surface links and summaries from specific pages; others provide “research” or web‑connected modes designed to improve traceability. These features help but do not eliminate hallucinations or retrieval of low‑quality content.
- Safety wording and refusals: Tools differ in their refusal rates. Systems tuned for low refusal rates and high engagement may give more confident, but riskier, answers; those tuned for caution refuse more often but can frustrate users.
- Prompt engineering and verification layers: Organizations can add verification steps — e.g., forcing a model to cite at least two independent authoritative sources for high‑stakes outputs — but that requires product and policy alignment.
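The “at least two independent authoritative sources” rule mentioned above can be expressed as a simple gate that runs before a high‑stakes answer is released. The sketch below counts distinct authoritative hosts among a draft answer’s citations and blocks the answer otherwise; the authority list, the threshold, and the escalation step are policy assumptions for illustration, not a feature of any particular assistant.

```python
# A sketch of a "two independent authoritative sources" gate. The authority
# list, threshold, and escalation step are policy assumptions, not a vendor feature.
from urllib.parse import urlparse

AUTHORITATIVE_SUFFIXES = ("gov.uk", "nhs.uk", "europa.eu")   # illustrative policy list

def passes_source_gate(cited_urls: list[str], minimum: int = 2) -> bool:
    """Require citations from at least `minimum` distinct authoritative hosts."""
    independent_hosts = set()
    for url in cited_urls:
        host = urlparse(url).netloc.lower().removeprefix("www.")
        if host == AUTHORITATIVE_SUFFIXES[0] or host.endswith(AUTHORITATIVE_SUFFIXES):
            independent_hosts.add(host)
    return len(independent_hosts) >= minimum

if __name__ == "__main__":
    draft_citations = [
        "https://www.gov.uk/individual-savings-accounts",
        "https://www.gov.uk/individual-savings-accounts",    # duplicate host counts once
        "https://example-tax-blog.co.uk/isa-tips",           # hypothetical, non-authoritative
    ]
    if not passes_source_gate(draft_citations):
        print("Blocked: escalate the draft answer to a human reviewer.")
```

A gate like this is only as good as the authority list behind it, which is one reason procurement requirements around grounding transparency matter.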
Policy and regulatory takeaways
The tests highlight structural governance questions for regulators and platform providers:
- Standardized provenance: Industry and policy makers should push for machine‑readable provenance and stronger citation standards. Clear attribution that ties claims to primary authorities is a minimum bar for consumer applications.
- Independent auditing: Regular, repeatable independent testing — like the EBU/BBC and Which? projects — should be part of the accountability toolkit for consumer protection authorities, media regulators, and procurement rules.
- Consumer education: Public campaigns are necessary to teach basic verification habits: demand sources, confirm numbers, and treat AI as an assistant, not an authority.
Critical assessment — strengths, limits, and risks
Notable strengths
- AI assistants are fast, broadly accessible, and effective for low‑stakes triage, drafting and summarization tasks.
- Platforms that emphasize transparent sourcing and research modes demonstrably reduce verification friction for users.
- Independent audits like Which? and the EBU/BBC study provide actionable, reproducible frameworks for assessing real‑world risk.
Key limitations and risks
- Snapshot nature of audits: Results reflect specific model versions and prompt sets; updates can change behavior quickly. Treat rankings as temporal rather than permanent.
- Commercial and referral bias: Assistants that surface third‑party commercial links can steer consumers toward paid services, intentionally or not — a regulatory blind spot where consumer‑protection rules need to catch up.
- Human behavior risk: Automation bias (users trusting confident AI answers) amplifies the harm potential — a technological error becomes a consumer loss when trusted and acted upon.
- Scale of downstream harm: Financial, legal, and health domains carry outsized risk. A misleading tax figure or incorrect medical caveat can impose tangible costs or endanger health.
Unverifiable or time‑sensitive claims
Some media summaries vary slightly in per‑model percentage points; where variation exists, it reflects different sample subsets, scoring rubrics, or prompt phrasing. Similarly, vendor statements about “fixed” issues must be validated by independent re‑testing, because product updates can change model behavior rapidly. Any claim that a specific vendor “has fixed X” should be treated as provisional until independently re‑verified.

Practical checklist: safer consumer use of AI chatbots
- Ask for sources and click through to the primary authority.
- Verify numeric thresholds (tax, benefits, deadlines) on official regulator sites.
- Use two independent information channels before acting on high‑stakes advice.
- Save and timestamp important exchanges and their cited sources.
- For legal/medical/financial decisions, consult a licensed professional before executing actions.
- For enterprises: log assistant interactions, require human sign‑off, and include provenance checks in procurement criteria.
Conclusion
The Which? consumer test and the EBU/BBC newsroom audit paint a consistent, cautionary picture: AI chatbots have reached a level of usability and ubiquity that makes their errors consequential. They are valuable research and productivity tools, but the gap between polished conversational delivery and grounded, jurisdiction‑sensitive correctness remains wide enough to matter for consumers and organizations alike. The practical response is not to eschew these tools, but to adapt how they are used: insist on provenance, verify numeric facts against primary authorities, and design human‑in‑the‑loop workflows where stakes are high. Independent auditing, standardized provenance, and consumer education are the pragmatic levers that will reduce harm as conversational AI becomes further embedded into daily decision‑making.

Source: Digital Information World UK Study Finds Popular AI Tools Provide Inconsistent Consumer Advice