AI Assistants Misstate Finance, Health and Legal Advice: Safer Use Tips

Major consumer AI assistants including ChatGPT, Google Gemini, Microsoft Copilot, Meta AI and Perplexity are regularly producing inaccurate, misleading — and in a few cases potentially dangerous — guidance on finance, health, travel and legal matters, according to a recent round of consumer-facing tests by the consumer group Which?, as reported in the press. The investigation tested a fixed set of 40 everyday consumer questions across finance, legal, health, diet, consumer rights and travel, and found systemic weaknesses: inconsistent sourcing, factual errors, and responses that treat uncertain or jurisdiction-specific issues as settled fact. These results are part of a wider pattern of independent audits showing conversational AIs frequently misattribute sources, rely on poor or outdated webpages, and fail to flag uncertainty — problems that matter because large numbers of people now use AI as an entry point for decision-making.

Background

Why this matters now​

Adoption of conversational AI as a general-purpose search and assistance tool has moved from niche to mainstream. Millions of people treat chatbots and “AI-overviews” as a first stop for practical problems — from whether they can claim a refund on a cancelled flight to how much they can legally put into a tax-advantaged account. When the assistant answers confidently but is wrong, the consequences can be financial, legal, or medical. Independent, journalist-led audits and public-service studies have documented that the problem is systemic: in a large international audit coordinated by the European Broadcasting Union and led by the BBC, journalists found significant issues in roughly 45% of AI assistants’ news-related answers, with sourcing failures in about a third of responses. That large-scale finding aligns with smaller consumer tests that show the same classes of error leaking into practical advice.

The investigation summarized​

The consumer-focused assessment tested 40 realistic consumer scenarios (finance, health, travel, consumer rights, diet, and legal) across six mainstream AI tools: ChatGPT, Google Gemini, Gemini AI Overview (AIO), Microsoft Copilot, Meta AI and Perplexity. Responses were scored by experts for accuracy, relevance, clarity, usefulness and ethical responsibility. Results varied across tools: Perplexity topped the list in that assessment, while Meta AI and ChatGPT were among the lower-scoring systems on overall accuracy in this test. The testing found concrete, risky errors — for example, incorrect tax allowances, misleading travel refund advice, and health recommendations that contradict established public-health guidance. These troubling examples are not isolated: they mirror the failure modes highlighted in other independent research.

What the tests found — clear examples and why they matter​

1) Finance: tax allowance misstatements that could cost users​

One striking example was an assistant that accepted a deliberately incorrect premise about the UK individual savings account (ISA) allowance. Two of the tested tools — ChatGPT and Microsoft Copilot — failed to correct a user-supplied figure of £25,000 and went on to give advice based on that inflated allowance rather than the correct annual limit. That kind of error matters because exceeding the annual ISA allowance can require remediation by HM Revenue & Customs: the ISA annual allowance for the 2025/26 tax year is £20,000, and current government guidance confirms it remains £20,000, so advice based on the wrong number could prompt a user to breach tax rules or mis-structure savings. Consumers should verify numeric thresholds against an authoritative source before acting.

Why this failure occurred: AI assistants often accept and build on user-provided premises instead of checking core numeric facts, especially when the prompt implies the user already ‘knows’ a value. When the model’s internal knowledge or its retrieval layer is out of date, it can compound an error rather than correct it.

2) Travel: blanket statements about refunds that ignore nuance​

When asked about passenger rights after a cancelled flight, one assistant (Copilot in the reported test) answered as if passengers were always entitled to a full refund. That blanket phrasing is dangerously simplistic: under EU261/UK261 rules passengers generally have the right to reimbursement or re-routing in many cancellation scenarios, but additional conditions, timeliness, and extraordinary-circumstance exemptions all matter. The European Commission’s guidance on air passenger rights makes clear that passengers are ordinarily entitled to reimbursement for unused parts of the ticket and/or re-routing, but the precise remedy depends on timing, the reason for cancellation and what alternative the airline offers. Simplified, blanket statements therefore risk misleading users into the wrong course of action — for instance, dropping an airline complaint when a different option would yield compensation.

Why this failure occurred: conversational AIs compress regulatory language into short answers and sometimes drop the caveats; they may also conflate “what most users can expect” with “what always applies.” This is a classic risk when legal or regulatory nuance is translated into natural language without explicit qualifiers.

3) Health: public‑health guidance flipped or misrepresented​

In the health domain the tests produced examples where model responses contradicted authoritative clinical guidance. One reported instance is a model recommending against using vaping to quit smoking — advice that is at odds with NHS and UK public-health guidance, which recognises e-cigarettes as a less harmful alternative and a potential aid to quitting for adults. NHS and government guidance explicitly state that e‑cigarettes can help smokers quit and should not be routinely discouraged. When an assistant reverses an evidence-backed clinical recommendation, it can steer users away from safer options and toward harm.

Why this failure occurred: medical and public‑health guidance evolves and is jurisdiction-specific. Without robust retrieval from verified medical sources or explicit model disclaimers, assistants can synthesise plausible but incorrect statements — sometimes reflecting skeptical commentary rather than clinical consensus.

4) Sourcing failures: reliance on weak or outdated sources​

A recurring theme across the testing was poor source selection: assistants sometimes relied on forum threads, user-generated content or stale pages rather than primary authoritative documents. In one example, an AI cited a three‑year‑old Reddit thread to justify the “best time to book flights” guidance; in other cases, Reddit appeared as the cited authority for epidemiological comparisons. Independent audits of AI assistants have repeatedly documented sourcing problems: attribution is often missing, misleading or incorrect, and some tools add “ceremonial citations” that appear to support a claim but do not. These sourcing failures both reduce transparency and make it harder for users to check answers themselves.
Why this failure occurred: retrieval-augmented systems depend heavily on the quality of the web index they query. If the retrieval layer returns low-quality or manipulative content, the model can amplify it. Models also sometimes choose sources that match the narrative rather than the highest-authority document.

Technical analysis: why mainstream AIs keep making the same mistakes​

Model architecture and the hallucination problem​

Most consumer chatbots are large language models trained to predict plausible continuations of text. That training objective prioritises fluency and contextual coherence, not factual grounding. When the model lacks a precise fact or when its retrieval layer surfaces weak evidence, the output can be a confident-sounding fabrication — commonly called a “hallucination.” Hallucinations are not random: they are often internally consistent, which makes them persuasive and dangerous.

Retrieval quality and provenance​

Many modern assistants use a retrieval-augmented generation (RAG) architecture: the system searches a web index and conditions its generation on retrieved snippets. RAG reduces certain hallucinations but inherits a new failure mode: garbage in, garbage out. If the index contains cloaked content, manipulated pages, or low‑quality forum threads, the assistant will cite and synthesise them. Independent tests show sourcing failures are one of the largest single drivers of misleading answers.
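
To make the “garbage in, garbage out” point concrete, the short Python sketch below mimics a RAG pipeline in miniature: a toy index containing one official page and one stale forum post, a ranking step weighted by invented source-authority scores, and a prompt built from whatever the ranking returns. It is purely illustrative (the index, the authority table and the function names are assumptions, not any vendor's implementation), but it shows why the quality of the retrieval step bounds the quality of the final answer.

    # Illustrative sketch of a retrieval-augmented generation (RAG) pipeline.
    # The "index", authority weights and prompt format are invented for this
    # example; real systems query a live web index and call a hosted model.
    from dataclasses import dataclass

    @dataclass
    class Snippet:
        url: str
        text: str

    # A toy web index: one authoritative page and one stale forum thread.
    INDEX = [
        Snippet("https://www.gov.uk/individual-savings-accounts",
                "You can save up to £20,000 in ISAs in the 2025/26 tax year."),
        Snippet("https://old-forum.example/thread/123",
                "Pretty sure the ISA limit went up to £25,000 this year."),
    ]

    # Hypothetical authority weights: official domains outrank user-generated content.
    AUTHORITY = {"gov.uk": 1.0, "nhs.uk": 1.0, "example": 0.2}

    def authority_score(url: str) -> float:
        return max((w for domain, w in AUTHORITY.items() if domain in url), default=0.1)

    def retrieve(query: str, k: int = 1) -> list[Snippet]:
        # Real retrieval also ranks by relevance; ranking only by authority here
        # makes the point that the ranking choice decides which "facts" the model sees.
        return sorted(INDEX, key=lambda s: authority_score(s.url), reverse=True)[:k]

    def build_prompt(query: str, snippets: list[Snippet]) -> str:
        context = "\n".join(f"- {s.text} (source: {s.url})" for s in snippets)
        return f"Answer using only the sources below.\n{context}\n\nQuestion: {query}"

    query = "What is the UK ISA allowance for 2025/26?"
    print(build_prompt(query, retrieve(query)))
    # If the forum snippet ranked first instead, the generated answer would
    # simply repeat the incorrect £25,000 figure.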

The “agree and build” tendency​

Conversational models are optimized to be helpful and accommodating. That yields a behavior called sycophancy: the model often agrees with user premises rather than challenging them. In practical advice scenarios this becomes a liability — when a user asserts a wrong numeric limit, the model continues the calculation on that faulty basis rather than asking a clarifying question or checking a canonical source.
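
One mitigation, useful both to builders and to anyone scripting against an AI API, is to check user-supplied figures against a canonical reference before letting a model build on them. The sketch below uses a hard-coded value and a crude regular expression purely for illustration; in practice the canonical figure should come from the regulator's own page.

    import re

    # Illustrative canonical value; in practice this should be fetched from the
    # regulator's own guidance (e.g. gov.uk), not hard-coded.
    ISA_ALLOWANCE_2025_26 = 20_000

    def check_isa_premise(message: str) -> str | None:
        """Return a correction if the user asserts a wrong ISA allowance, else None."""
        match = re.search(r"£\s*([\d,]+)", message)
        if not match:
            return None
        claimed = int(match.group(1).replace(",", ""))
        if claimed != ISA_ALLOWANCE_2025_26:
            return (f"You mentioned £{claimed:,}, but the ISA allowance for 2025/26 "
                    f"is £{ISA_ALLOWANCE_2025_26:,}; answering on the corrected figure.")
        return None

    print(check_isa_premise("Can I split my £25,000 ISA allowance across two accounts?"))

A well-designed assistant would perform an equivalent check internally; a cautious user can approximate it simply by asking the model to confirm any number they supplied before it is used.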

Update cadence and temporal staleness​

Some assistants combine static model weights (which have a knowledge cutoff) with live web retrieval. When the retrieval layer is absent, misstatements about recent events or policy changes persist. When retrieval exists but indexes are stale or manipulated, the system still supplies old or misleading claims. Independent audits repeatedly show both staleness and misattribution as drivers of error.
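
A related guard, again purely illustrative, is to discard retrieved pages older than some threshold when the question concerns current rules or prices; the field names and the one-year cutoff below are assumptions, not any product's documented behaviour.

    from datetime import date, timedelta

    # Illustrative retrieval results with assumed publication dates.
    results = [
        {"url": "https://www.gov.uk/individual-savings-accounts", "published": date(2025, 4, 6)},
        {"url": "https://old-blog.example/isa-rules-explained", "published": date(2021, 3, 1)},
    ]

    def fresh_only(results: list[dict], max_age_days: int = 365) -> list[dict]:
        """Keep only results published within the last max_age_days."""
        cutoff = date.today() - timedelta(days=max_age_days)
        return [r for r in results if r["published"] >= cutoff]

    # For a "what is the current rule?" query, the 2021 page is dropped before
    # it can be quoted back to the user as if it were still accurate.
    print(fresh_only(results))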

Strengths: where AI assistants actually help​

Despite the shortcomings, these tools retain important capabilities that make them useful when used correctly.
  • Speed: they synthesize large amounts of text into concise summaries, saving time for initial research.
  • Framing and triage: AIs provide quick triage (e.g., “this looks like a tax question, you should check HMRC” or “this symptom warrants urgent medical review”) that can help users prioritize.
  • Citation-aware modes: some platforms (notably certain research modes and Perplexity’s approach) are explicitly built to surface sources and give links, making validation easier when users check.
  • Accessibility: for routine tasks — drafting emails, summarizing long documents, producing checklists — assistants offer measurable productivity benefits.
These strengths are valuable when users treat AI output as a first draft or a checklist rather than a definitive instruction set.

Risks and systemic concerns: the broader picture​

  • Automation bias: users are inclined to trust confident AI answers, especially when the interface looks authoritative. Surveys show a significant fraction of people trust AI output to a “great” or “reasonable” extent — a dangerous mismatch with observed accuracy.
  • Financial and legal harm: incorrect advice on tax allowances, claims, or legal entitlements can carry real financial penalties.
  • Health risks: misleading medical advice can lead to delayed treatment or unsafe choices.
  • Erosion of trust: widespread misattribution and fabrications risk undermining public confidence in both AI tools and the original sources they cite.
  • Manipulation of indices: adversarial actors can game retrieval indices (cloaking, poisoned pages), causing assistants to learn and repeat falsehoods at scale.

Practical guidance — how to use AI tools more safely​

AI tools are useful but not (yet) authoritative. The following practical workflow is designed for consumers who will continue to use AI for everyday questions but want to reduce risk.
  • Be specific and jurisdiction-aware. Tell the assistant your jurisdiction and be precise: e.g., “UK tax rules for ISAs for 2025/26 (England and Wales).” Specificity reduces the risk of cross-jurisdiction errors.
  • Demand sources and verify them yourself. Ask the model to list sources and provide links, then check those links against official pages (HMRC, NHS, EU Commission, local regulators). If the assistant cites a forum or Reddit post for a legal or medical claim, treat that as a red flag; a simple domain-checking sketch follows this list.
  • Cross-check with at least two independent authorities. Don’t rely on a single AI answer. For consumer-critical topics (tax, health, legal) consult the regulator or a qualified professional, and use two independent AI tools or a manual web search to confirm consensus.
  • Keep an audit trail for important decisions. Save the AI exchange and the primary sources you checked. If an advice-based decision goes wrong, the saved trail helps with remediation.
  • Use AI for drafting and triage, not execution. Use assistants to generate checklists, draft communications, or summarize options — but don’t let an assistant alone decide to transfer funds, accept a legally binding change, or modify medical treatment.
  • Prefer tools with transparent citation modes for high-stakes queries. Some platforms have “research” or “web-connected” modes that include explicit citations. Use those modes for critical information and confirm the sources they list.
  • For high-stakes matters, consult a professional. Always ask a licensed professional before acting on legal, financial or medical decisions where an error could be costly or dangerous.
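
For readers comfortable running a few lines of code, the sketch below (referenced in the “demand sources” item above) mechanises the verification step: it splits an assistant’s cited links into those on a small allowlist of official domains and those that need manual checking. The allowlist is an example only and should be adapted to your own jurisdiction.

    from urllib.parse import urlparse

    # Example allowlist of official domains; extend it for your own jurisdiction.
    TRUSTED_DOMAINS = {"gov.uk", "nhs.uk", "europa.eu", "caa.co.uk"}

    def split_citations(cited_urls: list[str]) -> tuple[list[str], list[str]]:
        """Separate cited links into trusted sources and links to verify manually."""
        trusted, needs_checking = [], []
        for url in cited_urls:
            host = urlparse(url).netloc.lower()
            if any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS):
                trusted.append(url)
            else:
                needs_checking.append(url)
        return trusted, needs_checking

    citations = [
        "https://www.gov.uk/individual-savings-accounts",
        "https://www.reddit.com/r/flights/comments/abc123",  # weak source for a rights question
    ]
    trusted, needs_checking = split_citations(citations)
    print("Verify these manually:", needs_checking)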

What companies say — vendor responses and limits​

In response to the criticisms, major vendors have emphasised different mitigations: in-app reminders to double-check facts, prompts recommending professional advice for sensitive topics, and citation features that surface sources. Google has highlighted built-in reminders and professional-referral prompts in Gemini, Microsoft points to linked citations in Copilot, and OpenAI directs users to ChatGPT’s built-in web search when researching consumer products. These vendor-level safeguards are useful but inconsistent across products, and they do not eliminate the need for user verification. Developers are still grappling with source quality, grounding, and the real-time policing of manipulated web content.

Policy and industry implications​

  • Standardized provenance: industry and regulators should press for standardized, machine-readable provenance signals (clear, verifiable citations tied to primary sources); a toy example of such a record follows this list.
  • Independent auditing: the EBU/BBC and other journalist‑led audits demonstrate that independent, repeatable testing is feasible and necessary. Ongoing auditing could create transparency benchmarks for the industry.
  • Consumer education: broad public campaigns are needed to teach consumers basic habits — demand sources, cross-check facts, and treat AI like an assistant, not an authority.
  • Technical defenses against index manipulation: search‑index hardening and detection of AI-targeted cloaking should be a priority. Adversarial actors are already manipulating indices and seeding answers that sound authoritative but are false.
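
Purely as an illustration of what “machine-readable provenance” could look like (the field names below are invented, not a proposed or existing standard), a per-claim record might tie each statement in an answer to a primary source and a retrieval timestamp:

    from dataclasses import dataclass, asdict
    import json

    # Invented schema for a single-claim provenance record; not a real standard.
    @dataclass
    class ProvenanceRecord:
        claim: str
        source_url: str
        source_publisher: str
        retrieved_at: str      # ISO 8601 timestamp of when the source was fetched
        supports_claim: bool   # whether the cited page actually states the claim

    record = ProvenanceRecord(
        claim="The ISA annual allowance for 2025/26 is £20,000.",
        source_url="https://www.gov.uk/individual-savings-accounts",
        source_publisher="GOV.UK",
        retrieved_at="2025-11-20T09:00:00Z",
        supports_claim=True,
    )
    print(json.dumps(asdict(record), indent=2))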

Caveats and unverifiable claims​

A few test details and specific per-model scores in media summaries vary slightly between outlets. Where percentage scores differ across reports, that reflects differences in the question pool, scoring rubrics, timing, and the particular model versions tested. Any single percentage should be treated as indicative rather than definitive. When a published result references a vendor’s internal model version (for example, a named “GPT‑5” or a particular Gemini build), readers should check the vendor’s own product documentation because model updates can materially change behavior in weeks. If a vendor’s statement asserts that “the latest model fixes X,” that claim should be validated by independent retesting where possible.

Bottom line and practical takeaway​

AI assistants have reached a utility threshold where they significantly speed up everyday tasks and triage, but their current fault modes make them unsuitable as sole advisers for finance, legal, and medical decisions. The recent consumer-focused tests amplify a pattern seen across independent research: sourcing failures, overconfident hallucinations, and an inclination to accept user premises produce real-world risk. Consumers should treat AI answers as starting points, not final answers, and follow a disciplined verification workflow: demand sources, check them (especially official regulator or professional pages), and consult a qualified human for any consequential action. The technology is improving rapidly, but until models are demonstrably reliable and transparent in their provenance, user scepticism and structured checks are the safest tools we have.

Short checklist: Smart prompts and immediate actions​

  • Always state your jurisdiction and the tax year or medical context.
  • Ask the assistant: “List authoritative sources you used and provide links.”
  • If sources include forums or Reddit, stop and verify independently.
  • For numbers (allowances, rates), confirm on the regulator’s site (HMRC, NHS, CAA/EU Commission).
  • For legal or medical steps, save the exchange and book a professional consultation before acting.
By combining AI’s speed with a discipline of source verification and human expertise, users preserve the benefits while minimising the harms that follow from misplaced trust.

Source: “AI tools giving risky advice, Which? warns” - Tech Digest
 
