A fresh round of independent audits has delivered a blunt message to anyone treating chatbots as authoritative assistants: conversational AI is useful, but still unsafe to trust without verification. A UK consumer test of six mainstream chatbots gave the best performer — Perplexity — roughly a 71–72% reliability score on 40 real‑world consumer queries, while other household names lagged behind and produced a steady stream of glaring errors on finance, travel, legal and health topics. Those consumer results sit alongside a large, journalist‑led audit showing roughly 45% of news‑related AI answers contain at least one significant problem. The upshot for Windows users, IT teams and anyone building AI into workflows is straightforward: AI chatbots can accelerate work, but they frequently trade fluency for factual grounding, and that tradeoff has measurable costs.
Background
What the consumer test measured
Which?, the UK consumer group, staged a practical, expert‑evaluated drill: 40 realistic, jurisdiction‑sensitive consumer prompts spanning personal finance, travel refunds, consumer rights, simple legal entitlements and health/diet questions. Each of six mainstream assistants—Perplexity, Google’s Gemini (including its “AI Mode”/overview), Microsoft Copilot, OpenAI’s ChatGPT, and Meta’s assistant—received the same prompts and was scored by subject experts for accuracy, clarity, usefulness and ethical responsibility. The ranking showed wide variance: Perplexity at the top (roughly 71–72%), Google’s AI Mode and Gemini close behind, Microsoft Copilot and ChatGPT mid‑pack, and Meta AI trailing. The testing highlighted concrete mistakes with real tax thresholds, over‑confident legal assertions, and health or nutrition guidance that contradicted authoritative sources.
What the newsroom audit found
A complementary study coordinated by the European Broadcasting Union and led by the BBC evaluated more than 3,000 AI replies to news queries across 14 languages and 18 countries. Journalists scored responses against newsroom standards—accuracy, sourcing/provenance, context and separation of fact from opinion—and found that 45% of replies contained at least one significant issue. Sourcing failures were common: about 31% of responses had missing, misleading or incorrect attributions, and roughly 20% showed major factual or temporal errors such as inventing events or naming the wrong officeholder. The study concluded these failure modes are systemic and multilingual, not isolated to a single vendor or language.
The evidence in plain terms
Fluency is not the same as factuality
Modern chatbots are engineered to produce fluent, conversational text. That fluency makes errors feel plausible: confident prose masks the difference between a well‑sourced answer and a statistically plausible fabrication. In tests, assistants will invent dates, misquote sources, or accept user‑supplied false premises (for example, an intentionally wrong ISA allowance) and proceed to give dangerous advice rather than flagging uncertainty. This mismatch—style over substance—is central to the reliability problem.
Sourcing and provenance failures dominate
The newsroom audit’s most common failure mode was broken or ceremonial sourcing: answers that appear to cite evidence but either point to the wrong document or attach an attribution that does not support the claim. When a user cannot follow an assistant’s chain of evidence, the ability to verify claims disappears; the result is a polished but unverifiable narrative. This is particularly dangerous for news summaries, legal or medical queries, and any decision where provenance matters.
Hallucination remains a practical hazard
“Hallucination” is the industry term for plausible‑sounding but fabricated facts. Independent audits, trick‑prompt experiments and reporter red teams keep finding hallucinations in all major assistants. These are not harmless typos—audits recorded fabricated legal cases, altered quotes, and invented events that would materially mislead a user if accepted unchecked. The rising use of real‑time web retrieval has reduced non‑response rates but, in some audits, increased the percentage of wrong answers because models now digest a noisy and hostile web as a source of “facts.”
Users often still trust and act on AI outputs
Surveys and behavioral data show many people treat AI summaries and chat replies as useful and, in some cases, trustworthy. Pew Research found more than half of Americans who had encountered AI summaries in search results judged them “somewhat useful,” and significant numbers said they at least somewhat trusted them. At the same time, workplace studies reveal that many employees use AI outputs without adequate verification—creating an operational risk when AI mistakes propagate into business decisions. These behavioural factors amplify the systemic risk created by model shortcomings.
Why this matters for Windows users, enterprises and everyday consumers
When errors have real consequences
- Personal finance: a wrong numeric threshold (e.g., ISA allowance) can prompt tax non‑compliance or costly restructuring. Some assistants accepted false premises and built their advice on them, rather than challenging the premise.
- Travel and consumer rights: blanket claims (e.g., “you are always entitled to a full refund”) ignore jurisdictional differences and can steer consumers away from appropriate remedies.
- Health and nutrition: misstatements that contradict public health guidance can produce real physical harm when followed. Tests and case studies have documented scenarios where users acted on flawed AI health guidance with negative outcomes.
Windows and Microsoft Copilot implications
Microsoft’s Copilot is integrated into Windows and Office for many users; that integration makes errors more consequential because outputs are now embedded into documents, e‑mails and decision logs. Enterprises that enable Copilot or similar assistants must treat those outputs as drafts requiring verification, not final legal or financial advice. The same applies to any AI that writes policies, summarises compliance documents, or assists in customer communications.
The workplace governance problem
The global survey by KPMG and the University of Melbourne shows high adoption but low governance: many employees use AI without formal authorization, and a substantial share rely on AI outputs without adequate fact‑checking—creating data‑exfiltration, privacy and compliance risks. The study found that a majority of employees who use AI at work do so without evaluating results thoroughly, and many conceal their use from managers. This makes technical fixes insufficient without policy, training and auditing.
Strengths: where assistants already add value
- Rapid triage and summarisation: assistants can digest long documents and produce readable summaries that help triage work or focus human attention on key items. This speeds research, reduces repetitive tasks, and can dramatically increase productivity when used as a drafting aid.
- Natural language accessibility: chat interfaces lower the barrier for non‑expert users to get a starting answer quickly—useful for brainstorming and ideation.
- Integration benefits: when combined with verified data connectors (enterprise knowledge bases, legal databases, EHRs) and strict provenance layers, assistants can become powerful productivity multipliers.
- Improvements in citation behaviour: some systems (notably research‑focused tools) treat citations and source snippets as first‑class outputs, which helps users verify claims more easily when those citations are honest and precise.
Risks and failure modes to plan around
- Sourcing that looks good but is wrong: ceremonial citations that point to unrelated pages or outdated forum threads. This is the single largest operational risk for misinformation (ebu.ch: “Largest study of its kind shows AI assistants misrepresent news content 45% of the time – regardless of language or territory”).
- Stale or rapidly changing facts: models can confidently state facts that have recently changed (officeholders, laws, deadlines). Live‑web access helps, but only if retrieval layers and filtering are reliable.
- Acceptance of user falsehoods: chatbots often build on user‑supplied premises without checking them; that means a conversation starting from a false assumption can yield dangerously wrong recommendations.
- Data leakage and privacy: employees uploading proprietary material to public models risks IP and compliance breaches; misconfigured enterprise deployments can leak sensitive logs. KPMG’s survey found many employees have uploaded company data to public AI tools.
- Regulatory and legal exposure: incorrect AI‑generated statements used in filings, client communications, or legal briefs have already created liability in courts and professional settings. Hallucinated citations or fabricated cases are particularly hazardous in legal practice.
Practical, evidence‑backed rules for safe use
Below are operational steps to adopt immediately for anyone using or provisioning AI assistants in consumer or enterprise contexts. These are derived from the audit findings and the governance research discussed above.
- Treat AI outputs as drafts, not decisions. Always verify numerical thresholds, legal entitlements and health guidance with an authoritative source before acting.
- Demand provenance. Favor tools that surface clear, clickable citations to original, authoritative sources and show retrieval timestamps and confidence levels. If a tool’s citations are vague or missing, treat the answer as suspect.
- Configure enterprise connectors to authoritative data only. When possible, point assistants at verified internal systems, licensed databases and curated sources rather than an unconstrained web crawl. This reduces hallucination risk.
- Train staff and create AI policies. Require training on how to verify AI outputs, restrict upload of sensitive content, and log AI usage for audit and compliance. The KPMG study found governance lags adoption—closing that gap is essential.
- Use clarifying prompts and challenge fallacies. Encourage users to ask follow‑ups like “What authority or statute supports this?” or “Show the exact passage you relied on.” Assistants that can’t show their evidence should not be relied on.
- Add human‑in‑the‑loop checks where risk is material. For legal, tax, medical or financial decisions, require a qualified human review before execution. This should be encoded in process workflows and SLOs.
- Log and monitor AI outputs in production. Keep an audit trail linking prompts, responses, and the human reviewer to enable post‑incident analysis; a minimal logging sketch follows this list.
- Be skeptical of summary‑first interfaces. Users often stop after a concise AI summary; encourage “click‑through” behaviour to original sources and surface the limitations and date of knowledge. Pew’s behavioural data shows users seeing AI summaries click through far less often.
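To make the logging step concrete, here is a minimal sketch of what an audit‑trail record might look like. It assumes a simple JSONL file as the log store; the field names (prompt, response, citations, reviewer, approved) and the helper `log_interaction` are illustrative assumptions rather than any vendor's schema, and a real deployment would write to a tamper‑evident store with access controls.

```python
# Minimal sketch of an AI audit-trail record (illustrative field names, JSONL storage assumed).
import json
import hashlib
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AIAuditRecord:
    user_id: str
    prompt: str
    response: str
    citations: list[str] = field(default_factory=list)  # URLs or document IDs the assistant surfaced
    reviewer: str | None = None                          # human who checked the output, if any
    approved: bool = False
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    @property
    def response_hash(self) -> str:
        """Stable fingerprint of the answer so later edits can be detected."""
        return hashlib.sha256(self.response.encode("utf-8")).hexdigest()

def log_interaction(record: AIAuditRecord, path: str = "ai_audit_log.jsonl") -> None:
    """Append one prompt/response/reviewer record for post-incident analysis."""
    entry = asdict(record)
    entry["response_hash"] = record.response_hash
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Example: record an assistant answer and the human who signed off on it.
log_interaction(AIAuditRecord(
    user_id="jdoe",
    prompt="What is the current ISA allowance?",
    response="The annual ISA allowance is ...",  # verify against the authoritative source before acting
    citations=["https://www.gov.uk/individual-savings-accounts"],
    reviewer="finance-team",
    approved=True,
))
```

The point of the hash and timestamp is simply that the trail remains useful after the fact: an auditor can tie a decision back to the exact text the assistant produced and to the person who approved it.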
Technical interventions that reduce risk
- Retrieval‑augmented generation with strict source whitelists: limit retrieval to vetted domains and curated repositories (a minimal sketch follows this list).
- Provenance metadata: require models to return the exact snippets and URLs (or internal document IDs) used to construct any factual claim.
- Fact‑checking layers: run high‑risk answers through a secondary check against canonical databases (legislation, tax codes, PubMed, etc.) before presenting them to the user.
- Uncertainty calibration: force assistants to surface confidence scores and explicit hedges when evidence is weak; don’t allow categorical “yes/no” answers when the model’s retrieval returns low‑quality sources.
- Usage telemetry and anomaly detection: track when assistants shift from citing high‑quality sources to low‑quality forums or newly‑created sites—this can detect poisoning or grooming attacks early.
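As an illustration of the first two interventions, the sketch below shows whitelist‑constrained retrieval with provenance metadata and a simple evidence gate. The allow‑list, relevance scores and thresholds are assumptions chosen for the example, not a recommended configuration, and the `SourcedPassage` and `select_evidence` names are hypothetical.

```python
# Minimal sketch: restrict retrieved evidence to an allow-list of domains,
# carry provenance metadata, and force a hedge/refusal when nothing qualifies.
from urllib.parse import urlparse
from dataclasses import dataclass

ALLOWED_DOMAINS = {"gov.uk", "legislation.gov.uk", "pubmed.ncbi.nlm.nih.gov"}  # example allow-list

@dataclass
class SourcedPassage:
    url: str
    snippet: str
    retrieved_at: str  # ISO timestamp recorded at fetch time
    score: float       # relevance score from your retrieval layer

def domain_allowed(url: str) -> bool:
    """Keep only sources whose host matches or falls under an allow-listed domain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

def select_evidence(candidates: list[SourcedPassage],
                    min_score: float = 0.6,
                    min_sources: int = 1) -> list[SourcedPassage] | None:
    """Return vetted passages to ground the answer, or None to force a hedge or refusal."""
    vetted = [p for p in candidates if domain_allowed(p.url) and p.score >= min_score]
    return vetted if len(vetted) >= min_sources else None

# Example: a forum thread scores well on relevance but is filtered out by provenance.
candidates = [
    SourcedPassage("https://www.gov.uk/individual-savings-accounts", "ISA allowance ...", "2024-06-01T10:00:00Z", 0.82),
    SourcedPassage("https://randomforum.example/thread/123", "I think the allowance is ...", "2024-06-01T10:00:01Z", 0.91),
]
evidence = select_evidence(candidates)
if evidence is None:
    print("No whitelisted evidence found: return a hedged answer or refuse.")
else:
    for p in evidence:
        print(f"Cite: {p.url} (retrieved {p.retrieved_at})")  # provenance surfaced to the user
```

The design choice to return None rather than a best‑effort answer is deliberate: it is the code‑level equivalent of refusing a categorical answer when retrieval only finds low‑quality sources.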
Critical analysis: what auditors got right — and what remains uncertain
Notable strengths of the audits
- Real‑world prompts and human expert grading make these tests operationally relevant. Which?’s consumer prompts mirrored everyday questions non‑experts actually ask; the EBU/BBC newsroom audit used working journalists to stress‑test tasks that matter to civic life. That focus on real usage is more informative than synthetic benchmarks.
- Cross‑vendor comparison clarifies variance. The fact that Perplexity scored higher on consumer prompts while other large vendors trailed shows product design choices (retrieval, citation hygiene, conservative refusal behaviour) materially affect reliability.
Where caution is needed
- Snapshot limitations: model behaviour changes rapidly. Vendor updates, live‑web connectors, and policy tweaks can materially alter performance within weeks. The percentages reported are snapshots, not permanent rankings; treat them as indicative rather than definitive.
- Sampling and prompt sensitivity: evaluation depends on prompt phrasing and scoring rubrics. Different prompt wording or a different set of questions can change rankings; the studies do not claim to cover all possible use cases. That variability means organizations must test assistants against their own mission‑critical workflows rather than rely solely on third‑party rankings.
- Partial corroboration of some survey claims: media summaries sometimes compress multi‑source claims (for example, trust metrics and workplace behaviours) into neat soundbites. When precise percentages matter for policy decisions, consult the primary reports directly. If the original datasets or methodologies are not public, treat the numbers with caution and seek the underlying data.
- Unverifiable or unconfirmed citations: some widely circulated claims—especially single‑sentence statistics traced back through press coverage—lack a direct public dataset to reproduce them. Any such claim should be flagged and verified against the primary research before being relied on operationally.
How to think about “trust” in AI going forward
Trust is not a binary property of a model; it is a contextual, process‑driven judgement. A chatbot might be highly trustworthy for paraphrasing a company handbook when built on an internal index, but deeply untrustworthy for legal interpretation when pulling unsourced web pages. Good governance treats trust as the product of three things:
- Data quality and provenance (what sources were used and are they authoritative?)
- Process controls (are there human reviewers and audit logs?)
- User competency and training (are people taught to verify and challenge AI outputs?)
Final verdict and takeaway for WindowsForum readers
AI assistants are powerful drafting and triage tools: they speed research, draft documents, and make large volumes of information more approachable. Yet independent audits — consumer‑focused tests and large newsroom evaluations — consistently show that fluency does not equal truth. The most practical course for Windows users and IT teams is to treat AI replies as starting points, not conclusions.
- Use AI to accelerate routine work, but never to replace authoritative verification for legal, financial or medical matters.
- Configure enterprise AI to rely on curated internal sources and enforce human review where errors would be costly.
- Demand provenance, monitor outputs, and invest in training and clear policy—these operational steps reduce the real risks these audits expose.
Source: The Star | Malaysia 'Glaring errors': You still can't trust any AI answer, research shows