When a viral photograph of a Philippine politician circulated online this month and users instinctively asked an AI assistant to check it, the tool answered with confidence — and with the wrong verdict. That single misclassification, repeated across platforms and amplified by social sharing, crystallises a broader and growing problem: today’s multimodal chatbots are excellent at mimicking reality, but they are not yet reliable verifiers of it. The episode about Elizaldy Co — and a string of similar failures documented by journalists and researchers — exposes a structural blind spot in how large AI systems treat visual evidence, with implications for newsrooms, platforms, and everyday Windows users who increasingly treat chatbots as a first stop for verification.
Background
What happened with the viral photo
A widely shared image purported to show former lawmaker Elizaldy Co in Portugal. Social-media sleuths submitted it to a mainstream AI-enabled search assistant and asked whether it was real; the assistant answered that the photo appeared authentic. Investigative fact‑checkers subsequently traced the post to a Filipino web developer who said he had created the image “for fun” with an image‑generation tool connected to Google’s models. Journalists and news agencies reported the assistant’s misclassification as one of several recent cases in which chatbots failed to recognise AI‑generated imagery (sometimes produced by the same vendor’s models) and incorrectly vouched for its provenance.
Why this story matters now
AI assistants are no longer niche; they sit inside browsers, operating systems, and productivity apps and are becoming a preferred route for people seeking quick answers or verifying claims. When an assistant responds in a concise, confident tone, users often treat that output as authoritative — and share it. Independent, journalist‑led audits across multiple countries have found that assistants routinely present faulty or poorly sourced summaries of news. Those same systems are now being used to adjudicate visual authenticity, a task for which they were not primarily engineered. The combination — rising use plus systemic failure modes — creates a real civic and operational risk.
Overview: what independent audits and fact‑checks show
- A large, coordinated audit by the European Broadcasting Union (EBU) and the BBC found that roughly 45% of assistant replies to news queries contained at least one significant problem — errors that could materially mislead a user. The study reviewed thousands of responses across 14 languages and multiple products, and flagged sourcing, temporal staleness, and invented details as recurrent failure modes.
- Journalists and fact‑checkers at AFP and other outlets have documented multiple, high‑profile cases where assistants classified AI‑generated or heavily edited images as genuine; in some of those cases the images were later traced back to image‑generation tools. These reporting threads establish that the problem is cross‑platform and recurring, not a one‑off test failure.
- Academic field tests emphasise the same point: Columbia University’s Tow Center for Digital Journalism tested multiple chatbots on a verification task using photojournalist images and concluded that the models were unsuitable as standalone provenance checkers. The study found that even when an assistant can supply geolocation hints or contextual leads, it regularly fails to correctly identify origin, date, or caption provenance.
Taken together, these independent signals — editorial audits, news‑agency fact‑checks, and academic testing — build a consistent picture: current multimodal assistants can be helpful research aids but remain dangerously unreliable as final arbiters of image authenticity.
How and why multimodal chatbots fail at image verification
The optimisation mismatch: generation vs detection
Modern multimodal assistants combine three pieces: a visual encoder that converts pixels into internal representations, one or more retrieval layers that fetch supporting text or images, and a large language model (LLM) that reasons and formulates the answer. Critically, most of these components were trained to produce plausible language or images — to predict what looks and sounds right — not to prove provenance or surface forensic traces. The result is an objective misalignment:
- Generators are optimised for plausibility and photorealism.
- Vision encoders are tuned for description (“what is in this image?”), not for forensics (“was this image generated or manipulated?”).
- The LLM is tuned to be helpful and conversational, and product design incentives often reward completeness over cautious refusals.
This architecture prefers confident, fluent answers; it does not naturally produce the calibrated uncertainty or pixel‑level analysis that detection tasks demand.
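The mismatch described above can be caricatured in a few lines of Python. This is a deliberately simplified stub — every function name here is illustrative, not a real API — and the point is what the wiring omits: no stage ever asks whether the image was generated.

```python
# Illustrative stub of a typical multimodal assistant pipeline.
# Every function below is a toy stand-in; real systems use trained models.

def vision_encoder(image: bytes) -> list[float]:
    # Real encoders are optimised to describe scene content
    # (objects, setting), not to detect generation artifacts.
    return [float(b) for b in image[:4]]

def retrieve_evidence(embedding: list[float], question: str) -> list[str]:
    # Retrieval fetches text/images that look related to the query;
    # it does not establish where the image came from.
    return [f"web snippet loosely matching: {question}"]

def llm_generate(question: str, embedding: list[float], context: list[str]) -> str:
    # The language model is tuned to be fluent and helpful, so it
    # tends to produce a confident answer rather than a refusal.
    return f"Based on {len(context)} source(s), the image appears authentic."

def answer_image_query(image: bytes, question: str) -> str:
    """Wire the three stages together. Note what is absent: no stage
    is optimised to ask 'was this image generated or manipulated?'"""
    embedding = vision_encoder(image)
    context = retrieve_evidence(embedding, question)
    return llm_generate(question, embedding, context)
```

Run against any bytes at all, this stub confidently “verifies” the image — which is exactly the failure mode the audits describe: each stage optimises for plausibility, and nothing downstream supplies calibrated doubt.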
Training data and label gaps
Many models are trained on massive web scrapes that mix genuine photographs and synthetic images, but without consistent provenance labels. Without explicit supervision that separates "synthetic" from "authentic" during training, the model’s internal distribution conflates both. That makes downstream detection a weak signal unless the system is explicitly taught to search for and prioritise forensic traces.
UI and product incentives favour answers
Product teams favour assistants that answer rather than decline. In real‑world UIs, users prefer a shortcut: they submit an image and expect a fast verdict. Assistants calibrated toward user satisfaction therefore lean toward plausible assessments and may under‑report doubt — a dangerous behaviour for verification tasks. Independent audits found refusal rates in news Q&A to be vanishingly small, reinforcing the observation that assistants seldom default to "I don’t know."
Case studies that illustrate the blind spot
1) The Elizaldy Co image — a viral misclassification
A doctored image purporting to show a high‑profile former lawmaker in Portugal circulated widely; when users asked a major AI‑enabled search assistant whether it was real, the assistant said it appeared authentic. Investigative reporters subsequently traced the post to a web developer who created the picture with an image‑generation front end. The episode shows a simple chain of failure: a generated asset, a public query to a chatbot, and a confident but wrong answer that amplified an unverified claim. Journalists later confirmed the image’s origin; the assistant’s misstep had already been seen and shared by many. Caveat: public reporting credits AFP’s tracing of the image to a specific generator, but some platform‑level details (for example, server logs or the exact model fingerprint) are not independently available for public verification. Where original forensic claims rely on non‑public telemetry, they should be treated as reported findings rather than incontrovertible fact.
2) Staged protest imagery in a regional flashpoint
During unrest in a sensitive region, a torchlit march photo circulated online that, on surface inspection, looked convincing. AFP’s analysis attributed it to a generative pipeline; yet both Gemini and Microsoft Copilot were reported to have assessed the image as a real protest photograph. The event demonstrates how contextually rich visual cues — flags, torches, a crowd — can trick models whose reasoning prioritises scene plausibility. The risk here is acute: misclassified political imagery can reshape narratives and inflame tensions.
3) Academic verification tests: the Tow Center experiment
The Tow Center put seven chatbots through a dedicated image‑verification task using photos supplied by professional photojournalists. The performance was uniformly poor: models failed to correctly identify provenance for the test set, often inventing toolchains or asserting provenance with unwarranted confidence. The Tow Center concluded that while assistants can aid investigatory leads (geolocation clues, scene elements), they cannot replace the discipline and skepticism of trained human verifiers.
The technical limits and current partial remedies
Why pixel‑level forensics remain hard
Forensic detectors look for narrow statistical fingerprints: resampling artifacts, upscaling traces, compression residues, or model‑signature patterns embedded at generation time. Such detectors require targeted supervised training on labeled synthetic content and specialised architectures. General vision encoders — designed to summarise, caption, or identify objects — are not optimised for those signals; nor do they typically have access to provenance metadata. That is why a general assistant, without explicit forensic supervision, will often return a plausibility judgement rather than a forensically grounded verdict.
Watermarks, metadata standards and the role of SynthID / C2PA
Commercial and standards‑level responses are beginning to appear. One practical mitigation is embedded provenance: Google’s SynthID (an imperceptible watermark) and C2PA content credentials embed metadata and digital signals that can later be checked for origin. Google has recently integrated SynthID checks into the Gemini app so users can upload images and ask whether they were generated by Google AI; company statements claim billions of items have been watermarked since 2023. Such schemes can be effective inside vendor ecosystems but are limited when content is produced by other tools or deliberately stripped of metadata. They provide a promising, but partial, fix — useful for images generated within a given vendor’s pipeline but not for a heterogeneous internet of generative tools. Limitations to note:
- Watermarks only help if they exist and remain intact.
- Third‑party or adversarial generators will not carry a vendor’s watermark.
- Metadata can be removed, altered, or lost during reposting and compression.
These constraints mean that watermarking and C2PA are complementary tools, not comprehensive solutions.
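The fragility of metadata-based provenance is easy to demonstrate. The Python sketch below is an illustrative toy, not a forensic tool: it checks whether a JPEG byte stream still carries an EXIF APP1 segment. A typical re-encode or platform repost produces a file on which this check fails, because the segment is simply gone.

```python
def has_exif_segment(jpeg_bytes: bytes) -> bool:
    """Return True if a JPEG byte stream contains an EXIF APP1 segment.

    Minimal illustrative parser: walks JPEG marker segments from the
    start-of-image marker (0xFFD8) looking for an APP1 (0xFFE1) segment
    whose payload begins with the 'Exif' identifier.
    """
    if not jpeg_bytes.startswith(b"\xff\xd8"):
        return False  # not a JPEG stream
    i = 2
    while i + 4 <= len(jpeg_bytes):
        if jpeg_bytes[i] != 0xFF:
            break  # malformed stream; stop scanning
        marker = jpeg_bytes[i + 1]
        if marker == 0xDA:  # start of scan: no more metadata segments
            break
        # Segment length field includes its own two bytes.
        length = int.from_bytes(jpeg_bytes[i + 2:i + 4], "big")
        if length < 2:
            break  # invalid length; avoid looping forever
        if marker == 0xE1 and jpeg_bytes[i + 4:i + 8] == b"Exif":
            return True
        i += 2 + length
    return False
```

A verifier can use this kind of check only as a negative signal: a missing EXIF or C2PA block proves nothing by itself, since honest reposts strip metadata too. Its presence merely adds one thread worth pulling.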
The human factor: staffing, incentives, and platform policy
The retreat of professional third‑party fact checking
At the same time that AI tools are sitting front and centre in user workflows, several major platforms have been restructuring or scaling back human fact‑checking programs. Those policy shifts transfer greater responsibility to algorithmic systems or to crowdsourced community moderation models — neither of which reliably substitutes for trained verification teams. The scaling back of professional checks increases the risk that chatbot misclassifications will go uncorrected or be corrected only after wide circulation.
Why human fact‑checkers still matter
Experienced fact‑checkers bring contextual, archival, linguistic, and cultural expertise; they can interpret ambiguous visual cues, consult primary sources, and apply rigorous provenance checks (reverse image searches, metadata analysis, geolocation). Researchers and newsroom leaders stress that AI can accelerate the work of human verifiers — surfacing leads, suggesting geolocation candidates, and flagging inconsistencies — but it cannot supplant the judgment and audit processes of trained professionals.
Risks for Windows users, IT teams and news consumers
- Rapid amplification of misinformation: A single misclassified image returned by an assistant can be copy‑pasted across social platforms, becoming "evidence" before any human review happens. That accelerant is especially dangerous in breaking or political stories.
- Enterprise reputational risk: Organisations that rely on assistants for triage or communications risk inadvertently propagating false claims if replies are accepted without verification. Legal, PR and compliance teams should treat AI outputs as tentative.
- Security and social escalation: Mislabelled protest imagery or doctored official statements can inflame real‑world tensions. Where images are used as triggers for protests, arrests, or policy debates, the stakes are not merely reputational.
Practical guidance: how to use chatbots for verification — safely
- Treat assistant output as a starting point, not a final verdict.
- Always cross‑check with at least two independent sources (primary reporting, archival databases, or multiple vendor checks). For news Q&A, prefer journalistic toolkits and public‑service resources.
- For images:
- Run reverse‑image searches across multiple engines.
- Inspect metadata where available (EXIF, C2PA credentials).
- Use specialised forensic tools trained to detect generation artifacts.
- If the assistant claims provenance, ask for evidence: exact URLs, time stamps, metadata excerpts; do not accept unaudited assertions.
- Keep an audit trail: record queries, screenshots, and the assistant’s full answer. For enterprise environments, log model versions and timestamps.
- When uncertain, refuse amplification: avoid sharing images that lack verifiable provenance until human review is complete.
These steps reduce the chance that a single erroneous assistant reply becomes a viral misinformation event.
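The audit-trail step above can be made concrete with a small helper. This is a hedged sketch — the field names and JSON shape are assumptions, not a standard — that hashes the exact image bytes, records the query, the assistant's full answer and the model version, and defaults the status to "unverified" until a human review clears it.

```python
import hashlib
import json
from datetime import datetime, timezone

def make_audit_record(image_bytes: bytes, query: str,
                      assistant_answer: str, model_version: str) -> str:
    """Build a JSON audit record for one verification query.

    The image is identified by its SHA-256 hash so the record can be
    matched to the exact file later, even if the file is renamed or
    re-shared along the way.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "query": query,
        "assistant_answer": assistant_answer,
        "model_version": model_version,
        "status": "unverified",  # stays unverified until human review
    }
    return json.dumps(record, indent=2)
```

In an enterprise setting these records would be appended to tamper-evident storage; the sketch only shows the minimum fields that make a chatbot reply auditable after the fact.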
What vendors, platforms and Windows integrators should do
- Expose provenance and retrieval signals in the UI: make it easy to view the exact evidence used to generate an answer rather than presenting reconstructions alone.
- Offer conservative “verified‑mode” defaults for public‑interest queries that prioritise provenance and refusal over speculative answers.
- Integrate and surface watermark/C2PA checks (SynthID and similar) where possible, while making the limits of those checks explicit to users.
- Implement human‑in‑the‑loop gates for high‑risk topics (public safety, election reporting, law enforcement).
- Provide auditable logs to enable post‑hoc verification and redress when mistakes occur.
For Windows ecosystem partners, these design choices translate into safer default integrations for Copilot and search‑centric experiences within Edge and the OS.
Longer‑term fixes and research directions
- Invest in forensic supervision: build diverse, high‑quality datasets of synthetic and manipulated images with provenance labels to train detectors that can generalise across generator families.
- Improve cross‑model detection: collaborate on open standards for embedding robust provenance (C2PA, SynthID‑style signals), and fund third‑party verification services.
- Expand transparency: researchers need access to model retrieval logs and grounding materials under controlled conditions to audit and reduce systematic biases.
These research investments are commercially and civically valuable; they are the only path to scaling reliable verification beyond ad‑hoc vendor heuristics.
Conclusion
The Elizaldy Co episode and the broader pattern of misclassification are not mere technical curiosities; they are a practical, present‑day hazard created by a gap between what AI systems were trained to do (produce plausibly real text and images) and what society now asks them to deliver (prove what is real). Independent audits, newsroom fact‑checks, and academic tests paint a consistent picture: multimodal chatbots can help investigators find leads and surface clues, but they are not yet dependable verifiers of visual provenance. The immediate remedy is operational: treat assistants as research helpers, preserve professional human verification for sensitive public‑information tasks, and deploy provenance checks and watermarking where possible. Vendors must accept that trust requires transparency and conservative failure modes; platforms must recognise that scaling back human fact‑checking without robust machine‑level and policy safeguards hands a powerful amplifier to systems that still make consequential mistakes. For Windows users, IT teams, and newsrooms, the message is straightforward and urgent: use AI for speed — but verify with care.
Source: Malay Mail, “Fact-checking with chatbots? Filipinos just found out why that’s a bad idea”