Generative AI chatbots are increasingly producing polished but fictional research records — invented journal titles, bogus DOIs, and archival catalogue numbers — a problem the International Committee of the Red Cross has publicly warned about and that recent audits and peer‑reviewed tests confirm is widespread enough to demand immediate policy and engineering responses.
Background
Generative large language models (LLMs) such as ChatGPT, Google Gemini and Microsoft Copilot are designed to produce fluent, human‑like text by predicting likely continuations of input. That design excels for drafting, brainstorming and summarizing, but it also creates a structural failure mode: when asked to deliver research‑grade citations or archival references the models frequently produce plausible‑looking but non‑existent sources. The International Committee of the Red Cross (ICRC) explicitly cautioned that chatbots “may generate incorrect or fabricated archival references,” noting that LLMs can invent catalogue numbers, document descriptions and even platforms that never existed.
This phenomenon — often called citation fabrication or bibliographic hallucination — is not merely anecdotal. Controlled experiments and multi‑market audits show non‑trivial rates of fabrication in LLM outputs, and institutions reliant on archival accuracy report growing operational burdens verifying and disproving AI‑produced leads.
Why this matters now
For professionals who work with verifiable records — archivists, librarians, researchers, journalists, lawyers and IT managers — the stakes are tangible:
- Operational cost: Archivists and reference desks are spending staff hours chasing unverifiable leads flagged by AI, diverting time from legitimate requests.
- Scholarly integrity: Fabricated citations can slip into student papers, preprints and literature reviews, polluting the scholarly record and misdirecting future research.
- Legal risk: Courts have already punished litigants and lawyers who submitted filings citing non‑existent authorities generated by AI, illustrating how hallucinations translate into professional and financial liability.
- Editorial harm: Newsrooms and publishers face increased fact‑checking costs; editorials or briefs using unverified AI outputs risk reputational damage.
The evidence: what audits and studies actually found
Peer‑reviewed experiment: citation fabrication in literature reviews
A controlled study prompting GPT‑4o to generate literature reviews on mental‑health topics extracted 176 bibliographic citations and attempted to verify each entry across major scholarly indexes. The authors found that 35 citations (19.9%) were fabricated — no identifiable source — and that many of the remaining citations contained bibliographic errors (invalid DOIs, wrong pages, incorrect dates). The study concluded that fabrication rates are higher for less visible or highly specialized topics.
Newsroom audit: multi‑market testing of assistants
An audit coordinated by major public broadcasters (including the EBU and BBC) put several mainstream assistants through news‑style prompts and found roughly 45% of replies contained at least one significant issue; sourcing failures and temporal staleness were primary contributors. The audit’s editorial realism — professional journalists and blind review — underscores that assistants cannot be treated as single‑step replacements for newsroom verification.
Institutional reports and operational observations
Archives and state libraries are reporting rising volumes of reference queries that appear to originate from AI tools; one widely reported figure cited a state archive estimating about 15% of emailed reference questions originated from ChatGPT. That number is directional and reported in journalistic accounts rather than as a national census, but it reflects a clear operational trend: archivists increasingly triage “phantom” queries that require disproving non‑existence.
Why LLMs invent sources: a technical primer
LLMs are statistical sequence predictors optimized to maximize fluent and contextually appropriate continuations. Two design properties explain bibliographic hallucinations:
- Generative objective and “always‑answer” bias: Models are typically tuned to be helpful and avoid refusals. When prompted for source‑level facts and no reliable grounding exists, the model fills the gap with plausible strings (author names, journal formats, DOIs, catalogue numbers) rather than saying “I don’t know.” This makes fabrication a structural behavior rather than an isolated bug.
- Retrieval vs. generation mismatch: Many assistants either lack a robust retrieval layer or attach provenance post‑hoc. When retrieval is weak or stale, the generator still synthesizes an authoritative answer; if citation metadata is assembled after generation, systems can produce ceremonial citations that do not support the claim. Retrieval‑augmented generation (RAG) reduces but does not eliminate the risk, since retrieval quality and index freshness determine grounding fidelity (a minimal retrieval‑first sketch appears at the end of this primer).
In practice it helps to distinguish two failure types in model output:
- Errors — mistaken facts about real items (e.g., wrong year).
- Hallucinations — confidently asserted items that do not exist at all (e.g., made‑up journal names).
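To make the retrieval‑first point concrete, here is a minimal sketch of how a grounded query path can constrain generation. It is an illustration only: search_index and call_model are hypothetical stand‑ins for whatever search backend and model client a product actually uses, and the prompt wording is an assumption, not a vendor recipe.

```python
# Minimal retrieval-first sketch: the model may only cite passages that were
# actually retrieved, and must say "NOT FOUND" instead of inventing a source.
# `search_index` and `call_model` are hypothetical stand-ins for a real search
# backend and LLM client.

def answer_with_grounding(question: str, search_index, call_model, k: int = 5) -> str:
    passages = search_index(question, top_k=k)  # e.g. [{"id": "doc-12", "text": "...", "url": "..."}]
    if not passages:
        # Nothing to ground on: refuse rather than synthesize an authoritative-sounding answer.
        return "No grounded sources were retrieved; declining to answer rather than guessing."

    context = "\n\n".join(f"[{p['id']}] {p['text']} (source: {p['url']})" for p in passages)
    prompt = (
        "Answer using ONLY the passages below. After every claim, cite the passage id "
        "in square brackets. If the passages do not support an answer, reply exactly "
        "NOT FOUND.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )
    return call_model(prompt)
```

Even with this structure, grounding fidelity still depends on the quality and freshness of the index, which is why the audits described above find residual errors even in retrieval‑backed products.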
Real‑world impacts and documented harms
Archives and libraries: the “prove‑negative” burden
A fabricated archival citation that points to a “unique primary source” forces archivists to search accession registers, finding aids and uncatalogued collections — time‑consuming tasks with no guaranteed closure. Institutions are reacting by requiring AI‑use disclosure for reference requests, asking for prompts or raw outputs, and setting time budgets for verification.
Academia: contamination of the scholarly record
Students and scholars using AI without verification can submit work with fabricated bibliographies. Peer reviewers who accept unverified references risk amplifying false claims. Several journals and universities already recommend that AI‑generated citations be verified and logged prior to submission.
Law and compliance: sanctions and malpractice risk
Courts have fined lawyers and admonished counsel who submitted briefs referencing nonexistent cases produced by chatbots. The legal sector shows how AI hallucinations can carry immediate, enforceable consequences — a cautionary tale for regulated professions where citation fidelity is non‑negotiable.
Newsrooms and public information
Audit results show assistants can misattribute quotes, invert guidance, or present outdated facts as current — errors that can mislead audiences at scale if adopted uncritically into editorial workflows. The EBU/BBC testing highlights how assistant outputs often require newsroom‑level verification before publication.
Vendor responses and product design tradeoffs
Vendors differ in how they integrate retrieval, provenance and enterprise controls. Key product choices affect hallucination risk:
- Retrieval‑first vs. post‑hoc citation assembly: Systems that condition generation on retrieved, timestamped documents and attach provenance materially reduce fabrication risk. Post‑hoc approaches that "invent" citations after drafting are more vulnerable.
- Provenance exposure: Tools that surface machine‑readable provenance (retrieved snippets, DOIs, WorldCat links) make it easier for users to verify claims; omission of provenance promotes blind trust.
- Enterprise grounding: Copilot‑style products that ground answers in tenant data (Graph, Purview, internal corpora) can be more reliable for organization‑specific queries, but grounding only solves part of the problem if public‑facing reference queries still rely on web retrieval.
Practical guidance: a playbook for researchers, IT teams and archives
The single most important rule is straightforward: never accept an AI‑generated citation at face value. Treat model outputs as leads, not evidence. Below is a compact, actionable playbook.
For individual researchers and students
- Verify every AI‑suggested citation against CrossRef, PubMed, Scopus, WorldCat or publisher pages before inclusion.
- Use AI to generate search terms, not final bibliographies; record prompts, model version and timestamps for auditability.
- If a DOI or identifier is provided, resolve it immediately; invalid or mis‑resolved DOIs are a common sign of fabrication.
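As an illustration of that last point, a registered DOI normally redirects when resolved through doi.org, while a fabricated one typically returns a 404. The snippet below is a rough first‑pass check (it assumes the requests package and ordinary network access); a DOI that does resolve still needs to be opened to confirm the title and authors actually match the citation.

```python
# Quick DOI sanity check: registered DOIs redirect from doi.org; unregistered ones do not.
# This only tests that the identifier exists - it does not prove the cited title/authors match.
import requests

def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    resp = requests.head(f"https://doi.org/{doi.strip()}", allow_redirects=False, timeout=timeout)
    return 300 <= resp.status_code < 400  # a redirect means the DOI is registered

if __name__ == "__main__":
    suspect = "10.1234/example-doi"  # replace with the DOI the chatbot supplied
    print(suspect, "resolves" if doi_resolves(suspect) else "does NOT resolve")
```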
For librarians and archivists
- Require requesters who present AI‑generated citations to include the raw prompt and model output. Set a published time budget for verification and triage unverifiable claims accordingly.
- Pilot internal RAG tools and automated DOI/metadata checkers to accelerate triage.
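One way to automate part of that triage is to ask CrossRef whether any indexed work resembles the citation a requester supplies. The sketch below uses CrossRef's public REST API; coverage is far from complete (books, grey literature and archival items are often missing), so a weak match is a flag to investigate rather than proof of fabrication, and the contact address in the User‑Agent is a placeholder.

```python
# Look up a suspect citation in CrossRef and return the closest indexed works.
# Absence of a good match is a signal to dig deeper, not definitive proof of fabrication.
import requests

def crossref_candidates(citation_text: str, rows: int = 5) -> list[dict]:
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation_text, "rows": rows},
        headers={"User-Agent": "citation-triage/0.1 (mailto:reference-desk@example.org)"},  # placeholder contact
        timeout=15,
    )
    resp.raise_for_status()
    results = []
    for item in resp.json()["message"]["items"]:
        results.append({
            "title": (item.get("title") or ["(untitled)"])[0],
            "doi": item.get("DOI"),
            "year": (item.get("issued", {}).get("date-parts") or [[None]])[0][0],
        })
    return results

# Example: print candidate matches for a citation pasted from a chatbot transcript.
# for c in crossref_candidates("Author, A. (2021). Example title. Journal of Examples."):
#     print(c)
```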
For universities, journals and publishers
- Mandate that any AI‑aided references be human‑verified prior to submission; consider automated DOI resolution gates in submission systems.
- Require authors to document AI use (prompts, model version) and to supply verification logs if requested.
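What a "verification log" contains is not standardized; the sketch below shows one possible per‑reference record. The field names are illustrative, not a journal standard — the point is to capture what was checked, against which indexes, by whom, and with what outcome, so an editor can audit it later.

```python
# Illustrative per-reference verification log entry; field names are an assumption.
import json
from datetime import datetime, timezone

log_entry = {
    "reference": "Author, A. (2021). Example title. Journal of Examples, 12(3), 45-67.",
    "doi_supplied": "10.1234/example-doi",      # as given by the author or AI tool
    "ai_assisted": True,
    "model_reported": "model name/version disclosed by the author",
    "checked_against": ["doi.org", "CrossRef", "WorldCat"],
    "doi_resolves": False,
    "metadata_matches": False,
    "verdict": "could not verify; returned to author",
    "checked_by": "staff initials or id",
    "checked_at": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(log_entry, indent=2))
```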
For IT managers and enterprise teams (especially in Windows / Microsoft 365 environments)
- Choose enterprise plans that include non‑training assurances, data residency controls, and tenant grounding features for Copilot where available.
- Implement human‑in‑the‑loop sign‑off for any public communication, legal filing, or regulatory document that relies on AI outputs.
- Integrate automated DOI/CrossRef resolution and provenance capture into downstream workflows (document generation, reporting, knowledge bases).
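As a sketch of what provenance capture can look like downstream, the record below attaches source, snippet and verification status to each AI‑assisted claim before it enters a report or knowledge base. The schema is an assumption for illustration, not a Microsoft 365 or vendor format.

```python
# Sketch of a provenance record attached to each AI-assisted claim in a downstream
# workflow. Field names are illustrative, not a vendor schema; the goal is that
# every claim can be re-verified later without re-running the assistant.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Provenance:
    claim: str                              # the sentence or figure the assistant produced
    source_url: Optional[str] = None        # URL or DOI of the retrieved document, if any
    source_snippet: Optional[str] = None    # the passage the claim was grounded on
    retrieval_confidence: Optional[float] = None
    verified: bool = False                  # set True only after a human or automated check passes
    captured_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def attach_provenance(document: dict, records: list[Provenance]) -> dict:
    """Store provenance alongside the generated document (report, wiki page, knowledge-base entry)."""
    document["provenance"] = [asdict(r) for r in records]
    return document
```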
Technical mitigations developers should adopt
Product and engineering teams can materially reduce hallucination risk by defaulting to conservative, retrieval‑first designs:
- RAG by default for factual queries: Make retrieval the default and require the model to cite exact retrieved passages and identifiers.
- Machine‑readable provenance: Attach provenance metadata (source snippet, URL/DOI, timestamp, retrieval confidence) to each factual claim.
- Built‑in verification APIs: Before presenting citations, resolve DOIs and cross‑check metadata against CrossRef, WorldCat, PubMed or publisher APIs.
- Conservative refusal modes: For high‑risk queries (legal, medical, archival uniqueness claims), default to refusal or to returning a verifiable list of candidate sources rather than invented citations.
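Taken together, those mitigations amount to a "verify before you present" gate. The sketch below shows only the decision logic: check_citation stands in for whichever resolver a team wires in (DOI resolution, CrossRef metadata match, an internal catalogue), and the fallback returns honest search leads instead of an invented reference.

```python
# "Verify before you present": every candidate citation must pass an external
# check before it is shown; otherwise the system degrades to an honest fallback
# (a refusal plus search leads) rather than a fabricated reference.
# `check_citation` is a stand-in for whatever resolver the team integrates.
from typing import Callable

def present_citations(candidates: list[dict], check_citation: Callable[[dict], bool]) -> dict:
    verified = [c for c in candidates if check_citation(c)]
    if verified:
        return {"status": "ok", "citations": verified}
    return {
        "status": "unverified",
        "message": "No citation could be verified against an external index.",
        "suggested_searches": [c["title"] for c in candidates if c.get("title")],
    }
```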
Governance, procurement and policy implications
The ICRC advisory is a policy signal that should push institutions to update procurement and governance:
- Procurement contracts should require verifiability guarantees, prompt/response retention for audit, and the ability to verify vendor claims about provenance.
- Professional bodies (bar associations, editorial boards, medical boards) should clarify that AI usage does not shift responsibility for accuracy: practitioners remain accountable. Courts and regulatory bodies are already enforcing this principle in some jurisdictions.
- Institutions should publish clear AI‑use policies for research and reference services that include verification checklists and acceptable time budgets for invoiceable or recoverable verification labor.
Strengths and controlled uses: where AI still helps
Despite the risks, generative AI remains a powerful tool when employed with controls:
- Rapid synthesis: Models accelerate initial literature scans and surface keywords, related authors and search paths that can save human researchers time.
- Accessibility: AI can help non‑specialists get quick orientations to complex topics and draft clearer prose.
- Discovery assistance: Even when citations are imperfect, the model’s suggested search terms and topical structure can direct human investigators to legitimate sources.
What remains uncertain — and what to watch
Key open questions and early signals to monitor:
- Prevalence variability by domain and model version: Fabrication rates vary with model version, prompts, and domain familiarity; ongoing audits and fresh studies are necessary to track improvement. The Deakin/JMIR experiment and the EBU/BBC audit are snapshots that show the problem persists as of their test windows.
- Vendor upgrades and defaults: Watch vendor rollouts that attach machine‑readable provenance and implement DOI/metadata verification gates. Broad adoption would reduce the verification burden dramatically.
- Institutional policy adoption: If publishers, funders, and bar associations require verification logs and AI‑use disclosures, the operational incentives will align toward safer usage.
Conclusion
The ICRC’s warning is an operational alarm, not a speculative headline: mainstream generative assistants are producing authoritative‑sounding yet fabricated research records at rates high enough to impose real costs on archives, legal teams, publishers and academics. Peer‑reviewed experiments and multi‑market audits corroborate the basic diagnosis: fluency is not the same as veracity.
The path forward is practical and layered. Vendors must make retrieval‑first designs, provenance and verification APIs the default for source‑level queries; institutions must require human verification, update procurement contracts and adopt transparent AI‑use policies; and individual researchers, students, librarians and IT managers must treat AI outputs as leads to be verified, not as authoritative sources. Where those disciplines converge, AI can deliver its productivity benefits without eroding the foundations of research, law and public information.
The operational takeaways are clear: verify every citation, demand provenance, and treat AI as an assistant — not a substitute — for primary‑source research.
Source: NDTV Profit, “ChatGPT, Gemini, Copilot, Others Generating Research Papers, Journals That Don’t Exist: Red Cross”