Guarding Research Integrity: AI Generated Citations and Mitigation

The International Committee of the Red Cross is now explicitly warning against generative‑AI chatbots that invent entire research records: fabricated journal titles, bogus archive call numbers and non‑existent papers. This failure mode threatens research integrity, imposes real operational costs on archives and libraries, and creates new legal and governance headaches for institutions that accept AI as a research assistant. The ICRC’s advisory, amplified in mainstream coverage, crystallises a broader pattern found in controlled audits and academic tests: large language models (LLMs) can and do produce confidently phrased citations that have no basis in any primary source, and relying on those outputs without verification can cause reputational, financial and legal damage.

Background / Overview​

Large language models — the engines behind ChatGPT, Google Gemini, Microsoft Copilot and similar assistants — are powerful text synthesizers that generate fluent prose by predicting likely continuations of input prompts. That design makes them excellent at drafting, summarizing and brainstorming, but it also creates a structural weakness: when asked for factual, source‑level information they are not guaranteed to consult primary records, and unless specifically connected to a verified retrieval layer they will invent plausible‑looking details to satisfy the user’s demand. The ICRC warned that such systems “may generate incorrect or fabricated archival references,” and stressed that AI systems do not undertake research, cross‑check references or verify veracity — they generate content.

This problem — often termed citation fabrication or bibliographic hallucination — is not theoretical. Multiple audits and peer‑reviewed experiments show non‑trivial rates of fabricated or erroneous citations, and archivists report increasing workloads as they triage AI‑originated reference requests that point to items that do not exist. The practical consequence is simple but severe: a polished, authoritative‑sounding citation produced by an AI can mislead researchers, students, journalists and practitioners, and in some domains (law, medicine, government contracting) the consequences are immediate and enforceable.

Why LLMs Invent Sources (a concise technical primer)​

The generative objective and "always‑answer" behaviour​

LLMs are trained to maximize the probability of producing fluent continuations given an input prompt. They are not search engines unless paired with a retrieval system. When factual material is missing, a model optimized for helpfulness and fluency will still output an answer rather than admit ignorance; that tendency turns missing data into invented data — plausible strings that mimic bibliographic forms, DOIs, call numbers and journal styles. This is a structural design property, not merely a tuning bug.

Hallucinations vs. ordinary errors​

  • Hallucinations: new, confidently stated claims that have no grounding in reality (e.g., a non‑existent journal title or fabricated archive item).
  • Errors: incorrect facts about real items (e.g., wrong year, misordered author list).
Both are harmful, but hallucinations are uniquely pernicious because they create entities that require human experts to disprove — an expensive, time‑consuming effort for archives and reference desks.

Retrieval‑augmented generation (RAG) reduces but does not eliminate risk​

RAG architectures — where the model conditions its output on retrieved, indexed documents — can sharply reduce hallucinations when the retrieval index is curated and fresh. But many consumer and even enterprise deployments still rely on post‑hoc citation assembly or incomplete retrieval, leaving room for invented citations to slip through. The correct engineering response is retrieval‑first designs with provenance attached to each claim.
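As a minimal sketch of that retrieval‑first pattern, the loop below retrieves before generating, attaches provenance to each passage, and refuses when nothing relevant comes back. The `search_index` and `generate_answer` functions here are hypothetical placeholders standing in for a real search backend and a real model client, not any vendor's actual API.

```python
# Minimal retrieval-first (RAG-style) loop: retrieve, then generate, with
# provenance carried alongside the answer. Retriever and model call are stubs.
from dataclasses import dataclass

@dataclass
class Passage:
    doc_id: str     # identifier of the source document in the curated index
    snippet: str    # exact retrieved text the answer will be grounded in
    score: float    # retrieval confidence

def search_index(query: str, top_k: int = 5) -> list:
    """Placeholder retriever over a curated, fresh index; swap in a real backend."""
    return []  # stub

def generate_answer(question: str, context: list) -> str:
    """Placeholder model call; a real deployment conditions the LLM on context."""
    return "(grounded answer)"

def answer_with_provenance(question: str, min_score: float = 0.6) -> dict:
    passages = [p for p in search_index(question) if p.score >= min_score]
    if not passages:
        # Conservative behaviour: refuse rather than let the model invent.
        return {"answer": None, "note": "No sufficiently relevant sources found."}
    return {
        "answer": generate_answer(question, context=passages),
        # Provenance travels with the claim, not reconstructed afterwards.
        "sources": [{"doc_id": p.doc_id, "snippet": p.snippet} for p in passages],
    }
```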

Evidence: What the audits and studies actually found​

Peer‑reviewed experiment: GPT‑4o and mental‑health literature reviews​

A controlled study published in JMIR Mental Health tested GPT‑4o (Omni) by prompting it to generate six literature reviews on three mental‑health topics with varying research maturity. Researchers extracted 176 bibliographic citations and verified each against Google Scholar, PubMed, Scopus, WorldCat and publisher databases. The results were stark: 35 citations (19.9%) were fabricated (no identifiable source), and among the remaining 141 citations nearly half contained bibliographic errors (invalid DOIs, wrong page ranges, incorrect dates). Overall, only 43.8% of generated citations were both real and accurate. The fabrication rate was higher for less‑visible topics and for specialized prompts. These findings demonstrate that citation fabrication remains widespread even in modern models and that its prevalence varies by topic familiarity.

Newsroom audit: EBU/BBC multi‑market testing​

An audit coordinated by the European Broadcasting Union with the BBC and multiple public broadcasters tested mainstream assistants on news queries. The audit found that roughly 45% of assistant replies had at least one significant issue, with sourcing failures and temporal staleness among the most common faults. The audit’s operational framing — using professional journalists in real editorial conditions — showed that assistants could not be treated as a single‑step replacement for newsroom verification.

Institutional reporting: archives and reference desks​

Archivists and state libraries report a measurable uptick in reference requests generated by AI. One journalistic report cited a state archive estimating about 15% of emailed reference questions originated from ChatGPT; staff often expend disproportionate time chasing unverifiable leads that point to fabricated items. The ICRC’s public advisory emerged in this operational context: archives are now triaging a new class of “phantom” research queries. Note that the 15% figure appears in journalistic accounts rather than an official census and should be treated as directional rather than definitive.

Legal fallout: fabricated authorities in court filings​

Courts globally have documented dozens of filings citing non‑existent case law and legal authorities, sometimes leading to sanctions, fines and professional discipline for counsel. A public tracker and several judicial opinions show a clear pattern: unverified AI‑sourced legal citations have immediate consequences because courts demand verifiable precedent. These legal incidents are among the most concrete examples of how hallucinated citations translate into reputational and financial risk.

Real‑world impacts and operational costs​

  • Archivists and librarians spend staff hours investigating non‑existent catalogue numbers and primary documents — a resource drain that diverts attention from genuine research assistance.
  • Academics and students risk contaminating the scholarly record when fabricated citations slip into theses, preprints or peer‑review submissions.
  • Law firms and litigants can be sanctioned or fined for submitting filings that cite fabricated authorities.
  • Consultants and government contractors can suffer reputational and contractual fallout when deliverables include AI‑generated erroneous citations (previous vendor incidents have resulted in corrected reports and refunds).
  • Newsrooms and publishers face elevated fact‑checking burdens; AI outputs cannot be republished without rigorous editorial verification.
These costs are not hypothetical. Courts have imposed monetary penalties; archives have updated intake policies; publishers and consultancies have revised quality‑assurance processes after discovery of AI‑driven fabrications. The sustaining risk is not only lost time — it is the erosion of trust in institutional outputs.

Vendor capabilities, product design and where Microsoft Copilot / Gemini / ChatGPT fit in​

Modern assistant ecosystems differ in how they integrate retrieval, provenance and enterprise controls:
  • Retrieval and provenance vary by provider. Tools that surface verifiable DOIs, WorldCat entries or publisher links with provenance metadata reduce risk; those that reconstruct citations after generation increase it. Vendors have been pressured to make RAG, provenance trails and conservative citation modes standard for factual tasks (an illustrative provenance record follows this list).
  • Enterprise controls (non‑training contracts, admin governance, tenant grounding) matter for corporate and regulated use. Microsoft’s Copilot and some enterprise offerings emphasise tenant grounding via Graph and Purview; other vendors offer enterprise non‑training options that limit data reuse. Still, technical differences and contract terms influence whether outputs can be relied upon in formal deliverables.
  • Product tradeoffs frequently prioritise conversational usefulness over epistemic caution. Unless vendors default to conservative or refusal behaviors for high‑risk queries, hallucinations will continue to emerge in real work contexts.
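To make provenance metadata concrete, the record below shows one plausible shape for a machine‑readable citation carrying its retrieval trail. The field names and the placeholder DOI are assumptions for illustration, not any vendor's actual schema; the essential property is that every claim links to a resolvable identifier plus the exact snippet the model conditioned on.

```python
# Illustrative (assumed) shape of a provenance-carrying citation record.
citation_record = {
    "claim": "Fabrication rates were higher for less-visible topics.",
    "source": {
        "doi": "10.0000/placeholder",   # resolvable identifier (placeholder value)
        "title": "Example Study Title",
        "retrieved_snippet": "...exact text the model conditioned on...",
        "retrieved_at": "2025-01-01T00:00:00Z",
        "index": "curated-corpus-v1",   # which curated corpus supplied the snippet
    },
}
```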

Practical guidance for researchers, IT leaders and archivists​

The single best rule is: never accept an AI‑generated citation at face value. More detailed, pragmatic steps:
  • For individual researchers and students:
      • Verify every AI‑suggested citation against primary bibliographic services (CrossRef, PubMed, Scopus, WorldCat, publisher pages); a first‑pass triage script is sketched after this list.
      • Treat model outputs as leads, not evidence. Use generated lists of search terms and authors to accelerate manual searches.
      • Record prompts, model version and timestamps for auditability in case provenance is questioned later.
  • For librarians and archives:
      • Require requesters who submit AI‑generated citations to provide the raw prompt and the model output.
      • Set time budgets for verification and publish clear service limits for chasing unverifiable claims.
      • Deploy or pilot RAG‑backed internal reference tools to triage likely valid leads faster.
  • For IT managers and enterprise teams:
      • Choose enterprise plans that include non‑training assurances and data residency guarantees for regulated data.
      • Ground copilots in tenant content where possible (Graph/Purview or equivalent enterprise connectors).
      • Implement human‑in‑the‑loop sign‑off for any publication, legal filing or public communication that relies on AI outputs.
  • For educators, journals and publishers:
      • Require authors to certify that any AI‑generated references were human‑verified prior to submission.
      • Consider automated DOI resolution gates in submission systems to block or flag unverifiable references.
      • Teach students that fluency ≠ authority; require primary‑source verification in assignments.
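As the first‑pass triage mentioned above, the snippet below checks a single AI‑suggested citation by attempting DOI resolution against the public CrossRef REST API (api.crossref.org). It is a sketch, not a complete verifier: a DOI that does not resolve is a strong fabrication signal, while a resolving DOI with a matching title is only a lead, and real workflows should still consult PubMed, WorldCat and publisher pages.

```python
import requests  # third-party HTTP client (pip install requests)

def check_doi(doi: str, claimed_title: str) -> str:
    """First-pass triage of one AI-suggested citation via a CrossRef DOI lookup."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code == 404:
        return "DOI does not resolve: strong fabrication signal"
    resp.raise_for_status()
    real_title = (resp.json()["message"].get("title") or [""])[0]
    # Crude containment check; a real pipeline would use fuzzy title matching.
    if claimed_title.lower().strip() in real_title.lower():
        return f"DOI resolves and title matches: {real_title}"
    return f"DOI resolves to a different work (possible mis-attribution): {real_title}"

# Example triage of a placeholder reference (values are illustrative only):
# print(check_doi("10.2196/00000", "Some claimed article title"))
```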

Technical remedies vendors should adopt (and why they matter)​

  • Expose provenance and retrieval trails natively in replies (not as reconstructed citations afterwards). Provenance must show the exact documents and snippets used to support claims.
  • Default to conservative or refusal modes for source‑level queries when retrieval confidence is low. If the tool cannot ground a citation in a verifiable record, refuse rather than invent.
  • Integrate automated verification APIs (CrossRef, DOI resolution, WorldCat) as a final gating step before presenting citations. Automated DOI resolution can catch many fabricated or mistargeted references (a minimal gate is sketched after this list).
  • Make RAG the default for factual and archival queries, and ensure index freshness and curation — especially for domain‑specific corpora (legal databases, national archives, medical literature).
These are practical engineering steps that lower hallucination rates; they are not panaceas. Freshness, index quality and contractually guaranteed data handling are also necessary to make provenance meaningful.
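As a minimal sketch of that final gate (assuming a placeholder `resolve_citation` lookup that would be wired to a CrossRef or WorldCat check such as the one above), the design point is that unverifiable citations are withheld and replaced with an explicit notice, never presented as fact:

```python
# Final verification gate: citations that fail resolution never reach the reply.
# `resolve_citation` is a placeholder for a CrossRef/WorldCat-style lookup.

def resolve_citation(citation: dict) -> bool:
    """Placeholder: return True only if the citation maps to a verifiable record."""
    return False  # stub; wire up a DOI or catalogue lookup here

def gate_citations(citations: list) -> dict:
    verified = [c for c in citations if resolve_citation(c)]
    dropped = len(citations) - len(verified)
    notice = (f"{dropped} suggested reference(s) could not be verified and were withheld."
              if dropped else None)
    return {"citations": verified, "notice": notice}
```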

Governance, legal and policy considerations​

  • Regulatory and procurement regimes should require vendors to disclose grounding sources and to provide audit logs for model interactions used in public‑facing deliverables.
  • Professional bodies (bar associations, medical boards, editorial boards) must clarify that the use of AI does not shift responsibility for accuracy. Practitioners remain accountable for verifying authorities and sources they cite. Courts already enforce this in several jurisdictions.
  • Institutions should update contracts with vendors to require verifiability guarantees, retention of prompt/response logs, and the ability to audit vendor verification claims. Procurement clauses must explicitly address AI tool usage and deliverable acceptance criteria to avoid downstream disputes.

Strengths and why AI still belongs in research workflows (with controls)​

Despite the risks, generative AI offers real productivity benefits when used properly:
  • Rapid synthesis: models accelerate literature scanning and produce readable first drafts.
  • Discovery assistance: they surface keywords, related authors and potential search paths that can save human researchers time.
  • Accessibility: AI can lower barriers for researchers who need quick orientations to unfamiliar fields.
The caveat is constant: these benefits accrue only when a human expert verifies and stamps the output before it enters the scholarly record. Treat AI as a research assistant, not a final authority.

What to watch next (short horizon signals)​

  • Vendor rollouts that attach machine‑readable provenance to every factual claim and implement DOI/metadata verification gates.
  • Institutional policy shifts requiring AI‑use disclosure for publication and procurement.
  • The maturation of independent audits (newsroom, academic, regulatory) measuring longitudinal progress on hallucination rates.
  • Automated verification tools integrating into manuscript submission systems and editorial workflows.
If these signals appear widely and quickly, the operational burden on archives and editorial desks should decline; absent them, the status quo of ad‑hoc verification will remain costly.

Conclusion​

The ICRC’s public advisory is an operationally meaningful alarm bell: mainstream chatbots routinely produce plausible‑looking archival citations that may not exist. Peer‑reviewed experiments and multi‑market audits confirm the pattern — hallucinated citations are common enough to demand system‑level mitigation. The remedy is multi‑layered: vendors must bake retrieval‑first provenance and verification into products; institutions must enforce human verification and update procurement and editorial policies; and researchers must treat AI‑suggested sources as leads to be validated, not evidence to be accepted.
Practically, adopt simple, enforceable rules today: verify every AI citation before use; require AI‑use disclosure in formal work; and favour tools that surface verifiable provenance and use conservative refusal behavior when confidence is low. Those steps preserve the productivity advantages of AI while safeguarding the integrity of research, archives and public records.
Source: NDTV Profit, “ChatGPT, Gemini, Copilot, Others Generating Research Papers, Journals That Don't Exist: Red Cross”
 
