AI Hallucination in Policing: Copilot, the Maccabi Ban and the FOI Fight

The decision to bar Maccabi Tel Aviv supporters from an Aston Villa Europa League match in Birmingham last November has detonated into a test case for how police forces use, and scrutinise, the outputs of generative AI. A fabricated football fixture has been traced to Microsoft Copilot, and legal and transparency fights are now under way over the AI chat logs that helped shape operational decisions. (theverge.com)

Background

The Safety Advisory Group (SAG) for Birmingham recommended that Maccabi Tel Aviv’s travelling supporters should not attend the match on 6 November 2025. That recommendation relied in part on an intelligence pack compiled by West Midlands Police (WMP). Subsequent inspection and parliamentary scrutiny revealed multiple inaccuracies in that pack, chief among them a citation of a non-existent West Ham v Maccabi Tel Aviv fixture — an item later linked to output produced by Microsoft Copilot. The discovery shifted the story from disputed tactics to a national controversy about AI, evidence, community trust and accountability.
His Majesty’s Inspectorate of Constabulary’s review and parliamentary hearings exposed problems ranging from confirmation bias and weak provenance to leadership failings. The Home Secretary publicly stated she had “no confidence” in the force’s chief constable, prompting further inquiries and a referral to the Independent Office for Police Conduct (IOPC).

What happened, in plain terms​

  • An analyst or officer used Microsoft Copilot as part of open‑source research into past incidents involving Maccabi fans.
  • Copilot produced a plausible‑sounding but false reference to a West Ham–Maccabi fixture.
  • That assertion was included in the intelligence pack presented to the SAG.
  • The SAG recommended excluding travelling fans; the match went ahead without away fans.
  • Later, journalists, parliamentarians and inspectors found the fictitious match, traced it back to Copilot, and concluded the dossier contained a catalogue of overstated or unverified claims. (theverge.com)
This sequence shows a classic pattern: an AI hallucination migrated into operational decision‑making because human verification and documentation failed. The operational consequences — a restriction on movement for a group of supporters, community harm, and political fallout — were therefore concrete, not hypothetical.

Why the chat logs matter: public interest, transparency and litigation

At the centre of the current legal tussle are the Copilot prompts and the responses that produced the fabricated item. Those chat logs are now potentially evidentiary: they show how the intelligence was built, who asked what, and whether the AI output was treated as corroborated fact.
A journalist who used Freedom of Information (FOI) procedures to request the prompts and the Copilot responses says West Midlands Police refused, invoking the law‑enforcement exemption — specifically Section 31(1)(g) of the UK Freedom of Information Act — arguing that disclosure would, or would be likely to, prejudice investigations. The force’s reliance on Section 31 effectively blocks public access to the AI prompts while the inspectorate and the IOPC examine the case.
That refusal is legally plausible but politically charged. In theory, the Section 31 exemption is qualified and subject to a **Public Interest Test (PIT)**: authorities must balance the public interest in disclosure (transparency, learning lessons about AI in policing) against the public interest in preserving the integrity of investigations. Where a statutory watchdog such as the IOPC is actively investigating, disclosure can be judged likely to prejudice those enquiries, and so the PIT can fail while the investigation is live. The journalist reports they confirmed with the IOPC press office that the watchdog intends to examine “the use of AI, including the prompts used and responses generated,” which makes an immediate successful PIT challenge unlikely. That same contact suggested the logs will be important evidence for any subsequent misconduct or systems‑level review.

The actors: Microsoft, West Midlands Police and the IOPC​

Microsoft Copilot and the company response​

Microsoft publicly stated — in comments reported by technology press — that it was unable to reproduce the behaviour being described and emphasised that Copilot surfaces content from multiple web sources and displays linked citations; Microsoft’s spokesman urged users to review Copilot’s cited sources. In short, Microsoft framed this as a usage and verification problem rather than a product defect that it could demonstrate on demand. (theverge.com)
That statement is technically accurate but incomplete from a governance perspective. Copilot and similar assistants combine retrieval mechanisms, web index content, and probabilistic language models: even where a product produces citations, the synthesis process and retrieval pipeline mean some outputs can still be misleading, mis‑ordered or ungrounded if the human operator does not verify the primary sources. The feature that surfaces citations does not eliminate the need for strict human verification in rights‑affecting decision‑making.

West Midlands Police​

WMP initially gave different explanations in parliamentary testimony: first attributing the error to “social media scraping,” then to a Google search, and only later acknowledging Copilot’s role when internal review made the provenance clear. That sequence — inaccurate characterisation of the source followed by correction — is a major part of the accountability problem: it increased the perception that decision‑makers were attempting to minimise or misrepresent the role of AI in the dossier. (theverge.com)
The inspectorate’s review and sustained political pressure led to senior personnel changes and referrals. The force suspended Copilot use “until further notice” and announced internal changes, including bespoke training and reviews of intelligence‑handling pathways.

The Independent Office for Police Conduct (IOPC)​

The IOPC is the statutory watchdog for policing conduct and has been briefed on this matter; the IOPC referral and involvement mean that any FOI release that might interfere with their fact‑finding could be blocked while they investigate. The IOPC’s remit allows it to consider procedural and misconduct elements and to request or seize evidence when determining whether officers’ actions met professional standards.

The legal mechanics: FOI, Section 31 and the Public Interest Test​

Freedom of Information law in the UK contains a well‑worn exemption at Section 31 for law enforcement, intended to protect ongoing investigations and the effectiveness of policing and prosecutorial functions. When a public authority refuses an FOI request on Section 31 grounds, it must apply a Public Interest Test: decide whether the public interest in disclosure outweighs the interest in maintaining the exemption.
Practical realities highlighted by this case:
  • While an investigation is active, the threshold for disclosure is high: revealing operational material could prejudice lines of enquiry, tip off witnesses or compromise evidence handling.
  • Once the IOPC and other inquiries conclude, the balance can shift decisively toward disclosure: the public interest in transparency and learning is stronger when the risk of prejudicing enforcement has passed.
  • There is therefore a narrow temporal window where FOI litigation or a strong PIT argument could win: typically after investigations finish. Attempting to compel disclosure mid‑investigation is legally uphill.
This explains the journalist’s calculus: the refusal may be defensible now, but once investigations and any misconduct proceedings are complete, the same logs will be far more likely to be released, and their disclosure will arguably become ethically imperative.

Why the logs matter beyond curiosity​

The prompts and responses are not merely a curiosity for tech reporters. They are potentially a blueprint of how operational intelligence was prepared:
  • They show the exact prompts issued (which indicate what question the analyst asked).
  • They show the resulting text Copilot returned (the piece that allegedly contained the fabricated fixture).
  • They may show whether the analyst saved or edited the output, what follow‑up queries were run, and how the output migrated into the formal intelligence product.
That chain reveals whether the AI output was treated as evidence rather than a lead to be verified, and whether the force’s processes required corroboration before using AI‑derived material in high‑stakes decisions. For future audits, training, procurement and policy, those are substantive lessons.

Technical explanation: why Copilot can “hallucinate”​

Generative assistants like Copilot combine retrieval‑augmented generation, indexed web content and large language models. Even when a system returns citations, three practical failure modes arise:
  • Retrieval failure: the RAG layer might return weak or irrelevant documents, or the indexed web content may itself be inaccurate; the model synthesises language that appears coherent but is not grounded in primary evidence.
  • Synthesis/hallucination: the language model can invent plausible details to bridge gaps, especially for prompts that ask for historical or timeline specifics.
  • Human process failure: an unverified AI output is treated as an authoritative claim and included in a product without clear provenance or corroboration.
This is not a bug unique to Microsoft Copilot; it’s an architectural risk of probabilistic generation + incomplete human oversight. Systems mitigate this by adding provenance metadata, forcing citations to link to primary documents and creating human‑in‑the‑loop gating before outputs are used for decisions affecting rights or safety. But those mitigations are only effective when consistently applied in practice.
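As a concrete illustration of that last mitigation, here is a minimal sketch of human‑in‑the‑loop gating in Python. The `Claim` record, its field names and the `gate_claim` check are assumptions invented for this example, not any vendor’s or force’s actual tooling:

```python
# Minimal sketch of human-in-the-loop gating for AI-derived claims.
# All names (Claim, gate_claim) are illustrative assumptions, not a real API.
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str                       # the assertion extracted from the AI output
    tool: str                       # e.g. "Microsoft Copilot"
    sources: list[str] = field(default_factory=list)  # primary-source references
    verified_by: str | None = None  # named analyst who corroborated the claim


def gate_claim(claim: Claim) -> bool:
    """Allow a claim into an intelligence product only if it is grounded in
    at least one primary source and signed off by a named analyst."""
    if not claim.sources:
        raise ValueError(f"Ungrounded AI claim blocked: {claim.text!r}")
    if claim.verified_by is None:
        raise ValueError(f"Unverified AI claim blocked: {claim.text!r}")
    return True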

Policy and governance implications — immediate, medium and long term​

Immediate steps public organisations should take (0–3 months)​

  • Freeze any reuse of generative AI tools for producing or compiling intelligence that will inform rights‑affecting multi‑agency decisions until strict verification rules are in place. WMP has already suspended Copilot pending review.
  • Mandate provenance fields: every factual claim included in intelligence products should be accompanied by a provenance line recording who produced it, what tool generated it, and the primary sources that corroborate it (a minimal sketch of such a line follows this list).
  • Retrospective audits: review recent high‑impact decisions to determine whether AI‑assisted research was used and whether it played a material role.
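For the provenance‑fields item above, this is one way such a per‑claim line could be rendered. The `[prov: ...]` convention, the field names and the `provenance_line` helper are illustrative assumptions, not an established schema:

```python
# Sketch of a one-line provenance footnote to accompany each factual claim
# in an intelligence pack. Format and field names are assumptions.
from datetime import date


def provenance_line(author: str, tool: str, sources: list[str],
                    checked_on: date) -> str:
    """Render who produced the claim, which tool generated it, and the
    corroborating primary sources, as a single auditable line."""
    src = "; ".join(sources) if sources else "NO PRIMARY SOURCE - DO NOT USE"
    return f"[prov: {author} via {tool}, corroborated {checked_on:%d %b %Y}: {src}]"


print(provenance_line("Analyst 1234", "Microsoft Copilot",
                      ["uefa.com fixture archive"], date(2025, 11, 1)))
```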

Medium term (3–12 months)​

  • Procurement standards for vendor AI: contracts should require auditable logs, retention policies, and demonstrable provenance and red‑team testing from suppliers.
  • Training and certification: analysts and commanders must be trained in “prompt scepticism” and in treating AI outputs as leads, not facts.
  • Process re‑engineering: insert mandatory independent verification steps before AI‑sourced claims enter multi‑agency governance forums like SAGs.

Strategic (12+ months)​

  • Regulatory frameworks: sectoral rules for “AI in public safety” that require auditable provenance, retention of logs (with appropriate safeguards) and oversight by independent bodies.
  • Technical improvements: invest in tooling that tightly couples generated assertions to relevant source links and confidence scores, and that prevents copy‑paste of AI‑generated text into official documents without flagging and human checks (see the sketch after this list).
  • Public reporting: publish redacted post‑mortems of AI‑affected decisions to restore public trust and create usable case studies.
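The copy‑paste guard referenced above could be as simple as the following sketch: reject or flag any AI‑derived paragraph that carries no citation marker before it is committed to an official document. The `[src: ...]` convention and the `flag_unsourced` helper are assumptions for illustration only:

```python
# Minimal copy-paste guard: flag AI-derived paragraphs with no citation marker.
# The "[src: ...]" convention is an assumed, illustrative format.
import re

CITATION = re.compile(r"\[src:\s*\S+")


def flag_unsourced(paragraphs: list[str]) -> list[str]:
    """Return the AI-derived paragraphs that lack any citation marker."""
    return [p for p in paragraphs if not CITATION.search(p)]


draft = [
    "Maccabi fans clashed with police at West Ham in 2024.",      # no source: flagged
    "Disorder reported at the Ajax fixture. [src: court record]",  # sourced: passes
]
for p in flag_unsourced(draft):
    print("FLAG - unsourced AI-derived claim:", p)
```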

What the press and the company have said — verified facts​

  • West Midlands Police acknowledged a Copilot‑generated error and its chief constable later apologised to MPs after earlier mischaracterisations. (theverge.com)
  • Microsoft told journalists it could not reproduce the specific hallucination reported and reiterated that Copilot provides linked citations and encourages users to review sources. (theverge.com)
  • The inspectorate’s review and parliamentary scrutiny led to fierce political criticism and personnel consequences at WMP, and the matter was referred to the IOPC for possible misconduct examination.
  • WMP suspended Copilot use and announced organisational reviews and training following the incident.
These are load‑bearing facts that are independently reported by multiple outlets and reflected in inspection summaries. (theverge.com)

The legal tightrope: can the FOI denial be challenged?​

Yes — but timing matters.
  • While the IOPC investigation and any linked inquiries are active, courts and information commissioners typically give significant weight to the risk of prejudicing enforcement processes. That is why WMP’s invocation of Section 31 and its denial of the FOI request are legally defensible while investigations are live.
  • After the IOPC completes its work (and any misconduct proceedings are resolved), the public interest calculus shifts sharply toward disclosure. At that point the FOI requester would have stronger grounds for a PIT challenge and, if refused again, for an appeal to the Information Commissioner and possibly judicial review.
  • Practically, a phased approach is also possible: courts or regulators can oversee redacted or tightly controlled disclosure (e.g., limited to independent watchdogs or under protective orders) to preserve investigative integrity while satisfying transparency concerns.
In short: an immediate public‑interest victory is unlikely while the IOPC has an active remit; the best opportunity for release—or for litigating in the public interest—will come after the watchdog’s processes conclude.

Risks and trade‑offs: transparency versus investigative integrity​

  • The public interest in transparency is high: the logs would teach lawyers, technologists and the public how AI contributed to a decision that restricted people’s rights and harmed community trust.
  • The public interest in effective investigations is also high: premature disclosure can compromise witness cooperation, reveal investigatory technique, and impede fact‑finding.
  • There is a third, cumulative risk: policy capture and complacency. If agencies successfully withhold AI usage details for long periods, lessons about safe deployments may not be learned system‑wide and similar errors will recur.
Policymakers should therefore prefer structured transparency — protective disclosures to watchdogs, timed public release of redacted records post‑investigation, and mandated auditing by independent reviewers. That approach balances the competing interests while creating learning opportunities.

What tech teams and IT leaders inside public services should do now​

  • Inventory all generative AI deployments and usage patterns across the organisation.
  • Immediately stop using generative assistants to compile material that will be used unverified in rights‑affecting decisions.
  • Implement mandatory provenance metadata and retention for all outputs used in decision chains.
  • Require analysts to archive prompts, model responses and the follow‑up verification steps as part of an auditable workflow; a minimal sketch of such a record follows this list.
  • Build a “red‑team” review process that simulates hallucination scenarios to assess how the organisation would detect and contain fabricated outputs before they cause harm.
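For the archiving item above, one minimal shape is an append‑only log where each prompt, response and verification step becomes a single JSON line. The file format, field names and the `log_ai_interaction` helper are assumptions, not a standard:

```python
# Sketch of an append-only audit trail for AI interactions.
# Record fields and JSONL layout are illustrative assumptions.
import hashlib
import json
from datetime import datetime, timezone


def log_ai_interaction(path: str, analyst: str, tool: str,
                       prompt: str, response: str, verification: str) -> None:
    """Append one auditable record; hashing the response lets later reviewers
    prove the archived text is the text the analyst actually saw."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "analyst": analyst,
        "tool": tool,
        "prompt": prompt,
        "response_sha256": hashlib.sha256(response.encode()).hexdigest(),
        "response": response,
        "verification": verification,  # e.g. "confirmed against UEFA records"
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```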
These steps are operationally achievable and materially reduce the chance that an AI hallucination becomes policy.

Final analysis: systemic lessons and the road ahead​

The WMP/Copilot episode is a painful but clear case study: generative AI can accelerate research and produce helpful syntheses, but it is fundamentally probabilistic, and its outputs must never be mistaken for verified evidence without corroboration. The real failure here is not merely a product hallucination; it is a governance failure — weak processes, poor documentation, and leadership that did not ensure auditable verification before operational decisions were made.
Three final, pragmatic takeaways:
  • Technical fixes alone are insufficient. Vendor controls (better provenance, explainability) help, but organisations must embed processes that treat AI outputs as hypotheses requiring human verification.
  • Law will lag events. FOI rules, watchdog mandates and procurement law create windows where information is legitimately withheld to protect investigations, but policy must require release and redress after those processes finish.
  • Transparency is a public good when timed correctly. The AI prompts and responses will be invaluable to auditors, regulators, and civic society — but the timing and the manner of disclosure must be managed to protect the integrity of ongoing statutory investigations.
The core lesson for any IT, security, or procurement leader is simple: adopt a presumption of scepticism toward AI‑generated assertions in high‑stakes processes and require auditable provenance and verification as mandatory controls. The WMP case will be dissected for months — and the prompts and responses, if and when they become public, will be central to understanding not only what went wrong, but how to make sure it does not happen again. (theverge.com)

Conclusion
The controversy around West Midlands Police’s use of Microsoft Copilot is a watershed for public‑sector AI governance. It exposes the practical harms that flow when probabilistic systems are allowed to feed into decisions that affect civil liberties, and it underlines the necessity of operational discipline, vendor accountability and carefully designed transparency pathways. The immediate legal battle over the FOI request is only one front in a much larger institutional challenge: rebuilding trust through demonstrable safeguards, independent audit, and a public record of lessons learned once statutory investigations permit disclosure. (theverge.com)

Source: sUAS News, “Computer Says No: The Legal Battle for the WMP’s Secret AI Chat Logs”
 
