The UK government’s recent trial of AI coding assistants has delivered striking headline figures: developers reported almost an hour saved per working day, equivalent to roughly 28 working days a year. But the programme also exposes the tough trade‑offs that come with rapid AI adoption in high‑security, high‑impact codebases. In a roughly three‑month experiment spanning more than 50 departments, the Government Digital Service (GDS) deployed mainstream tools, including GitHub Copilot and Google’s Gemini Code Assist, to well over a thousand public‑sector engineers. The results show clear productivity wins for routine, templated tasks and first‑draft generation, paired with persistent concerns about code quality, security posture, and the downstream cost of remediation. (gov.uk)

Background: what the trial was and what it measured

Scope and scale​

GDS ran an AI coding assistant trial from November 2024 through February 2025 to evaluate how industry AICAs (AI coding assistants) could help public‑sector engineering teams. The programme made 2,500 licences available to more than 50 central government organisations; 1,900 licences were assigned and thousands of engineer interactions logged. The trial primarily assessed time‑savings, user satisfaction, telemetry (suggestion acceptance rates and usage patterns), and qualitative user feedback. (gov.uk)

Tools tested and primary use cases​

The trial focused on mature, off‑the‑shelf coding assistants available at the time: GitHub Copilot and Google’s Gemini Code Assist were the principal tools examined. Participants used the assistants for:
  • Drafting initial code and scaffolding
  • Reviewing and refactoring existing code
  • Generating tests and small utility functions
  • Researching examples and finding snippets
Most reported time savings came from first‑draft generation and code review assistance — where the assistant delivers a starting point that human engineers then edit and harden. (gov.uk)

The headline numbers — what the government reported​

  • Average time saved per developer: about 56–60 minutes per working day (commonly reported as “almost an hour”), which the government translates to ~28 working days per year saved per developer; a back‑of‑the‑envelope check of that conversion appears below. (gov.uk)
  • Licences distributed vs redeemed: 2,500 licences offered; 1,900 assigned; of these, 1,100 GitHub Copilot licences redeemed and 173 Gemini Code Assist licences redeemed in the dataset GDS published. (gov.uk)
  • Acceptance and reuse: telemetry showed low raw acceptance rates for line‑level suggestions (GitHub Copilot acceptance of roughly 15.8%), and only a minority of outputs were used unchanged: roughly 15% of AI‑generated code was used without edits. Users reported committing suggested code less than half the time. (gov.uk)
  • User sentiment: 72% of participants said the tools offered good value for their organisation; 65% reported completing tasks more quickly; 56% said they solved problems more efficiently; and 58% said they would prefer not to return to working without such assistants. (gov.uk)
These numbers create a clear narrative: engineers perceive and experience meaningful productivity gains, but the telemetry and acceptance rates show the output rarely becomes production‑ready without human intervention. (gov.uk)
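As a sanity check, the conversion from minutes per day to working days per year is simple arithmetic. The sketch below reproduces it under stated assumptions; the working‑day count and hours‑per‑day figure are illustrative choices, not GDS’s published methodology.

```python
# Back-of-the-envelope check: minutes saved per day -> working days saved per year.
# WORKING_DAYS_PER_YEAR and HOURS_PER_WORKING_DAY are illustrative assumptions,
# not figures taken from the GDS evaluation.

MINUTES_SAVED_PER_DAY = 56      # lower end of the reported 56-60 minute range
WORKING_DAYS_PER_YEAR = 220     # assumed, after leave and public holidays
HOURS_PER_WORKING_DAY = 7.4     # assumed standard working day

minutes_saved_per_year = MINUTES_SAVED_PER_DAY * WORKING_DAYS_PER_YEAR
days_saved_per_year = minutes_saved_per_year / (HOURS_PER_WORKING_DAY * 60)

print(f"~{days_saved_per_year:.0f} working days saved per developer per year")
# With these assumptions the result is roughly 28 days, consistent with the reported figure.
```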

Where the time savings actually come from​

First drafts, scaffolding and lookups​

The trial found that the largest chunk of time saved was in generating first drafts of code — boilerplate, helper functions, and small components that are time‑consuming to write from scratch. Another major saving came from searching for examples and documentation; assistants reduce the time spent toggling between IDE, browser and internal docs by surfacing relevant snippets and API usage. (gov.uk)

Code review as a force multiplier​

Engineers also used AICAs to pre‑review their own changes or to generate suggestions for pull requests. The assistant’s suggestions can speed iteration in code review workflows, but human reviewers still perform the safety and security checks before merging. In practice, assistants accelerate low‑risk, repetitive review work rather than replace critical gatekeeping. (gov.uk)

Real‑world caveats: what the telemetry and independent studies reveal​

The government’s self‑reported time savings are credible and consistent with independent industry research showing large productivity improvements for routine tasks when AI is used as a drafting and research aid. Still, multiple independent studies and industry voices underline significant caution:
  • Low acceptance rates and frequent remediation: The telemetry shows a small share of suggested lines are accepted in‑place, and most AI outputs are edited. Experts emphasise the hidden cost of remediation — the time spent debugging, securing, and integrating generated code into existing systems. (gov.uk)
  • Security and vulnerability risks: Recent industry research (including large vendor studies) has repeatedly found that AI‑generated code can introduce security weaknesses at scale. Veracode’s 2025 research and other contemporaneous analyses found alarmingly high rates of insecure patterns and OWASP‑class vulnerabilities in AI outputs, particularly when prompts do not explicitly include security constraints. These results show that syntactic correctness does not imply security or architectural soundness. (afp.com)
  • Quality and maintainability issues: Independent surveys and studies indicate AI output often lacks defensive programming, introduces duplication, and can bloat codebases — issues that elevate maintenance costs over time. Many teams report having to refactor or discard portions of generated code. (infoworld.com)
  • Experience matters: Senior engineers tend to extract more value and spot subtle issues faster than junior staff. Several industry voices argue that inexperienced developers are more likely to over‑trust AI outputs and may fail to detect security or logical flaws. (thenewstack.io)
Taken together, these findings show an important pattern: AI assistants materially speed the drafting phase, but the downstream assurance work — testing, security scanning, integration, and manual verification — remains both necessary and potentially expensive. (gov.uk)

Expert reactions inside and outside government​

Government and ministers​

The government has seized on the trial as proof that AI can be a lever in the wider “Plan for Change” efficiency agenda, with ministers highlighting the potential to reclaim substantial time and target large savings across public services. The political push to scale AI across government is explicit, including ambitions to realise up to £45 billion in efficiencies. The trial’s positive user sentiment is being used to justify a cautious roll‑out. (wired-gov.net)

Industry voices raising caution​

Industry experts and vendor‑neutral security specialists have underlined that productivity gains must be balanced against systemic risks:
  • Martin Reynolds, Field CTO at Harness, welcomed the trial but warned that the “velocity boost” is only the beginning. He pointed out that around 85% of AI‑generated code still needed manual edits in many settings and that the downstream stages — testing, security scanning, deployment verification — are where the major time costs and risks accumulate. (techradar.com)
  • Nigel Douglas from Cloudsmith highlighted the lack of secure‑by‑design thinking in many AI workstreams and urged explicit provenance, supply‑chain verification, and tooling to detect AI‑introduced vulnerabilities before code reaches production.
  • Security research from multiple vendors (Veracode, Apiiro and others) has flagged a growing incidence of vulnerabilities in AI‑generated code, reinforcing that human review — plus automated, security‑oriented scanning — must remain central to delivery pipelines. (afp.com)
These cautionary perspectives align: AI helps, but it must be integrated with a hardened SDLC rather than bolted on as a productivity trick. (thenewstack.io)

Where the practical risks live — a technical breakdown​

1) Hallucinations and incorrect assumptions​

AICAs can produce plausible but incorrect code, or fabricate APIs, dependencies, or behaviours that do not exist in the target environment. In sensitive government systems, such hallucinations can propagate silently and cause incorrect outputs to be merged if unchecked. (infoworld.com)

2) Security regressions and insecure defaults​

AI tends to prefer short, working examples, which often omit secure defaults (input validation, proper error handling, least privilege). Studies show a non‑trivial rate of OWASP‑class vulnerabilities in AI‑generated code unless explicitly constrained by prompts and governance. (afp.com)

3) Increased technical debt and duplication​

AICAs frequently generate code that is syntactically correct but inefficient, duplicated, or inconsistent with existing codebase patterns and abstractions. Over time, this can increase maintenance burden and obscure architectural intent. (thenewstack.io)

4) Supply‑chain provenance and licence risk​

Generated snippets can mirror open‑source patterns or include external package references without clear provenance; organisations need to ascertain whether outputs include problematic licences or references to unvetted repositories. Procurement and legal teams must insist on contractual controls to prevent inadvertent exposure.

5) Data leakage and model training concerns​

When proprietary or sensitive code or prompts are fed into third‑party cloud models, there’s a risk that this data could be retained or used to further train vendor models unless contractual protections (and technical controls) explicitly forbid that. Public bodies have rightly prioritised “no‑training” or private hosting controls in procurement.

How to get the upside without the downside — pragmatic controls for public sector deployments​

Adopting AICAs at scale in government requires an engineering and governance playbook. The trial’s lessons suggest the following practical controls:
  • Designate trusted use cases first. Start with low‑risk, high‑reward scenarios: internal utilities, test generation, and boilerplate code in non‑safety‑critical services.
  • Enforce a human‑in‑the‑loop requirement. No AI‑generated code should be merged to production without an explicit human review and automated security checks. Make that a policy step in CI pipelines; a minimal gate of this kind is sketched after this list. (gov.uk)
  • Embed security in prompts and policy templates. Provide standardised prompt templates that force security‑oriented constraints (e.g., “generate code following least privilege and with input validation”) and measure compliance. (afp.com)
  • Operate model provenance and telemetry. Maintain thread‑level observability: which model version produced an output, when, and in response to which prompt. This supports incident triage and audit.
  • Isolate sensitive workloads. For critical systems, use private model hosting or on‑premises inference so proprietary code never leaves government control. Contract clauses must forbid vendor‑side model training on government inputs.
  • Automate security testing and policy enforcement. Add static analysis, dependency checks and policy gates in CI that run automatically on any AI‑sourced change. Treat AI outputs as suspect until proven otherwise. (securitytoday.com)
  • Invest in developer training. Senior engineers will realise the greatest net benefits; invest in upskilling junior staff to spot AI‑introduced weaknesses and to use these tools safely. (thenewstack.io)
  • Metricise the downstream cost. Measure not just hours saved in drafting but also review time, remediation time, security findings per AI‑sourced PR, and production incident attribution. Only with these metrics can organisations truly judge ROI.
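To make the human‑in‑the‑loop, provenance and security‑gate controls concrete, the sketch below shows one way a CI policy step could refuse to merge AI‑sourced changes that lack provenance, human review, or a clean security scan. The file names, JSON fields and severity threshold are assumptions for illustration only; they are not part of any published GDS or departmental pipeline.

```python
"""Minimal CI policy gate for AI-assisted changes (a sketch, not a production tool).

Assumes, hypothetically, that each AI-sourced pull request carries an
ai_provenance.json file recording which tool and model version produced the
suggestions and who reviewed them, and that an earlier CI stage has written a
security_findings.json report.
"""
import json
import sys
from pathlib import Path


def load_json(path: str) -> dict:
    """Load a JSON artefact produced by an earlier pipeline stage."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def gate(provenance: dict, findings: dict) -> list[str]:
    """Return a list of policy violations; an empty list means the gate passes."""
    violations = []

    # 1) Provenance: the change must identify the assistant and model version used.
    if not provenance.get("tool") or not provenance.get("model_version"):
        violations.append("AI-sourced change is missing tool/model provenance.")

    # 2) Human in the loop: at least one named reviewer must have approved.
    if not provenance.get("human_reviewers"):
        violations.append("No human reviewer recorded for AI-sourced change.")

    # 3) Security: the scan must have completed with no high-severity findings.
    if not findings.get("scan_completed", False):
        violations.append("Security scan did not complete.")
    high = [f for f in findings.get("issues", []) if f.get("severity") == "high"]
    if high:
        violations.append(f"{len(high)} high-severity finding(s) in AI-sourced code.")

    return violations


if __name__ == "__main__":
    problems = gate(load_json("ai_provenance.json"), load_json("security_findings.json"))
    if problems:
        print("AI policy gate FAILED:")
        for p in problems:
            print(f"  - {p}")
        sys.exit(1)  # a non-zero exit blocks the merge in most CI systems
    print("AI policy gate passed.")
```

The design intent is that AI‑sourced changes fail closed: missing provenance, an absent human reviewer, or unresolved high‑severity findings block the merge rather than merely emitting a warning.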

Strategic implications for procurement and vendor management​

The trial shows that popular AICAs are mature enough to be useful, but procurement must shift from feature checklists to model governance. Contracts should include:
  • Explicit non‑training clauses and data‑use terms
  • Model versioning and access to logs for audits
  • SLAs for detection and remediation of model‑linked incidents
  • Exit and portability clauses to avoid vendor lock‑in
Public buyers should demand proof points and technical runbooks demonstrating how vendors will handle sensitive inputs and provide isolated hosting or dedicated on‑tenant models. Rushed procurement without these safeguards risks embedding long‑term operational and security liabilities.

The cultural change: how teams must adapt​

AI coding assistants will change how engineering teams organise work. The trial points to several cultural shifts:
  • Treat AI outputs as drafts, not deliverables.
  • Reward activities that catch AI errors — code review, security testing, and architecture work — just as highly as feature delivery.
  • Rebalance hiring and training: more emphasis on senior engineering judgement, security expertise, and systems thinking.
  • Embed continuous measurement and transparent reporting across departments so public trust is maintained as AI is scaled. (thenewstack.io)

What the trial does not yet prove — hard limits and unanswered questions​

The government’s trial is an important first step, but it does not answer several critical questions:
  • Will the measured time savings persist as AICAs are integrated more deeply, or will remediation costs rise non‑linearly as usage widens into complex systems?
  • Can procurement and legal teams reliably enforce non‑training clauses over the long term and across multiple vendors?
  • How do we prevent skill atrophy (teams overly reliant on AI for routine tasks) while preserving human judgement where it matters?
  • What is the environmental impact of scaling inference workloads across government, and how should that factor into procurement and carbon accounting? Some departmental pilots already flagged environmental concerns as an area for further study.
These open points mean scaling must stay conditional on measurable safety gates rather than political convenience.

Bottom line and recommended next steps for public‑sector IT leaders​

The GDS trial demonstrates that mainstream AI coding assistants can deliver material productivity gains for routine engineering tasks across government. The question is not whether the tools work (they do) but whether organisations can capture that upside while preventing the well‑documented downsides: security regressions, increased maintenance, and supply‑chain risk.
Recommended immediate actions:
  • Expand pilots into targeted, high‑value, low‑risk areas while enforcing strict human review and automated security gates. (gov.uk)
  • Insert contractual non‑training and provenance requirements into all AICA procurements.
  • Mandate telemetry that tracks not just time saved but remediation time, security findings and PR‑level acceptance rates. Use these metrics to decide scale‑up; a simple remediation‑adjusted scorecard is sketched after this list.
  • Invest in training and senior oversight so the most experienced engineers guide the adoption curve. (thenewstack.io)
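To illustrate the telemetry recommendation, here is a minimal sketch of a remediation‑adjusted scorecard for AI‑sourced pull requests. The field names and sample values are hypothetical and exist only to show the shape of the measurement, not GDS’s actual metrics.

```python
# Illustrative net-benefit scorecard for an AI-sourced pull request.
# All field names and the sample values are hypothetical.

from dataclasses import dataclass


@dataclass
class AiPrMetrics:
    drafting_minutes_saved: float   # estimated time saved drafting the change
    review_minutes: float           # human review time spent on the change
    remediation_minutes: float      # time spent fixing defects or security findings afterwards
    security_findings: int          # findings attributed to the AI-sourced change


def net_minutes_saved(m: AiPrMetrics) -> float:
    """Headline drafting saving minus the downstream cost of review and remediation."""
    return m.drafting_minutes_saved - (m.review_minutes + m.remediation_minutes)


example = AiPrMetrics(
    drafting_minutes_saved=60,
    review_minutes=20,
    remediation_minutes=25,
    security_findings=1,
)
print(f"Net saving: {net_minutes_saved(example):.0f} minutes; "
      f"security findings: {example.security_findings}")
```

Aggregated per team and per service, these are the numbers that should drive scale‑up decisions, not drafting time alone.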
The narrative that AI will “automatically” free up tens of millions of hours and pay for itself is seductive; the government’s figures show substantial promise. Yet the honest, practical conclusion from this trial and independent research is that AI is a powerful drafting and research assistant, not a replacement for engineering judgment or security practice. Managed well, with the right governance and tooling, AI coding assistants can be a force multiplier for public‑sector engineering — but scaling without those guardrails risks amplifying technical debt and security exposure across critical national services. (gov.uk)

Conclusion
The UK trial is an early but instructive case study in real‑world AI adoption at scale. It shows engineers are willing and able to use AICAs productively, that measurable time savings exist, and that benefits are immediate for routine tasks. It also exposes the predictable counterpoint: most AI output is a starting point, not a finish line, and governments must invest as much in governance, secure pipelines, and organisational change as they do in licences. If public services treat these trials merely as a licence procurement exercise, they will miss the deeper transformation required to harness AI safely and sustainably. If, instead, they apply disciplined gates, measurement, and a “trust but verify” engineering culture, the payoffs could be real — and defensible — for taxpayers and citizens alike. (gov.uk)

Source: IT Pro UK government programmers trialed AI coding assistants from Microsoft, GitHub, and Google, reporting huge time savings and productivity gains – but questions remain over security and code quality
 
