The UK Department for Business and Trade’s three‑month pilot of Microsoft 365 Copilot returned a cautious but informative verdict: users reported high satisfaction and clear wins on text‑based tasks, yet the measurable productivity gains were small, use‑case dependent, and offset in places by accuracy and governance risks. (assets.publishing.service.gov.uk)

Background and overview​

In October 2024 the Department for Business and Trade (DBT) issued 1,000 M365 Copilot licences to UK‑based staff for a pilot that ran through December 2024. The evaluation — a mixed quantitative and qualitative study using usage telemetry, diary logs and observed tasks — was designed to assess use cases, time savings, output quality and user satisfaction across a representative mix of volunteers (~70%) and randomly selected participants (~30%). The DBT evaluation report was published on 28 August 2025. (assets.publishing.service.gov.uk)
This departmental pilot sits alongside a larger cross‑government experiment run by the Government Digital Service (GDS) that involved around 20,000 civil servants and produced a different headline: an average of 26 minutes saved per user per day. That cross‑government experiment (publicly released 2 June 2025) emphasised larger, self‑reported time savings and strong adoption across multiple departments, illustrating how trial scope and measurement choices materially affect the headline findings. (gov.uk)

What DBT actually measured: method and scope​

Diary study, telemetry and observed tasks​

DBT combined three principal data streams:
  • Telemetry from Microsoft’s M365 Copilot dashboard to measure application usage patterns.
  • A diary study comprising three Excel sheets that collected task‑level records, satisfaction, accuracy and estimated time savings from licence holders (the diary had a 32% response rate).
  • Qualitative interviews and a limited number of observed, timed tasks to validate diary self‑reports.
The evaluation team applied standard statistical tests (Chi‑squared, Mann‑Whitney U) and normalisation to judge representativeness and to identify statistically significant differences across subgroups. DBT also adjusted time‑saving calculations for tasks flagged as “novel” — work users said they only completed because Copilot made it available. (assets.publishing.service.gov.uk)
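To make the mechanics concrete, the sketch below applies the same two tests to a small, entirely hypothetical diary dataset; the pandas layout, column names and subgroup split are illustrative assumptions rather than DBT’s actual data schema.

```python
# Illustrative only: a hypothetical diary-study extract, not DBT's actual dataset.
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu

# Hypothetical task-level diary records: recruitment group, task type, reported minutes saved.
diary = pd.DataFrame({
    "group":      ["volunteer"] * 6 + ["randomised"] * 6,
    "task_type":  ["drafting", "summarising", "scheduling"] * 4,
    "mins_saved": [12, 15, -5, 10, 18, -2, 8, 9, -4, 7, 11, -6],
})

# Chi-squared test: is the task-type mix independent of how participants were recruited?
contingency = pd.crosstab(diary["group"], diary["task_type"])
chi2, p_chi, _, _ = chi2_contingency(contingency)
print(f"Chi-squared p-value (representativeness check): {p_chi:.3f}")

# Mann-Whitney U test: do reported time savings differ between the two recruitment groups?
volunteers = diary.loc[diary["group"] == "volunteer", "mins_saved"]
randomised = diary.loc[diary["group"] == "randomised", "mins_saved"]
_, p_mwu = mannwhitneyu(volunteers, randomised, alternative="two-sided")
print(f"Mann-Whitney U p-value (difference in reported savings): {p_mwu:.3f}")
```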

Important caveats in the design​

  • The pilot’s short duration (three months, partly disrupted by the festive period) limits insight into long‑term behaviour change and habit formation.
  • Many metrics were self‑reported, which can inflate perceived time savings versus measured task timings — DBT explicitly noted caution when interpreting time‑saving claims.
  • The sample size and departmental focus (1,000 licences) make DBT’s results contextual, not necessarily generalisable across very different role mixes or organisations. (assets.publishing.service.gov.uk)

Key findings from DBT’s evaluation​

High user satisfaction, concentrated in written tasks​

  • Overall satisfaction was high: 72% of respondents said they were satisfied or very satisfied with M365 Copilot, and the Net Promoter Score (NPS) came in at 31, which DBT described as a good result (a short sketch of the NPS calculation follows this list). Satisfaction clustered most strongly in text‑centric activities such as drafting, editing and summarising documents and emails. (assets.publishing.service.gov.uk)
  • Accessibility and inclusion gains emerged as a meaningful benefit: neurodiverse staff and non‑native English speakers reported comparatively higher satisfaction, citing improved clarity, reduced friction and assistance with drafting and comprehension. (assets.publishing.service.gov.uk)
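For readers unfamiliar with the metric, NPS is simple arithmetic: the percentage of promoters (scores of 9–10 on a 0–10 likelihood-to-recommend question) minus the percentage of detractors (scores of 0–6). The sketch below runs that calculation on fabricated scores chosen to land on 31; it is not DBT’s survey data.

```python
# Illustrative NPS arithmetic on fabricated scores; not DBT's underlying survey data.
def net_promoter_score(scores: list[int]) -> float:
    """NPS = % promoters (9-10) minus % detractors (0-6) on a 0-10 scale."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)

# A hypothetical split of 100 responses: 45 promoters, 41 passives, 14 detractors.
example_scores = [10] * 45 + [8] * 41 + [5] * 14
print(net_promoter_score(example_scores))  # 31.0
```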

Time savings: real but generally small and task‑specific​

  • DBT found small time savings concentrated in written tasks — drafting, rewriting and summarising delivered the largest measurable gains. However, those gains were modest and did not translate in the pilot to clear, department‑level productivity improvements. Some tasks (notably scheduling and image generation) took longer when users employed Copilot, either because low‑quality outputs needed rework or because users took on novel tasks they would not otherwise have attempted, simply because the tool made them easy to try. (assets.publishing.service.gov.uk)
  • Contrast: the GDS cross‑government experiment aggregated across many departments and reported ~26 minutes per day saved on average — a different methodology and scale that produced a larger headline. The difference highlights how measurement choices (self‑report vs observed timing, scope of tasks, sample size) change outcomes. (gov.uk)

Accuracy, hallucinations and variable quality​

  • DBT observed inconsistencies in output quality across participants and use cases. Respondents reported seeing hallucinations in Copilot outputs — confident, incorrect or fabricated content — requiring human review. This was particularly problematic where outputs were reused without verification. (assets.publishing.service.gov.uk)
  • In data‑heavy tasks, Copilot performed less well: DBT found lower average accuracy scores for Excel data analysis when Copilot was used (versus control), and PowerPoint content created with Copilot was significantly less accurate in certain observed tasks. Conversely, report summaries produced manually with Copilot’s assistance were more accurate and delivered significant time savings. (assets.publishing.service.gov.uk)

Environmental and ethical concerns​

  • During interviews, several participants raised environmental concerns about the energy footprint of large language models and said they were less willing to use Copilot for that reason. DBT flagged a need for quantified environmental impact assessments before large‑scale adoption. The report did not produce an LCA (life‑cycle assessment) or emissions estimate; DBT called for further work in this area. (assets.publishing.service.gov.uk)

Cross‑government experiment vs departmental pilot: why headlines differ​

  • Scale and scope: DBT’s departmental pilot involved 1,000 licences and mixed volunteers/randomised participants; the GDS experiment covered roughly 20,000 licences across 12 organisations. Larger, more heterogeneous samples reduce sensitivity to role composition but can smooth out departmental variation. (assets.publishing.service.gov.uk, gov.uk)
  • Measurement approach: GDS relied more heavily on self‑reported survey responses aggregated across departments (generating the 26‑minute/day figure). DBT combined diaries with observed tasks and adjusted for outputs that users didn’t reuse — a more conservative and granular approach that reduced headline time savings. (gov.uk, assets.publishing.service.gov.uk)
  • Task mix matters: Departments heavy on administrative, text‑based workflows (drafting, meeting notes, summarisation) show stronger gains than teams working with complex data, sensitive information or high‑stakes analytical tasks. DBT’s conclusion explicitly names roles with heavy administrative burdens and limited complex data as the ones that reported the most benefit. (assets.publishing.service.gov.uk)

How the DBT findings align with other public‑sector pilots​

DBT’s measured, use‑case focused conclusions mirror a consistent pattern found in other government and institutional pilots: clear wins on templated, communication‑heavy activities; weaker performance and potentially negative impact on complex data tasks; accessibility gains for specific user groups; and consistent worries around hallucinations, governance and environmental cost. This pattern has shown up in multiple independent agency trials and cross‑jurisdictional evaluations.

International contrast: the US federal approach​

By comparison, the US federal procurement and adoption pathway has moved faster in scale and commercial integration. In September 2025 the U.S. General Services Administration (GSA) announced a OneGov agreement with Microsoft that makes Microsoft 365 Copilot available at no cost for up to 12 months to eligible Microsoft G5 federal customers; Microsoft and the GSA projected roughly $3.1 billion in first‑year savings tied to blended discounts across cloud and productivity services. Microsoft leadership framed the offer as a means to accelerate federal AI adoption at scale. (gsa.gov, blogs.microsoft.com)
This contrast explains the diverging tech and policy headlines: the DBT pilot is deliberately conservative and diagnostic, while the US OneGov approach is an expansive commercial push designed to accelerate adoption rapidly, using discounted pricing to reduce procurement friction.

Critical analysis: strengths, limits and risks​

What DBT got right: disciplined, evidence‑first piloting​

  • DBT’s mixed‑method approach and the adjustment rules for novelty/unreused outputs produce a more conservative, arguably more realistic view of immediate operational value.
  • The pilot identified where Copilot provides value rather than assuming blanket uplift; that targeted intelligence is operationally useful for IT teams and policy makers.
  • Highlighting accessibility benefits for neurodiverse staff and non‑native English speakers surfaces an equity dimension often overlooked in ROI discussions. (assets.publishing.service.gov.uk)

Practical risks that demand governance​

  • Hallucinations: confident but incorrect outputs require human verification. In regulated or high‑consequence domains this is a show‑stopper without strict review workflows.
  • Data security and scope creep: embedded assistants can tempt users to surface sensitive data; governance — acceptable use policies, tenant‑level controls and logging/audit — must be enforced.
  • Vendor lock‑in and procurement dynamics: large discounts or one‑off commercial deals (as in the US) can create switching costs that deserve scrutiny from procurement and competition perspectives. (gsa.gov, propublica.org)

Measurement pitfalls and ROI illusions​

  • Self‑reported time savings are useful signals but can overestimate net gains if verification and correction time is ignored. DBT’s adjustments for unused outputs and novel tasks are a practical best practice for any pilot trying to measure net value. (assets.publishing.service.gov.uk)
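A pilot team can make that adjustment explicit in its analysis. The sketch below nets verification time off reported savings and excludes discarded or purely novel outputs; the record structure, field names and exclusion rules are assumptions for illustration, not DBT’s published adjustment method.

```python
# Illustrative net-value calculation; field names and exclusion rules are assumptions.
from dataclasses import dataclass

@dataclass
class TaskRecord:
    reported_mins_saved: float  # self-reported gross saving
    verification_mins: float    # time spent checking and correcting the output
    output_used: bool           # was the output actually reused?
    novel_task: bool            # work done only because the tool made it easy to try

def net_minutes_saved(tasks: list[TaskRecord]) -> float:
    """Gross savings minus verification time, counted only for reused, non-novel outputs."""
    total = 0.0
    for t in tasks:
        if not t.output_used or t.novel_task:
            continue  # discarded or novel work contributes no net saving here
        total += t.reported_mins_saved - t.verification_mins
    return total

tasks = [
    TaskRecord(15, 4, True, False),   # drafting: 11 net minutes
    TaskRecord(10, 2, False, False),  # output discarded: excluded
    TaskRecord(20, 5, True, True),    # novel task: excluded
]
print(net_minutes_saved(tasks))  # 11.0
```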

Environmental impact: a credible but under‑measured concern​

  • Employee concern about the carbon footprint of LLMs is real and, crucially, DBT called for quantified analysis; the pilot does not contain that quantified environmental accounting. Any organisation planning scale‑up should demand vendor transparency on energy use, data‑centre locations, and opportunities for efficiency (e.g., batching, on‑prem options, renewable energy sourcing). (assets.publishing.service.gov.uk)

Recommendations for IT leaders, Windows admins and policy teams​

These pragmatic steps draw on DBT’s lessons and cross‑government practice.

1. Pilot deliberately, measure defensibly​

  • Define a narrow set of target use cases (for DBT: drafting, summarising, meeting notes).
  • Use a combination of observed tasks and diaries; adjust time savings for outputs that are discarded or require significant rework.
  • Track not only time saved but time reallocated (training, higher‑value work) to build a fuller ROI story.

2. Enforce governance and role‑based scoping​

  • Implement acceptable use policies, data‑scoping rules and read‑only connectors where possible.
  • Enforce mandatory human review on outputs used for decision‑making, finance, legal or other high‑risk activities (a minimal gating sketch follows this list).
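What a mandatory review gate looks like will vary by organisation and tooling; the sketch below is one hypothetical way to encode the rule in a workflow script, with the risk categories and function names invented for illustration (this is not a Microsoft 365 tenant setting).

```python
# Hypothetical review gate; categories and behaviour are illustrative only.
HIGH_RISK_CATEGORIES = {"decision-making", "finance", "legal", "external-publication"}

def requires_human_review(task_category: str, contains_personal_data: bool) -> bool:
    """Return True when an AI-assisted output must be signed off before use."""
    return task_category in HIGH_RISK_CATEGORIES or contains_personal_data

def release_output(task_category: str, contains_personal_data: bool, reviewed_by: str | None) -> str:
    """Block release of high-risk outputs until a named reviewer has signed off."""
    if requires_human_review(task_category, contains_personal_data) and not reviewed_by:
        return "BLOCKED: human review required before this output can be used"
    return "RELEASED"

print(release_output("finance", contains_personal_data=False, reviewed_by=None))   # BLOCKED
print(release_output("drafting", contains_personal_data=False, reviewed_by=None))  # RELEASED
```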

3. Train for prompts and verification​

  • Invest in prompting literacy and “trust but verify” practices. DBT found that self‑led training correlated with higher satisfaction; hands‑on, role‑specific guidance accelerates value. (assets.publishing.service.gov.uk)

4. Build measurement into procurement​

  • Contracts should require transparency on environmental metrics, audit logging and mechanisms to export or migrate data to avoid lock‑in.
  • Pilot cost models should include verification overheads and potential remediation cost when outputs are used incorrectly.
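One way to keep verification overhead visible in the business case is to build it directly into the cost model; every figure and rate in the sketch below is a placeholder assumption, not a DBT, Microsoft or GSA number.

```python
# Placeholder pilot cost model; every figure here is an assumption for illustration.
def pilot_net_value(licences: int, licence_cost_per_year: float,
                    gross_hours_saved: float, verification_hours: float,
                    remediation_cost: float, loaded_hourly_rate: float) -> float:
    """Net annual value = value of net hours saved minus licence and remediation costs."""
    net_hours = gross_hours_saved - verification_hours
    return net_hours * loaded_hourly_rate - licences * licence_cost_per_year - remediation_cost

value = pilot_net_value(
    licences=1_000,
    licence_cost_per_year=300.0,   # assumed per-licence price
    gross_hours_saved=20_000.0,    # assumed self-reported savings across the pilot
    verification_hours=6_000.0,    # assumed time spent checking and correcting outputs
    remediation_cost=50_000.0,     # assumed cost of fixing outputs used incorrectly
    loaded_hourly_rate=45.0,       # assumed fully loaded staff cost per hour
)
print(f"Net annual value: £{value:,.0f}")
```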

5. Start with groups that benefit most​

  • Prioritise administrative teams, communications, policy drafting and teams with heavy meeting loads — these are most likely to see early, credible wins. Avoid rolling out to high‑risk analytical teams until the model’s behaviour on domain data is proven.

What to watch next​

  • Vendor features that respond to DBT and cross‑government feedback: Copilot Agents, improved Excel/Power BI integrations, and better guardrails against hallucinations.
  • Public sector procurement and competition scrutiny where large, government‑wide discount deals may affect vendor choice and long‑term costs.
  • The arrival of quantified environmental impact assessments for LLM workloads — currently a gap DBT flagged for action. (assets.publishing.service.gov.uk, gov.uk)

Conclusion​

DBT’s Copilot pilot is a model of measured evaluation: it surfaces real, targeted value while highlighting the verification, governance and environmental work that remains. The headline story is not binary — Copilot is neither a panacea nor worthless. Instead, it is an enabling tool with clear productivity benefit in text‑based and administrative workflows, tangible accessibility gains for some user groups, and demonstrable limits in complex, data‑sensitive tasks.
For IT leaders and Windows admins, the operational takeaway is straightforward: pilot with discipline, measure net value, govern tightly, and train deliberately. Scaling without those steps risks turning modest time‑savings into hidden costs. The DBT evidence set — conservative, methodical and realistic — should be the template for public‑sector and enterprise rollouts that aim to convert AI hype into reliable, repeatable value. (assets.publishing.service.gov.uk, gov.uk, gsa.gov)

Source: TechHQ https://techhq.com/news/uk-government-copilot-trial-department-business-and-trade-minor-gains-some-areas/