The UK Department for Business and Trade’s three‑month pilot of Microsoft 365 Copilot delivered a mixed verdict: users reported high satisfaction and clear wins on routine drafting and meeting summaries, but independent evaluation found only modest, use‑case‑specific time savings and no robust, organisation‑level proof that Copilot substantially improved overall productivity during the trial period. (gov.uk)
Background
The pilot ran from October to December 2024, with the Department for Business and Trade (DBT) distributing roughly 1,000 Copilot licences to staff and collecting quantitative and qualitative data from volunteers and a randomly selected cohort. The evaluation looked at adoption, time savings, task quality, user satisfaction and behavioural impacts, and it sits alongside a larger Government Digital Service (GDS) cross‑government experiment that involved some 20,000 employees across multiple departments during the same period. (gov.uk)
This wave of public‑sector trials was designed to answer pragmatic questions: where do AI assistants produce real value; what are the risks to accuracy, data handling and governance; and how should public organisations measure return on investment (ROI) before committing to large‑scale licensing and procurement? The government’s published material frames the pilots as exploratory evidence‑gathering exercises, not procurement decisions. (gov.uk)
What the evaluations measured
Scope and methodology
- DBT’s pilot evaluated real‑world use across Word, Outlook, Teams, PowerPoint, Excel, OneNote and Loop, combining telemetry with self‑reported diaries, task timing sessions and interviews. A subset of around 300 participants consented to deeper telemetry and diary analysis. (gov.uk)
- GDS’s cross‑government experiment used centralised telemetry and a 7,115‑response survey to measure adoption and self‑reported time savings across 20,000 participating employees in 12 organisations. That study intentionally captured wide variation in roles and workloads to surface aggregate patterns. (gov.uk)
- Evaluators used mixed methods: quantitative telemetry (interactions per user, adoption by app), diary and timed task comparisons (users vs. control non‑users), and qualitative interviews to capture perceptions, workarounds and consequences for accessibility and wellbeing. The short, three‑month window and Christmas period were explicitly flagged as limitations in both reports. (gov.uk)
Key metrics tracked
- Adoption rates and active usage per app (Teams, Word, Outlook, Excel, PowerPoint).
- Self‑reported and observed time savings per task type (drafting, summarising, data analysis).
- Task quality and accuracy comparisons between Copilot outputs and human work.
- Incidence and user perception of hallucinations (confidently wrong outputs).
- User satisfaction and behavioural changes (training uptake, time redirected to other tasks).
- Environmental and value‑for‑money considerations (not fully quantified in the DBT pilot). (gov.uk)
What the DBT evaluation actually found
Adoption and usage patterns
DBT’s telemetry showed modest, concentrated use: the most common interactions were in Word, Outlook and Teams, with Loop and OneNote barely used and Excel/PowerPoint showing intermittent peaks. The DBT monitoring dashboard recorded averages of a relatively small number of Copilot actions per user per day during the pilot window. Some staff used Copilot daily, but most used it weekly. These adoption patterns mirror other public‑sector pilots that report high engagement for communication‑heavy tasks and lower uptake for complex data work. (gov.uk)
Time savings — modest and task specific
- The GDS cross‑government experiment reported a headline figure of 26 minutes saved per user per day on average, derived from user‑reported ranges and survey responses. That number appeared across the central experiment and influenced broader government messaging about potential aggregate gains. (gov.uk)
- DBT’s own observed data painted a more nuanced picture: Copilot users were faster and produced higher‑quality summaries and email drafts in observed sessions, but time savings for email drafting were extremely small, and for some tasks Copilot made users slower and produced lower quality outputs — notably in Excel data analysis and, in certain cases, PowerPoint creation (faster but requiring corrections). The evaluation concludes there was no robust evidence in the DBT pilot that measured time savings translated into sustained productivity gains at the departmental level. (gov.uk)
- The DBT report also cautioned that the evaluation was not designed to definitively prove that time saved became time used productively; the short timeframe and lack of long‑term follow‑up limited claims about ROI. That caveat matters: small, repeatable savings can compound into real value at scale, but proving that requires longitudinal measurement and economic modelling. (gov.uk)
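To make that caveat concrete, here is a back‑of‑envelope sketch in Python of how a headline minutes‑saved figure converts into notional annual value. Only the 26‑minute survey average comes from the published material; the cohort size, working days and hourly staff cost are illustrative assumptions that should be replaced with local figures.

```python
# Back-of-envelope sketch: how a headline minutes-saved figure scales.
# Assumed (NOT from the reports): cohort size, working days, hourly cost.
minutes_saved_per_day = 26      # GDS cross-government survey average (self-reported)
licences = 1_000                # assumed cohort size
working_days_per_year = 220     # assumed
hourly_cost_gbp = 25.0          # assumed fully loaded hourly staff cost

hours_saved_per_year = licences * working_days_per_year * minutes_saved_per_day / 60
notional_value_gbp = hours_saved_per_year * hourly_cost_gbp

print(f"Hours notionally saved per year: {hours_saved_per_year:,.0f}")
print(f"Notional gross value: £{notional_value_gbp:,.0f}")
# Caveat: this treats every self-reported minute as productive time recovered,
# which is precisely the link the DBT evaluation says was not robustly shown.
```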
Quality, hallucinations and trust
DBT participants reported instances of hallucinations — confident but incorrect or fabricated content — and around one in five responding users in the DBT cohort flagged hallucinations in outputs. Evaluators warned that hallucinations force verification steps that can erase time savings and erode trust. Both DBT and GDS emphasised the need for mandatory human review of substantive AI outputs, especially where accuracy has legal, financial or reputational implications. (gov.uk)
Accessibility and unexpected benefits
Across pilots, staff with accessibility needs or those for whom English is a second language reported meaningful benefits from automated meeting transcriptions and summaries. Qualitative interviews in DBT found some users redirecting saved time to training, wellbeing or higher‑value work — but that behavioural change was inconsistent and not clearly attributable to improved productivity overall. (gov.uk)
Why headlines diverge: “no discernible gain” vs “26 minutes a day”
Media coverage has varied. Some outlets emphasised DBT’s cautionary conclusion — that the department did not find robust evidence of productivity improvement — while others highlighted the GDS cross‑government headline of 26 minutes saved per day and high user satisfaction rates. Both statements can be true simultaneously because they refer to different analyses and measurement frames. Key reasons for the divergence:
- Different scopes: DBT’s evaluation was a targeted departmental pilot with ~1,000 licences and 300 telemetry‑consenting participants; GDS’s experiment aggregated data from 20,000 licences and 7,115 survey responses across many departments. Aggregating across organisations can smooth departmental variance and lift averages. (gov.uk)
- Different metrics: the 26‑minute figure is a self‑reported average from the cross‑government survey; DBT relied on a mixture of observed timed tasks and diaries and explicitly cautioned against equating self‑reported time savings to verifiable productivity wins. Self‑reporting tends to inflate perceived savings relative to measured task timings; a toy numeric illustration of that gap follows this list. (gov.uk)
- Task mix matters: Copilot shows stronger benefits on templated writing, summarisation and meeting notes but weaker or negative effects on complex data manipulation and nuanced analytical work. A department with many data‑heavy roles may see lower net gains than an organisation dominated by letter writing and meeting management. (gov.uk)
- Trial design and timeframe: three months — truncated by the festive season — is short. Habit formation, refinement of prompts, governance rollout and training take time. Early pilots commonly report that value accrues only after targeted training and iterative change management. (gov.uk)
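As a toy illustration of that self‑reporting gap, the sketch below compares hypothetical survey answers with hypothetical timed‑task measurements for the same users. Every number is invented for illustration; none comes from the DBT or GDS reports.

```python
# Toy illustration (all numbers invented): self-reported vs observed savings.
self_reported_min = [30, 25, 40, 20, 35]   # minutes/day users *said* they saved
observed_min      = [12, 8, 15, 5, 10]     # minutes/day measured in timed sessions

mean_reported = sum(self_reported_min) / len(self_reported_min)
mean_observed = sum(observed_min) / len(observed_min)

print(f"Mean self-reported saving: {mean_reported:.1f} min/day")
print(f"Mean observed saving:      {mean_observed:.1f} min/day")
print(f"Inflation factor:          {mean_reported / mean_observed:.1f}x")
```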
Cost, procurement and value‑for‑money considerations
M365 Copilot licences add a measurable per‑user cost. Public reporting and media coverage have pointed to UK per‑user prices ranging from a few pounds per month for basic business plans to premium Copilot tiers costing in the high teens of pounds or more per month. DBT was explicit that the pilot did not include a full financial cost‑benefit analysis; GDS emphasised the need to model ROI locally rather than apply one‑size‑fits‑all heuristics. (gov.uk)
Key procurement considerations for IT and finance teams:
- Licence arithmetic: calculate realistic adoption and realised time‑saving cohorts (not just opt‑in users) and map savings to salary bands to test the break‑even threshold; a break‑even sketch follows this list.
- Training & governance overhead: factor in change management, prompt engineering training, AI‑familiarisation, and the human time needed to verify outputs.
- Hidden remediation costs: if Copilot reduces draft times but increases correction time for certain tasks, net productivity can be negative unless workflows are redesigned.
- Vendor transparency: require contractual clarity on data handling, model training (whether tenant data is used to improve vendor models), and environmental metrics where relevant. DBT and GDS both requested deeper vendor disclosures for procurement decisions. (gov.uk)
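To support the licence‑arithmetic point above, here is a minimal break‑even sketch in Python. The licence price, adoption rate, verification overhead and pay‑band hourly costs are all assumptions for illustration, not figures from the DBT or GDS reports; substitute your organisation’s own pay scales and negotiated pricing.

```python
# Break-even sketch (all inputs assumed, not taken from the reports).
licence_cost_per_user_month_gbp = 23.0  # assumed premium-tier licence price
adoption_rate = 0.60                    # assumed share of licensed users who are active
verification_overhead_min = 5           # assumed daily minutes spent checking outputs
working_days_per_month = 21

def break_even_minutes_per_day(hourly_cost_gbp: float) -> float:
    """Gross minutes/day an active user must save before the licence pays for itself."""
    # Licences are paid for everyone, but savings accrue only to active users,
    # so spread the cost of idle licences across the adopters.
    effective_monthly_cost = licence_cost_per_user_month_gbp / adoption_rate
    cost_per_working_day = effective_monthly_cost / working_days_per_month
    minutes_to_cover_cost = cost_per_working_day / hourly_cost_gbp * 60
    return minutes_to_cover_cost + verification_overhead_min

for band, hourly_gbp in [("Band A", 17.0), ("Band B", 25.0), ("Band C", 34.0)]:
    print(f"{band} (assumed £{hourly_gbp:.0f}/h): "
          f"break-even ≈ {break_even_minutes_per_day(hourly_gbp):.1f} min/day")
```

Note the adoption‑rate divisor: idle licences still cost money, so the break‑even minutes required of each active user rise as adoption falls.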
Risks and technical limits
Hallucinations and accuracy
Large language models generate plausible text but can invent facts. The DBT trial documented hallucinations that required human oversight and flagged them as a real operational risk for government use. Where outputs are reused without rigorous checking, downstream decisions can be compromised. Robust governance and audit trails are mandatory. (gov.uk)
Data sovereignty and privacy
Public bodies are rightly cautious about permitting external models access to sensitive mailboxes, calendars and corporate files. The DBT evaluation and cross‑government materials emphasised strict data‑scope controls in the pilot and recommended procurement clauses specifying what data is used and how it is processed. These concerns also influence feature availability: heavily restricted deployments can degrade model performance relative to less constrained consumer experiences. (gov.uk)
Environmental footprint
Trial participants raised environmental concerns about the carbon intensity of large language models. DBT noted these concerns but did not quantify compute or emissions attributable to the pilot; the reports call for vendors to provide clear lifecycle and energy‑use data to support public procurement. Until measured, environmental claims remain qualitative and should be treated cautiously. (gov.uk)
Human factors and governance
Managers’ attitudes significantly influenced adoption in DBT interviews — technology adoption remains a social process. Where line managers embraced Copilot and modelled safe use, adoption rose; where managers were sceptical, uptake lagged. Training, use‑case curation and enforcement of verification workflows are governance levers that materially shape outcomes. (gov.uk)
Practical guidance for IT leaders and Windows/M365 admins
- Pilot deliberately: choose a narrow set of high‑volume, low‑risk tasks (meeting notes, templated emails, document summaries) for initial pilots. DBT and GDS both recommend targeted pilots before wholesale rollouts. (gov.uk)
- Measure what matters: combine telemetry with timed task observations and economic modelling that converts minutes saved into financial terms for the relevant pay bands. Avoid relying solely on survey self‑reports; a conversion sketch follows this list. (gov.uk)
- Insist on vendor transparency: require contractual assurances about data use, model training, retention, and environmental metrics. (gov.uk)
- Invest in governance and training: provide hands‑on prompt engineering workshops, clear acceptable‑use policies, and mandatory human sign‑off workflows for substantive outputs. (gov.uk)
- Segment rollout by role: deploy Copilot where it demonstrably produces net time savings and where verification overhead is low; delay or restrict use in sensitive data‑analysis roles until model behaviour is validated. (gov.uk)
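As a sketch of the “measure what matters” advice, the following Python snippet converts observed timed‑task deltas (Copilot users versus matched non‑users) into annual financial terms per pay band. The task timings, daily task volume and hourly rates are hypothetical placeholders, not data from either evaluation.

```python
# Sketch: convert observed timed-task deltas into annual value per pay band.
# Timings, volumes and rates below are hypothetical placeholders.
from statistics import mean

copilot_task_min = [14, 11, 16, 12, 13, 15]   # timed drafting task, Copilot users
control_task_min = [19, 17, 22, 18, 20, 21]   # same task, matched non-users

saving_per_task_min = mean(control_task_min) - mean(copilot_task_min)
tasks_per_day = 4          # assumed daily volume for this task type
working_days = 220         # assumed working days per year

for band, hourly_gbp in [("Band A", 17.0), ("Band B", 25.0)]:
    annual_value = saving_per_task_min * tasks_per_day * working_days / 60 * hourly_gbp
    print(f"{band}: {saving_per_task_min:.1f} min/task observed saving "
          f"≈ £{annual_value:,.0f}/year per user")
```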
Critical analysis: strengths, limitations and systemic implications
Strengths
- Copilot consistently helps with repetitive, communication‑heavy tasks: meeting transcriptions, summarisation, and email drafting see the clearest, fastest benefits. This supports staff with accessibility needs and improves inclusivity for some users. Both DBT and the cross‑government experiment found strong user satisfaction on these fronts. (gov.uk)
- When used with discipline (careful prompts, human verification), Copilot can reduce cognitive friction and speed routine steps in knowledge work. That cumulative effect can be meaningful when scaled across large cohorts and sustained over time. (gov.uk)
Limitations and risks
- Measured productivity gains are uneven: for data‑heavy analytic tasks Copilot sometimes slowed users and reduced quality, increasing correction time and risk. This directly contradicts marketing claims of broad productivity uplift and demonstrates the importance of role profiling before procurement. (gov.uk)
- Self‑reported time savings inflate perceived benefit. The DBT pilot’s mixed findings underscore the need to complement surveys with observed, timed tasks and economic modelling. (gov.uk)
- Hallucinations and governance gaps make Copilot unsuitable as an autonomous decision tool in regulated or high‑consequence contexts. Until model reliability and vendor transparency are demonstrably improved, outputs must be treated as drafts requiring human oversight. (gov.uk)
- Environmental and lifecycle costs are under‑reported in current vendor and trial disclosures. Procurement decisions that ignore compute footprint and data‑centre sourcing are incomplete. (gov.uk)
Reconciling the headlines: what editors and managers should read into this
Both the DBT evaluation’s cautious conclusion and the GDS cross‑government 26‑minute headline are valid; they simply answer related but different questions. DBT asked: did Copilot demonstrably increase departmental productivity in a verifiable, measurable way during a short pilot? The answer was: not robustly. GDS asked: across many departments and thousands of users, what are the self‑reported time‑saving patterns and adoption rates? The answer was: many users reported noticeable time savings and strong satisfaction, averaged at about 26 minutes per day. Both findings are useful inputs to procurement decisions — neither is a conclusive verdict for all organisations. (gov.uk)
The practical takeaway for IT decision‑makers: treat Copilot as a targeted productivity tool, not a turnkey replacement for skilled work. Rigorous pilots, role‑based rollouts, transparent vendor contracts and sustained evaluation are required before scaling enterprise‑wide licences.
Final assessment and next steps
The DBT pilot is a measured, evidence‑based contribution to a fast‑moving debate: M365 Copilot can save time on specific, high‑volume tasks and offers clear benefits for accessibility and routine drafting, but the technology is not yet a universal productivity multiplier. Organisations should:
- Run small, tightly scoped pilots with robust measurement plans.
- Prioritise training, governance and contractual transparency.
- Require vendors to disclose data handling and environmental metrics.
- Avoid treating self‑reported minutes as definitive ROI without matched observational data.
A note on verification: the summary above is drawn from the Department for Business and Trade’s published pilot evaluation and the Government Digital Service’s cross‑government findings, supplemented by contemporary reporting in the press. Where reports differ (for example, in headline minutes saved), the divergence is explained by differences in sample sizes, measurement approaches and the distinction between self‑reported versus observed task timings. Readers making procurement or governance decisions should consult the full DBT and GDS reports and model ROI using local staff pay scales and role distributions rather than relying on any single headline. (gov.uk, theregister.com)
Source: theregister.com M365 Copilot fails to up productivity in UK government trial