The UK Department for Business and Trade’s three‑month pilot of Microsoft 365 Copilot delivered a mixed verdict: users reported high satisfaction and clear wins on routine drafting and meeting summaries, but independent evaluation found only modest, use‑case‑specific time savings and no robust, organisation‑level proof that Copilot substantially improved overall productivity during the trial period. (gov.uk)

Background​

The pilot ran from October to December 2024, with the Department for Business and Trade (DBT) distributing roughly 1,000 Copilot licences to staff and collecting quantitative and qualitative data from volunteers and a randomly selected cohort. The evaluation looked at adoption, time savings, task quality, user satisfaction and behavioural impacts, and it sits alongside a larger Government Digital Service (GDS) cross‑government experiment that involved some 20,000 employees across multiple departments during the same period. (gov.uk)
This wave of public‑sector trials was designed to answer pragmatic questions: where do AI assistants produce real value; what are the risks to accuracy, data handling and governance; and how should public organisations measure return on investment (ROI) before committing to large‑scale licensing and procurement? The government’s published material frames the pilots as exploratory evidence‑gathering exercises, not procurement decisions. (gov.uk)

What the evaluations measured​

Scope and methodology​

  • DBT’s pilot evaluated real‑world use across Word, Outlook, Teams, PowerPoint, Excel, OneNote and Loop, combining telemetry with self‑reported diaries, task timing sessions and interviews. A subset of around 300 participants consented to deeper telemetry and diary analysis. (gov.uk)
  • GDS’s cross‑government experiment used centralised telemetry and a 7,115‑response survey to measure adoption and self‑reported time savings across 20,000 participating employees in 12 organisations. That study intentionally captured wide variation in roles and workloads to surface aggregate patterns. (gov.uk)
  • Evaluators used mixed methods: quantitative telemetry (interactions per user, adoption by app), diary and timed task comparisons (users vs. control non‑users), and qualitative interviews to capture perceptions, workarounds and consequences for accessibility and wellbeing. The short, three‑month window and Christmas period were explicitly flagged as limitations in both reports. (gov.uk)

Key metrics tracked​

  • Adoption rates and active usage per app (Teams, Word, Outlook, Excel, PowerPoint).
  • Self‑reported and observed time savings per task type (drafting, summarising, data analysis).
  • Task quality and accuracy comparisons between Copilot outputs and human work.
  • Incidence and user perception of hallucinations (confidently wrong outputs).
  • User satisfaction and behavioural changes (training uptake, time redirected to other tasks).
  • Environmental and value‑for‑money considerations (not fully quantified in the DBT pilot). (gov.uk)

What the DBT evaluation actually found​

Adoption and usage patterns​

DBT’s telemetry showed modest, concentrated use: the most common interactions were in Word, Outlook and Teams, with Loop and OneNote barely used and Excel/PowerPoint showing intermittent peaks. The DBT monitoring dashboard recorded an average of roughly 1.14 Copilot actions per user per working day during the pilot window. Some staff used Copilot daily, but most used it weekly. These adoption patterns mirror other public‑sector pilots that report high engagement for communication‑heavy tasks and lower uptake for complex data work. (gov.uk)

Time savings — modest and task specific​

  • The GDS cross‑government experiment reported a headline figure of 26 minutes saved per user per day on average, derived from user‑reported ranges and survey responses. That number appeared across the central experiment and influenced broader government messaging about potential aggregate gains. (gov.uk)
  • DBT’s own observed data painted a more nuanced picture: Copilot users were faster and produced higher‑quality summaries and email drafts in observed sessions, but time savings for email drafting were extremely small, and for some tasks Copilot made users slower and produced lower quality outputs — notably in Excel data analysis and, in certain cases, PowerPoint creation (faster but requiring corrections). The evaluation concludes there was no robust evidence in the DBT pilot that measured time savings translated into sustained productivity gains at the departmental level. (gov.uk)
  • The DBT report also cautioned that the evaluation was not designed to definitively prove that time saved became time used productively; the short timeframe and lack of long‑term follow‑up limited claims about ROI. That caveat matters: small, repeatable savings can compound into real value at scale, but proving that requires longitudinal measurement and economic modelling. (gov.uk)

Quality, hallucinations and trust​

DBT participants reported instances of hallucinations — confident but incorrect or fabricated content — and around one in five responding users in the DBT cohort flagged hallucinations in outputs. Evaluators warned that hallucinations force verification steps that can erase time savings and erode trust. Both DBT and GDS emphasised the need for mandatory human review of substantive AI outputs, especially where accuracy has legal, financial or reputational implications. (gov.uk)

Accessibility and unexpected benefits​

Across pilots, staff with accessibility needs or those for whom English is a second language reported meaningful benefits from automated meeting transcriptions and summaries. Qualitative interviews in DBT found some users redirecting saved time to training, wellbeing or higher‑value work — but that behavioural change was inconsistent and not clearly attributable to improved productivity overall. (gov.uk)

Why headlines diverge: “no discernible gain” vs “26 minutes a day”​

Media coverage has varied. Some outlets emphasised DBT’s cautionary conclusion — that the department did not find robust evidence of productivity improvement — while others highlighted the GDS cross‑government headline of 26 minutes saved per day and high user satisfaction rates. Both statements can be true simultaneously because they refer to different analyses and measurement frames. Key reasons for the divergence:
  • Different scopes: DBT’s evaluation was a targeted departmental pilot with ~1,000 licences and 300 telemetry‑consenting participants; GDS’s experiment aggregated data from 20,000 licences and 7,115 survey responses across many departments. Aggregating across organisations can smooth departmental variance and lift averages. (gov.uk)
  • Different metrics: the 26‑minute figure is a self‑reported average from the cross‑government survey; DBT relied on a mixture of observed timed tasks and diaries and explicitly cautioned against equating self‑reported time savings to verifiable productivity wins. Self‑reporting tends to inflate perceived savings relative to measured task timings. (gov.uk)
  • Task mix matters: Copilot shows stronger benefits on templated writing, summarisation and meeting notes but weaker or negative effects on complex data manipulation and nuanced analytical work. A department with many data‑heavy roles may see lower net gains than an organisation dominated by letter writing and meeting management. (gov.uk)
  • Trial design and timeframe: three months — truncated by the festive season — is short. Habit formation, refinement of prompts, governance rollout and training take time. Early pilots commonly report that value accrues only after targeted training and iterative change management. (gov.uk)

Cost, procurement and value‑for‑money considerations​

M365 Copilot licences add a measurable per‑user cost. Public reporting and media coverage have pointed to per‑user prices in the UK ranging from a few pounds a month for basic business plans to premium Copilot tiers in the teens of pounds or more per month. DBT was clear that the pilot did not include a full financial cost‑benefit analysis; GDS emphasised the need to model ROI locally rather than apply one‑size‑fits‑all heuristics. (gov.uk)
Key procurement considerations for IT and finance teams:
  • Licence arithmetic: calculate realistic adoption and realised time‑saving cohorts (not just opt‑in users) and map savings to salary bands to test the break‑even threshold (see the sketch after this list).
  • Training & governance overhead: factor in change management, prompt engineering training, AI‑familiarisation, and the human time needed to verify outputs.
  • Hidden remediation costs: if Copilot reduces draft times but increases correction time for certain tasks, net productivity can be negative unless workflows are redesigned.
  • Vendor transparency: require contractual clarity on data handling, model training (whether tenant data is used to improve vendor models), and environmental metrics where relevant. DBT and GDS both requested deeper vendor disclosures for procurement decisions. (gov.uk)
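To make the licence‑arithmetic point above concrete, the sketch below shows one way a finance team might test a break‑even threshold. It is a minimal illustration only: the licence cost, working days, pay rate and adoption figures are placeholder assumptions, not numbers drawn from the DBT or GDS reports.

```python
# Hypothetical break-even sketch. All figures below are placeholder
# assumptions, not values drawn from the DBT or GDS evaluations.

LICENCE_COST_PER_USER_PER_YEAR = 300.0   # assumed annual licence cost (GBP)
WORKING_DAYS_PER_YEAR = 220              # assumed working days after leave


def annual_value_of_savings(minutes_saved_per_day: float,
                            hourly_rate: float,
                            realised_adoption: float) -> float:
    """Convert daily minutes saved into an annual financial value.

    realised_adoption discounts for staff who hold a licence but rarely
    realise savings (the realised time-saving cohort, not just opt-ins).
    """
    hours_per_year = minutes_saved_per_day / 60 * WORKING_DAYS_PER_YEAR
    return hours_per_year * hourly_rate * realised_adoption


def breaks_even(minutes_saved_per_day: float,
                hourly_rate: float,
                realised_adoption: float) -> bool:
    """True if the modelled annual value covers the annual licence cost."""
    value = annual_value_of_savings(minutes_saved_per_day, hourly_rate,
                                    realised_adoption)
    return value >= LICENCE_COST_PER_USER_PER_YEAR


if __name__ == "__main__":
    # Example: 10 genuinely saved minutes per day, a GBP 22/hour pay band,
    # and 60% of licensed staff actually realising savings.
    print(breaks_even(10, 22.0, 0.6))   # True under these assumptions
```

Running the same check per role segment and pay band, rather than on a single organisation‑wide average, is what the evaluations mean by mapping savings to salary bands.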

Risks and technical limits​

Hallucinations and accuracy​

Large language models generate plausible text but can invent facts. The DBT trial documented hallucinations that required human oversight and flagged them as a real operational risk for government use. Where outputs are reused without rigorous checking, downstream decisions can be compromised. Robust governance and audit trails are mandatory. (gov.uk)

Data sovereignty and privacy​

Public bodies are rightly cautious about permitting external models access to sensitive mailboxes, calendars and corporate files. The DBT evaluation and cross‑government materials emphasised strict data‑scope controls in the pilot and recommended procurement clauses specifying what data is used and how it is processed. These concerns also influence feature availability: heavily restricted deployments can degrade model performance relative to less constrained consumer experiences. (gov.uk)

Environmental footprint​

Trial participants raised environmental concerns about the carbon intensity of large language models. DBT noted these concerns but did not quantify compute or emissions attributable to the pilot; the reports call for vendors to provide clear lifecycle and energy‑use data to support public procurement. Until measured, environmental claims remain qualitative and should be treated cautiously. (gov.uk)

Human factors and governance​

Managers’ attitudes significantly influenced adoption in DBT interviews — technology adoption remains a social process. Where line managers embraced Copilot and modelled safe use, adoption rose; where managers were sceptical, uptake lagged. Training, use‑case curation and enforcement of verification workflows are governance levers that materially shape outcomes. (gov.uk)

Practical guidance for IT leaders and Windows/M365 admins​

  • Pilot deliberately: choose a narrow set of high‑volume, low‑risk tasks (meeting notes, templated emails, document summaries) for initial pilots. DBT and GDS both recommend targeted pilots before wholesale rollouts. (gov.uk)
  • Measure what matters: combine telemetry with timed task observations and economic modelling that converts minutes saved into financial terms for the relevant pay bands. Avoid relying solely on survey self‑reports. (gov.uk)
  • Insist on vendor transparency: require contractual assurances about data use, model training, retention, and environmental metrics. (gov.uk)
  • Invest in governance and training: provide hands‑on prompt engineering workshops, clear acceptable‑use policies, and mandatory human sign‑off workflows for substantive outputs. (gov.uk)
  • Segment rollout by role: deploy Copilot where it demonstrably produces net time savings and where verification overhead is low; delay or restrict use in sensitive data‑analysis roles until model behaviour is validated. (gov.uk)

Critical analysis: strengths, limitations and systemic implications​

Strengths​

  • Copilot consistently helps with repetitive, communication‑heavy tasks: meeting transcriptions, summarisation, and email drafting see the clearest, fastest benefits. This supports staff with accessibility needs and improves inclusivity for some users. Both DBT and the cross‑government experiment found strong user satisfaction on these fronts. (gov.uk)
  • When used with discipline (careful prompts, human verification), Copilot can reduce cognitive friction and speed routine steps in knowledge work. That cumulative effect can be meaningful when scaled across large cohorts and sustained over time. (gov.uk)

Limitations and risks​

  • Measured productivity gains are uneven: for data‑heavy analytic tasks Copilot sometimes slowed users and reduced quality, increasing correction time and risk. This directly contradicts marketing claims of broad productivity uplift and demonstrates the importance of role profiling before procurement. (gov.uk)
  • Self‑reported time savings inflate perceived benefit. The DBT pilot’s mixed findings underscore the need to complement surveys with observed, timed tasks and economic modelling. (gov.uk)
  • Hallucinations and governance gaps make Copilot unsuitable as an autonomous decision tool in regulated or high‑consequence contexts. Until model reliability and vendor transparency are demonstrably improved, outputs must be treated as drafts requiring human oversight. (gov.uk)
  • Environmental and lifecycle costs are under‑reported in current vendor and trial disclosures. Procurement decisions that ignore compute footprint and data‑centre sourcing are incomplete. (gov.uk)

Reconciling the headlines: what editors and managers should read into this​

Both the DBT evaluation’s cautious conclusion and the GDS cross‑government 26‑minute headline are valid; they simply answer related but different questions. DBT asked: did Copilot demonstrably increase departmental productivity in a verifiable, measurable way during a short pilot? The answer was: not robustly. GDS asked: across many departments and thousands of users, what are the self‑reported time‑saving patterns and adoption rates? The answer was: many users reported noticeable time savings and strong satisfaction, averaging about 26 minutes per day. Both findings are useful inputs to procurement decisions — neither is a conclusive verdict for all organisations. (gov.uk)
The practical takeaway for IT decision‑makers: treat Copilot as a targeted productivity tool, not a turnkey replacement for skilled work. Rigorous pilots, role‑based rollouts, transparent vendor contracts and sustained evaluation are required before scaling enterprise‑wide licences.

Final assessment and next steps​

The DBT pilot is a measured, evidence‑based contribution to a fast‑moving debate: M365 Copilot can save time on specific, high‑volume tasks and offers clear benefits for accessibility and routine drafting, but the technology is not yet a universal productivity multiplier. Organisations should:
  • Run small, tightly scoped pilots with robust measurement plans.
  • Prioritise training, governance and contractual transparency.
  • Require vendors to disclose data handling and environmental metrics.
  • Avoid treating self‑reported minutes as definitive ROI without matched observational data.
Public‑sector experiences from DBT and the government‑wide experiment illustrate a central truth of early enterprise AI adoption: potential is real, but realising it requires discipline, realistic expectations and governance that matches the stakes of the work being assisted. (gov.uk, theregister.com)

A note on verification: the summary above is drawn from the Department for Business and Trade’s published pilot evaluation and the Government Digital Service’s cross‑government findings, supplemented by contemporary reporting in the press. Where reports differ (for example, in headline minutes saved), the divergence is explained by differences in sample sizes, measurement approaches and the distinction between self‑reported versus observed task timings. Readers making procurement or governance decisions should consult the full DBT and GDS reports and model ROI using local staff pay scales and role distributions rather than relying on any single headline. (gov.uk, theregister.com)

Source: theregister.com M365 Copilot fails to up productivity in UK government trial
 
The UK Department for Business and Trade’s three‑month pilot of Microsoft 365 Copilot delivered a familiar but important paradox: users reported real and concentrated time savings—especially on written work and meeting summaries—but the evaluation could not find robust evidence that those measured time savings translated into improved departmental productivity during the trial period. (gov.uk)

Background / Overview​

The pilot ran from October to December 2024 and provided around 1,000 M365 Copilot licences to DBT staff, mixing volunteers (about 70%) with a randomly selected cohort (about 30%) to improve representativeness. The evaluation combined telemetry, diary studies, timed observed tasks and qualitative interviews to measure use cases, time savings, output quality, user satisfaction and behavioural effects. The department published its evaluation on 28 August 2025. (gov.uk)
This departmental experiment sat alongside a larger Government Digital Service (GDS) cross‑government experiment involving roughly 20,000 participants, which produced a headline figure of 26 minutes saved per user per day. That larger, cross‑organisational experiment relied heavily on self‑reported survey data and produced a different public narrative about Copilot’s potential. The coexistence of both findings—modest per‑task savings in DBT and a larger self‑reported daily average across government—underscores how measurement choices and scope shape outcomes. (gov.uk) (gov.uk)

What DBT actually measured​

Methodology in brief​

DBT’s evaluation used a mixed‑methods design:
  • Telemetry from Microsoft usage dashboards to track interactions and app adoption.
  • A diary study completed by about 32% of participants that logged tasks, perceived time savings and edits made to Copilot outputs.
  • Observed timed tasks comparing Copilot users with control colleagues for a subset of workflows.
  • 19 qualitative interviews, covering both pilot participants and a control group, to surface perceptions, confidence and behavioural changes.
The evaluation deliberately applied conservative adjustments—excluding outputs that users did not adopt and subtracting “novel” tasks (work only performed because Copilot made it available)—to avoid overstating net time saved. That conservative stance is central to how DBT framed its conclusions.
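To illustrate what those conservative adjustments mean in practice, here is a minimal, hypothetical sketch of how diary entries could be aggregated into net time saved. The field names, exclusion rules and numbers are assumptions for illustration; this is not DBT’s actual analysis code.

```python
# Illustrative only: applies DBT-style conservative adjustments to
# hypothetical diary entries. Field names and rules are assumptions,
# not DBT's actual analysis code.
from dataclasses import dataclass


@dataclass
class DiaryEntry:
    minutes_saved: float                # user-estimated time saved on the task
    output_adopted: bool                # was the Copilot output actually used?
    novel_task: bool                    # task only done because Copilot made it easy
    verification_minutes: float = 0.0   # time spent checking or correcting output


def net_minutes_saved(entries: list[DiaryEntry]) -> float:
    """Sum reported savings, excluding unadopted outputs and novel tasks,
    and subtracting time spent verifying or correcting outputs."""
    total = 0.0
    for e in entries:
        if not e.output_adopted or e.novel_task:
            continue  # conservative: count no saving for these entries
        total += e.minutes_saved - e.verification_minutes
    return total


entries = [
    DiaryEntry(15, output_adopted=True, novel_task=False, verification_minutes=4),
    DiaryEntry(30, output_adopted=False, novel_task=False),   # output discarded
    DiaryEntry(20, output_adopted=True, novel_task=True),     # novel task
]
print(net_minutes_saved(entries))  # 11.0 net minutes under these assumptions
```

The point of the sketch is the exclusion logic: unadopted outputs and newly created work contribute nothing to the net figure, which is one reason conservative estimates land below raw self‑reports.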

Key usage metrics reported​

DBT’s telemetry and diary findings painted a usage profile concentrated in communication apps and text tasks:
  • In the diary sample, the most popular single use was transcribing or summarising meetings (234 diary cases), followed by writing an email (167), summarising written communications (153) and asking questions (124). Many respondents said they edited grammar or tone, or corrected inaccurate facts, in AI outputs.
  • Microsoft telemetry showed participants averaged approximately 1.14 actions using Copilot per working day during the pilot.
These figures show modest day‑to‑day use concentrated in predictable “sweet spot” activities rather than broad continuous reliance. They also highlight the difference between occasional, high‑impact tasks and high‑frequency trivial interactions.

What DBT found: time savings, quality, and the productivity gap​

Time savings: concentrated, not universal​

DBT reported small time savings across most use cases, with written tasks—drafting, summarising and reviewing—delivering the largest measurable per‑task time savings. In some observed sessions, Copilot users were faster and produced higher‑quality summaries and email drafts.
At the same time, DBT found that certain tasks took longer when Copilot was used. Scheduling and image generation were specifically highlighted as areas where participants spent more time than they would have without the tool. The causes were typically low output quality from Copilot or users taking on tasks they previously would not have attempted because the tool made them easy to try. The evaluation explicitly adjusted calculations for these “novel” tasks when estimating net time saved.

Quality, hallucinations and human verification​

The report recorded instances of hallucinations—confident but incorrect or fabricated content—in Copilot output. Around one in five responding users flagged hallucinations during the pilot. These inaccuracies force review and correction steps that can erase apparent time savings and erode trust in model outputs, especially in contexts with legal, financial or reputational risk. DBT and cross‑government guidance emphasised human‑in‑the‑loop review for substantive outputs. (gov.uk)

Productivity: perceived gains vs evidence of organisational change​

Crucially, DBT concluded its evaluation did not find evidence that observed time savings led to improved departmental productivity within the three‑month pilot window. Control group participants did not report noticeable changes in colleagues’ output, and DBT could not identify a robust, department‑level productivity uplift attributable to Copilot during the trial. In short: users often felt they saved time—and many did on discrete tasks—but that did not automatically translate into measurable productivity increases for the department as a whole over the pilot period.
DBT flagged the short trial length (three months, including the holiday season) and lack of long‑term follow‑up as important constraints: value from habit formation, improved prompts, workflow redesign and governance often accrues slowly and may not show in early pilots.

Where Copilot helps: the “sweet spots”​

Several recurrent use cases surfaced across DBT’s pilot and the larger cross‑government experiment:
  • Drafting and rewriting documents and emails — Copilot is fastest and most reliable when producing first drafts of templated or formulaic text.
  • Meeting transcription and summarisation — automated notes and concise action‑items are high‑frequency, repeatable wins.
  • Search and information triage — Copilot helps surface context and locate documents faster, reducing time spent hunting for content.
These are precisely the areas where LLMs can be treated as drafting accelerants: generate a base structure, then the human refines and verifies. DBT’s diary respondents repeatedly described editing grammar, tone and correcting factual errors—behaviours consistent with this collaborative workflow.

Where Copilot struggles: complexity, data, and downstream risk​

  • Data‑heavy analytical work (Excel, complex models): DBT observed cases where Copilot slowed users or produced lower‑quality analysis compared with manual work. In such tasks, model outputs often lacked contextual nuance or introduced errors that required substantial correction.
  • Creative slide decks and image generation: participants trying to generate polished PowerPoint decks or images often found the process slower and the results in need of more rework. Copilot could produce a draft but not a finished, presentation‑ready result.
  • Hallucination risk: factual fabrications remain an operational hazard that mandates verification workflows and governance for any output that informs decisions.
These failure modes matter because verification overhead can negate time savings and introduce operational risk if unchecked.

User experience and inclusion effects​

DBT reported strong user satisfaction: more than two‑thirds of participants said they were “satisfied” or “highly satisfied” with Copilot, and the pilot recorded a positive Net Promoter Score, a good result for a new digital service. Satisfaction concentrated among users who focused on written tasks and meeting summaries; those trying to use Copilot for scheduling, image generation or advanced data work were less satisfied. Training also mattered: participants who completed self‑paced training reported higher satisfaction than those who attended formal training sessions.
The pilot also surfaced inclusion benefits: neurodiverse staff and non‑native English speakers reported material improvements in accessibility, comprehension and confidence thanks to meeting transcriptions and simplified summaries. These social and accessibility outcomes are important organisational benefits that extend beyond pure productivity math.

Cost, procurement and environmental considerations​

DBT did not perform a full value‑for‑money or environmental life‑cycle assessment as part of the pilot, and specifically recommended further evaluation on both fronts before any wide rollout. Participants raised concerns about the environmental impact of generative AI—concerns which remained qualitative in DBT’s analysis because the pilot lacked vendor‑level emissions attribution or a dedicated energy study. DBT advised that procurement frameworks should request vendor transparency on data handling, training usage of tenant content and environmental metrics before scaling.
For decision‑makers, the headline procurement points are straightforward:
  • Model the break‑even threshold using realistic adoption and realised‑savings cohorts rather than optimistic opt‑in figures.
  • Factor in training, governance, DLP and the human time required to verify outputs.
  • Require contractual clarity on whether tenant data is used to train vendor models and insist on environmental disclosures where relevant.

Cross‑government vs departmental results: why headlines diverge​

The apparent contradiction between DBT’s cautious departmental conclusion and the GDS cross‑government headline (26 minutes a day saved) is explainable:
  • Scale and sample composition: GDS aggregated data from roughly 20,000 licensees across many departments, smoothing role‑by‑role variance and lifting averages. DBT’s 1,000‑user trial was smaller and role‑specific. (gov.uk)
  • Metric choice: GDS’s 26‑minute figure came from self‑reported survey responses; DBT used diaries, observed tasks and conservative adjustments to exclude novel or unused outputs. Self‑report can overstate time‑saving perceptions relative to timed observational measures. (gov.uk)
  • Task mix: departments dominated by communication‑heavy, templated work will capture Copilot’s strengths more readily than those with data‑intensive workflows. DBT’s pilot composition affected its departmental outcome.
Both outcomes are valid within their measurement frames; the policy implication is that organisations should design pilots and ROI models that reflect local role mixes and realistic adoption patterns.

Practical guidance for IT leaders and Windows administrators​

DBT’s evaluation and cross‑government evidence converge on practical, actionable steps for organisations considering Copilot adoption:
  • Pilot deliberately and measure conservatively: pick a small set (2–4) of high‑frequency, templated tasks for an initial rollout and measure both time‑to‑complete and verification overhead (a minimal measurement sketch follows below).
  • Design governance from day one: mandate human review for substantive outputs, set DLP and permission guardrails, and maintain audit trails for AI‑assisted decisions.
  • Invest in training and prompt libraries: real gains correlate with user confidence and familiarity—self‑paced training had a measurable effect on satisfaction in DBT’s pilot.
  • Model ROI locally: use realistic adoption rates and role‑based time‑savings to calculate break‑even points. Avoid broad extrapolations based on cross‑organisational averages alone.
  • Demand vendor transparency: require explicit contractual language around data usage, whether tenant data may be used to fine‑tune models, and environmental metrics for lifecycle assessments.
By treating Copilot as a targeted productivity lever rather than a blanket cure, IT teams can capture concentrated gains while managing risk.
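As a concrete companion to the “pilot deliberately and measure conservatively” bullet above, the following sketch shows one simple way to log observed task timings and fold verification overhead into the comparison. The record structure, task names and timings are hypothetical, not data from the DBT trial.

```python
# Hypothetical pilot log: task types, timings and structure are illustrative.
from statistics import mean

# (task_type, used_copilot, completion_minutes, verification_minutes)
observations = [
    ("meeting_summary", True, 6, 2),
    ("meeting_summary", False, 15, 0),
    ("email_draft", True, 4, 1),
    ("email_draft", False, 6, 0),
]


def mean_total_time(task: str, with_copilot: bool) -> float:
    """Mean time to complete plus verify, for one task type and study arm."""
    times = [complete + verify
             for t, used, complete, verify in observations
             if t == task and used == with_copilot]
    return mean(times)


for task in ("meeting_summary", "email_draft"):
    saving = mean_total_time(task, False) - mean_total_time(task, True)
    print(f"{task}: net saving per task ≈ {saving:.1f} min, verification included")
```

Collecting control timings alongside Copilot timings is what allows verification overhead to be subtracted from the estimate, rather than relying on perception alone.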

Critical analysis: strengths, limitations and material risks​

Strengths (what the evidence supports)​

  • Real, repeatable wins on written and transcription tasks. The DBT diaries and observed tasks both show measurable time savings when Copilot is used for drafting, summarising and meeting notes. These tasks are high‑frequency and low‑context—an ideal match for LLM assistance.
  • High user satisfaction where the tool fits. Two‑thirds+ satisfaction and positive NPS figures show cultural receptivity—an important enabler for sustained adoption if governance and training are in place.
  • Accessibility and inclusion benefits. Automated transcriptions and language assistance can materially improve participation and reduce cognitive burden for neurodiverse staff and non‑native English speakers.

Limitations and risks (what organisations must not overlook)​

  • Verification overhead and hallucinations. Confidently incorrect outputs are a persistent failure mode that creates hidden review costs and operational risk if outputs feed decisions without rigorous human oversight. DBT recorded hallucinations and recommended mandatory review for substantive outputs.
  • Uneven impact across task types. In some analytical or creative workflows—Excel modelling, complex data analysis, polished slide design—Copilot can slow users or require substantial rework, producing net negative effects in those domains.
  • Short pilot limitations. The trial’s three‑month timeframe, partly coinciding with a holiday period, limits claims about medium‑ and long‑term productivity gains from workflow change and habit formation. Longitudinal studies would be needed to establish sustained ROI.
  • Unquantified environmental and procurement costs. DBT called out the need for further work to measure the environmental footprint and complete value‑for‑money analysis before large‑scale procurement. These are legitimate procurement concerns that remain unresolved.

Recommended next steps for public‑sector and enterprise adopters​

  • Start with targeted pilots focused on drafting, summarising, and meeting transcription rather than broad rollouts.
  • Build mandatory human‑in‑the‑loop review for any output that will inform decisions or be reused externally.
  • Require vendors to disclose data usage and environmental metrics as part of procurement.
  • Invest in self‑paced training materials, prompt libraries and role‑specific playbooks to accelerate effective adoption.
  • Plan longitudinal follow‑up studies to measure whether per‑task time savings consolidate into measurable productivity gains over 6–18 months.
These steps follow DBT’s core recommendations and the broader evidence emerging from government experiments: pilot, govern, measure, refine. (gov.uk)

Conclusion​

The DBT Copilot pilot offers a pragmatic lesson for Windows administrators, IT leaders and public‑sector decision‑makers: Microsoft 365 Copilot can save time—especially on writing and meeting summarisation—but time saved is not the same as productivity gained. The tool’s benefits are concentrated, measurable in specific workflows, and accompanied by real risks—hallucinations, verification overhead and task‑specific slowdowns—that require governance and training to manage.
Both the DBT departmental evaluation and the larger cross‑government experiment are valuable and complementary pieces of evidence. The larger experiment highlights broader perceived time savings at scale, while DBT’s conservative, mixed‑method approach reminds decision‑makers that organisational productivity improvements require workflow redesign, training and long‑term measurement—not simply licence counts. For organisations considering Copilot, the sensible path is disciplined, role‑based pilots, mandatory human review for substantive outputs, vendor transparency on data and environmental impact, and realistic ROI models that reflect how people actually use the tool in day‑to‑day work. (gov.uk)

Source: Civil Service World Copilot: DBT trial finds time savings but no productivity improvement