DWP Trial Finds 19 Minutes Saved Daily with Microsoft 365 Copilot

Microsoft’s Copilot has moved from marketing demo to a working tool trialled inside the largest UK welfare department. The Department for Work and Pensions (DWP) now says the paid, licensed version of Microsoft 365 Copilot saved civil servants an average of 19 minutes per working day on routine tasks, based on a mixed-method evaluation of a six‑month pilot that ran from October 2024 to March 2025. (gov.uk)

Background / Overview​

The DWP trial tested the licensed Copilot across central office (non‑frontline) staff, distributing 3,549 licences through a mix of volunteering and peer nomination. Fieldwork consisted of two large surveys, one of licence holders (1,716 responses) and one of a comparison group of non‑licence holders (2,535 responses), supplemented by 19 in‑depth interviews and econometric analysis using Seemingly Unrelated Regression (SUR). The DWP evaluation was published on 29 January 2026 and focuses on the licensed, tenant‑integrated Copilot available to Microsoft 365 customers. (gov.uk)
This official DWP estimate sits between other high‑profile government figures: a cross‑government Government Digital Service (GDS) experiment involving roughly 20,000 civil servants reported an average 26 minutes saved per day, published as a ministerial statement on 2 June 2025, while a departmental pilot run by the Department for Business and Trade (DBT) concluded that Copilot delivered mixed effects and did not show clear, department‑level productivity gains overall. (thegovernmentsays-files.s3.amazonaws.com)

What the DWP evaluation actually measured​

Study design and sample​

  • The DWP trial ran from October 2024 to March 2025 and targeted central office business functions (policy, digital, finance, etc.). Frontline Jobcentre colleagues were excluded from the licensed pilot. (gov.uk)
  • The evaluation combined quantitative surveys and econometric modelling with qualitative interviews to produce a counterfactual‑style estimate: the SUR models contrasted Copilot users with a stratified comparison group of non‑users and adjusted for demographic factors, job grade, business area, health conditions, and measures of AI interest and prior experience. (gov.uk)
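
To make that modelling step concrete, here is a minimal SUR sketch in Python using the linearmodels package on synthetic data. The variable names, covariates, and effect sizes are illustrative assumptions, not DWP’s actual specification.

```python
# Minimal sketch of a Seemingly Unrelated Regression (SUR) of the kind the
# DWP evaluation describes, using the linearmodels package on synthetic data.
# Variable names, covariates and effect sizes are illustrative assumptions.
import numpy as np
import pandas as pd
from linearmodels.system import SUR

rng = np.random.default_rng(42)
n = 4_000  # on the order of the combined survey samples

df = pd.DataFrame({
    "copilot_user": rng.integers(0, 2, n),  # treatment: holds a licence
    "grade": rng.integers(1, 7, n),         # job grade (covariate)
    "ai_keenness": rng.normal(size=n),      # prior AI interest (covariate)
})
# Simulate outcomes in which users save ~19 minutes and are more satisfied.
df["minutes_saved"] = 19 * df["copilot_user"] + 3 * df["ai_keenness"] + rng.normal(0, 15, n)
df["satisfaction"] = 0.4 * df["copilot_user"] + 0.1 * df["ai_keenness"] + rng.normal(0, 1, n)

exog = pd.concat(
    [pd.Series(1.0, index=df.index, name="const"),
     df[["copilot_user", "grade", "ai_keenness"]]],
    axis=1,
)
# One equation per outcome; SUR estimates them jointly, exploiting correlated
# errors across outcomes. (With identical regressors in every equation, SUR
# coincides with equation-by-equation OLS; real specifications usually differ.)
equations = {
    "time": {"dependent": df["minutes_saved"], "exog": exog},
    "satisfaction": {"dependent": df["satisfaction"], "exog": exog},
}
res = SUR(equations).fit(cov_type="robust")
print(res.params.filter(like="copilot_user"))  # treatment coefficients
```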

Outcomes and metrics​

DWP measured three primary outcomes:
  • Task efficiency — self‑reported time saved per day across eight routine tasks, converted from ordinal survey categories into continuous minutes (a conversion sketch follows below), then modelled with SUR.
  • Job satisfaction — a 7‑point Likert measure of overall satisfaction in the last three months.
  • Perceived quality of work — a 7‑point Likert measure for output quality.
The headline econometric result: a statistically significant treatment coefficient that translates to roughly 19 minutes saved per user per day on the set of eight routine tasks after adjusting for confounders. (gov.uk)
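
The ordinal-to-minutes conversion is simple to illustrate. Below is a minimal sketch with hypothetical survey bands (the report’s actual category boundaries are not reproduced here), taking each band’s lower bound in the way the report describes:

```python
# Sketch of converting ordinal "time saved" survey bands into continuous
# minutes by taking each band's lower bound, as the report describes.
# These band boundaries are hypothetical, not the DWP's actual categories.
LOWER_BOUND_MINUTES = {
    "no time saved": 0,
    "1 to 15 minutes": 1,
    "16 to 30 minutes": 16,
    "31 to 60 minutes": 31,
    "more than 60 minutes": 61,
}

def band_to_minutes(response: str) -> int:
    """Map an ordinal survey response to its lower-bound minute value."""
    return LOWER_BOUND_MINUTES[response]

responses = ["16 to 30 minutes", "no time saved", "1 to 15 minutes"]
minutes = [band_to_minutes(r) for r in responses]
print(minutes)                      # [16, 0, 1]
print(sum(minutes) / len(minutes))  # mean is a conservative (lower-bound) figure
```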

Where the time savings came from — task‑level breakdown​

DWP disaggregated time savings by task. The largest measured effects were for text and knowledge tasks rather than data‑heavy work:
  • Searching for existing information or research: 26 minutes saved per day. (gov.uk)
  • Writing emails: 25 minutes saved per day. (gov.uk)
  • Summarising information or research: 24 minutes saved per day. (gov.uk)
  • Producing or editing written materials: ~20 minutes saved per day. (gov.uk)
  • Transcribing/summarising meetings: the smallest measured saving at 9 minutes per day. (gov.uk)
These are conservative, lower‑bound estimates, derived by taking the lower bound of each ordinal response category; the report emphasises that figures are rounded and statistically significant at conventional levels. (gov.uk)

How staff used the saved time​

DWP’s qualitative interviews reveal how staff redeployed minutes saved:
  • Many respondents said the time freed up was reinvested in higher‑value tasks such as project work, strategic planning, or mentoring, rather than simply extending working hours. (gov.uk)
  • Users reported improvements in the quality of draft outputs — Copilot assisted with tone, structure, and initial drafts, particularly for emails and briefings — while emphasising the need for human editing where judgement, legal accuracy, or citations are required. (gov.uk)
  • Several interviewees described Copilot as a “comfort blanket” that reduced stress and cognitive load when handling paperwork and information overload. (gov.uk)

How DWP’s finding compares with other UK government trials​

  • The Government Digital Service (GDS) cross‑government experiment — a much larger, cross‑departmental exercise involving ≈20,000 civil servants across 12 organisations — reported a headline figure of 26 minutes saved per day, along with strong user satisfaction and adoption metrics. That experiment relied heavily on self‑reporting and did not use a formal non‑user comparison group in the same way DWP did. (thegovernmentsays-files.s3.amazonaws.com)
  • The Department for Business and Trade (DBT) ran a three‑month pilot with 1,000 licences (Oct–Dec 2024) and published an evaluation on 28 August 2025 that found high user satisfaction but no robust evidence that aggregated time savings translated into measurable productivity gains for the department; some tasks sped up while others slowed because of output quality issues. DBT combined diary studies, telemetry, and observed task timings to reach a more conservative conclusion.
Why the headlines differ: measurement choices matter. Larger scale self‑reports often produce larger headline savings, while smaller, tightly evaluated pilots with control groups and observed timings yield more conservative estimates. DWP sits in the middle: it used a comparison group and econometric adjustment, but still relied on self‑reports for the basic time‑use data, which introduces known biases. (gov.uk)

Methodological strengths and limitations — what to believe​

Strengths of the DWP evaluation​

  • Comparison group: unlike some other government studies, DWP explicitly surveyed a stratified comparison group of non‑licence holders, improving causal inference potential. (gov.uk)
  • Econometric adjustment: the SUR modelling adjusted for a wide set of covariates including job grade, business area and measures of AI interest/experience, which reduced some self‑selection bias. (gov.uk)
  • Mixed methods: combining surveys, interviews, and regression analysis provides both quantification and contextual understanding about how Copilot was used and perceived. (gov.uk)

Important caveats and risks​

  • Non‑random allocation: licences were distributed via volunteers and nominations, not random assignment, leaving potential for unobserved confounding (people who volunteer for technology pilots tend to be different). DWP acknowledges this and attempts to adjust, but limitations remain. (gov.uk)
  • Self‑reported time data: converting ordinal diary responses into continuous minutes risks over‑ or under‑estimating real elapsed time; observed task timing studies often tell a different story than diaries. DBT explicitly flagged the inflationary potential of self‑report measures.
  • No pre‑trial baseline: absence of a robust pre‑trial measurement complicates claims about net gains relative to prior working patterns. The DWP report used cross‑sectional comparisons instead. (gov.uk)
  • Task heterogeneity: Copilot helps some tasks (summaries, search, drafting) much more than others (complex Excel analyses, novel tasks) — blanket productivity claims therefore overstate nuance. DBT’s pilot found Copilot slowed data‑analysis tasks in places.
  • Hallucinations and trust: confident but incorrect outputs (“hallucinations”) remain a real hazard in public sector use, especially where incorrect content could be passed to citizens or used in decision documents. Both DBT and other departments reported hallucination incidents requiring editorial oversight.

Governance, security and training — the operational checklist​

Adopting Copilot at scale is not just a procurement question; it’s an organisational transformation problem that touches security, procurement, policy, and professional practice. Key governance considerations surfaced in the DWP report and other departmental evaluations:
  • Data protection and acceptable use: explicit policies are needed on what departmental data can be fed into Copilot, with role‑based controls and tenant settings to prevent leakage of sensitive information. (gov.uk)
  • Verification workflows: build mandatory human‑in‑the‑loop checks for outputs used in official communications, legal texts, or published decisions. Automated outputs should be treated as drafts rather than final work. (gov.uk)
  • Training that is role‑specific: DWP respondents wanted short, practical sessions tailored to the specific tasks they do, not generic demos. Targeted prompts and playbooks are more effective than one‑size‑fits‑all onboarding. (gov.uk)
  • Telemetry and ROI measurement: pair self‑reported diaries with observed task timings and telemetry (application calls, action counts) to triangulate real productivity effects; a triangulation sketch follows this list. DBT’s mixed approach underlined the value of multiple data streams.
  • Accessibility and inclusion: Copilot delivered measurable accessibility benefits for neurodivergent staff and non‑native English speakers by reducing friction in drafting and summarisation — a compelling equity argument to complement efficiency claims. (gov.uk)
  • Environmental and cost audits: departments should quantify licence costs, per‑user consumption and any environmental footprint of large model usage before large rollouts; DBT flagged the need for further environmental assessment.
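
On the telemetry point above, the following is a minimal triangulation sketch. Column names and the actions-to-minutes factor are illustrative assumptions; a real study would calibrate that factor against observed task timings:

```python
# Sketch of triangulating self-reported savings against usage telemetry.
# Column names and the actions-to-minutes factor are illustrative assumptions;
# a real study would calibrate the factor from observed task timings.
import pandas as pd

diary = pd.DataFrame({
    "user": ["a", "b", "c"],
    "reported_minutes": [25, 10, 40],  # self-reported minutes saved per day
})
telemetry = pd.DataFrame({
    "user": ["a", "b", "c"],
    "copilot_actions": [30, 12, 8],    # logged drafts/summaries per day
})

MINUTES_PER_ACTION = 1.0  # crude calibration assumption
telemetry["estimated_minutes"] = telemetry["copilot_actions"] * MINUTES_PER_ACTION

merged = diary.merge(telemetry, on="user")
merged["gap"] = merged["reported_minutes"] - merged["estimated_minutes"]
print(merged[["user", "reported_minutes", "estimated_minutes", "gap"]])
# Large positive gaps flag cohorts whose self-reports may overstate savings.
```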

Practical recommendations for IT leaders and programme owners​

  • Start with tightly scoped pilots that pair users with matched non‑user comparison groups and include both diary and observed timing methods.
  • Prioritise text‑heavy business functions (policy drafting, comms, secretariat) where Copilot shows the clearest gains.
  • Implement tenant‑level governance from day one: DLP, role controls, audit logging, and a mandatory verification policy for outputs.
  • Deliver short, role‑specific training and published prompt libraries for common tasks (email drafts, search prompts, meeting minutes).
  • Measure impact holistically: time saved is valuable, but watch for offset costs (rework from poor quality outputs) and track whether time savings convert to higher‑value activities.
  • Maintain human oversight for regulated outputs and embed review steps in workflows rather than treating Copilot as a final author. (gov.uk)

The ROI question — can organisations expect to recoup licence costs?​

The DWP evidence suggests measurable, daily minutes saved for many users, but turning minutes into pounds is not automatic. Licence pricing, the proportion of staff in text‑intensive roles, the degree of managerial buy‑in, and whether time savings are redeployed to revenue‑generating or cost‑saving activities all determine ROI; a simple break‑even sketch follows the list below.
  • If time savings are reinvested in higher‑value tasks (policy delivery, stakeholder engagement), the organisation may see strategic returns.
  • If saved minutes merely reduce stress or marginally shorten email time without changing higher‑order outputs, financial ROI will be modest.
  • DBT’s cautious conclusion that time savings did not obviously translate into department‑level productivity demonstrates why finance teams should insist on evidence of reallocation to high‑value outcomes before scaling licences.
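
For illustration, here is a back-of-envelope break-even sketch. Every input is an assumption to be replaced with real licence pricing and fully loaded staff costs:

```python
# Back-of-envelope break-even sketch for a Copilot licence. Every input is an
# assumption for illustration; substitute real pricing and loaded staff costs.
MINUTES_SAVED_PER_DAY = 19           # DWP headline estimate
WORKING_DAYS_PER_YEAR = 220          # assumption
HOURLY_STAFF_COST_GBP = 25.0         # assumption: fully loaded cost
LICENCE_COST_PER_YEAR_GBP = 300.0    # assumption: illustrative price

hours_saved = MINUTES_SAVED_PER_DAY * WORKING_DAYS_PER_YEAR / 60
gross_value = hours_saved * HOURLY_STAFF_COST_GBP

print(f"Hours saved per user per year: {hours_saved:.0f}")     # ~70
print(f"Gross value of time saved:     £{gross_value:,.0f}")   # ~£1,740
print(f"Annual licence cost:           £{LICENCE_COST_PER_YEAR_GBP:,.0f}")
print(f"Nominal value-to-cost ratio:   {gross_value / LICENCE_COST_PER_YEAR_GBP:.1f}x")
# The ratio only becomes real ROI if saved minutes are redeployed into
# higher-value work, which is exactly what the DBT pilot questioned.
```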

Broader implications: AI assistants in public service​

Three broader lessons emerge from DWP, GDS, and DBT experiments:
  • Nuance beats hype: AI assistants are toolkits for specific pain points, not universal productivity multipliers. Different departments and roles will experience different net benefits. (gov.uk)
  • Measurement matters: self‑reporting inflates headlines. The most credible evaluations combine self‑report, telemetry, and observed timing with comparison groups where feasible. (gov.uk)
  • Governance is not optional: hallucinations, data sensitivity and quality lapses are real and can offset efficiency gains if unchecked. Robust policies and human verification must be baked into rollouts. (theregister.com)

Final analysis and verdict​

DWP’s evaluation is one of the most methodologically conscientious departmental looks at Copilot to date: it balances self‑reported user experience with econometric adjustment against a comparison group to estimate an average daily saving of 19 minutes among central functions, with strongest gains on search, summarisation and email drafting. (gov.uk)
That figure is credible — but not definitive — and should be interpreted alongside the larger GDS headline of 26 minutes (which lacked a control group) and DBT’s more cautious, task‑specific findings that time savings do not automatically equal department‑level productivity improvements. Taken together, the evidence paints a consistent picture: Copilot helps with text and knowledge work, improves draft quality and staff experience in many roles, but is not a silver bullet and requires careful governance, training, and measurement to turn minutes saved into durable value. (thegovernmentsays-files.s3.amazonaws.com)
For IT and digital leaders, the path is clear: pilot with rigor, govern tightly, measure comprehensively, and scale where task fit, cost structure, and verification workflows align. In short: treat Copilot as a powerful productivity‑adjacent tool — and design policy and process so those saved minutes translate into better public service, not just faster drafts. (gov.uk)

Source: theregister.com DWP finds Copilot saves civil servants 19 minutes a day
 

The Department for Work and Pensions’ six‑month trial of the licensed Microsoft 365 Copilot found that participating corporate staff saved an average of 19 minutes per day on routine administrative tasks, with pronounced gains in information retrieval, drafting emails, and summarising documents.

Background​

The DWP trial ran from October 2024 through March 2025 and involved more than 3,500 licences allocated across corporate teams (digital, policy, finance and others), using the paid/enterprise version of Microsoft 365 Copilot that is embedded into Microsoft 365 applications and connected to organisational data and compliance controls. The trial’s evaluation combined a range of quantitative and qualitative approaches, including two workforce surveys (users and a comparison non‑user group), econometric analysis, and structured qualitative interviews.
The UK government’s recent wave of public‑sector AI experiments — including large cross‑Whitehall pilots earlier in 2024–25 — set the context for DWP’s work. Other government evaluations have reported time savings ranging from roughly 19 minutes to 26 minutes per user per day in different cohorts and trial designs, while some department‑level pilots have found mixed or negligible productivity effects depending on task and role. This patchwork of results matters because measurement method and sample selection materially change headline conclusions.

What the DWP trial measured and how​

Trial design and cohorts​

The DWP evaluation focused on corporate central‑office staff (not Jobcentre frontline staff) and deliberately compared Copilot licence holders with a comparison group that did not have access to the licensed Copilot during the trial. This comparator group is important: several earlier government pilots that reported larger savings lacked a contemporaneous non‑user control, which can inflate self‑reported gains. The DWP’s analysis used Seemingly Unrelated Regression (SUR) and several model specifications to control for confounding factors including occupation, grade, demographic variables, and prior AI interest/experience.

Primary outcome measures​

Evaluators centred on three headline outcomes:
  • Time saved on routine tasks (eight task types measured),
  • Perceived quality of outputs (Likert scales and self‑report), and
  • Job satisfaction / fulfilment (Likert measures and qualitative reporting).
The econometric model that included AI‑keenness as a covariate produced the principal estimate of 19 minutes saved per user per day — statistically significant after controlling for confounders. Disaggregated task estimates pointed to larger savings in searching for information (approx. 26 minutes) and email composition (approx. 25 minutes), with smaller or no gains in some other routine activities.

Sample size and survey response​

The public write‑ups note that survey response volumes were substantial: the user survey and the comparison survey produced thousands of responses, giving the evaluation statistical power to detect modest effects. However, licence allocation was not random; it combined volunteers and peer nominations, leaving open the prospect of selection bias that the statistical models attempt to address but cannot entirely eliminate.

Key findings: productivity, quality, and wellbeing​

1) Time savings and how they were reused​

The headline estimate of 19 minutes per day equates to roughly one and a half hours per week, or around nine to ten working days over a full year for an individual (19 minutes × 5 days ≈ 95 minutes a week; across roughly 220 working days, about 70 hours) — useful incremental capacity, but far smaller than some earlier headlines comparing different trials. The DWP report is careful to show that saved time was most frequently reinvested in higher‑value work: planning, project delivery, mentoring, or tasks requiring human judgement and relationship building.

2) Perceptions of work quality and fulfilment​

About 73% of Copilot users reported improvements in output quality and 65% said they felt more fulfilled in their roles. Users described Copilot as reducing cognitive load — it produced consistent first drafts, suggested phrasing, and extracted key points from documents so staff could focus on editorial judgment and decision‑making. Importantly, the evaluation emphasises that Copilot outputs still typically required human review and contextual editing.

3) Accessibility and neurodiversity benefits​

The trial surfaced notable accessibility benefits: staff who self‑identified as neurodivergent — including people with ADHD or dyslexia — reported that Copilot helped maintain task focus, scaffold written communication, and reduce friction in routine workflows. The evaluation treated these outcomes as meaningful workplace inclusion gains, while also flagging the need to ensure outputs meet accessibility standards for other service users.

Cross‑checking the broader evidence base​

The DWP result is consistent with some aspects of other civil‑service trials but differs in magnitude and method from others. For example, a wider cross‑government exercise published in mid‑2025 reported average time savings nearer 26 minutes per day across a much larger and less controlled cohort, while some department‑level trials (notably a Department for Business and Trade pilot) found that time saved in some tasks did not always translate into improved productivity because output quality varied by use case. Those differences reflect two core drivers: (a) the sensitivity of self‑reported time savings to trial design and the presence or absence of a control group; and (b) variability by task type (writing tends to benefit more than scheduling or slide generation).

Strengths of DWP’s evaluation​

  • Comparator group and econometric rigour: The DWP trial included a contemporaneous non‑user comparison group and used SUR models to control for multiple covariates, strengthening causal inference compared with uncontrolled self‑report studies.
  • Large sample and domain coverage: With over 3,500 licence allocations and thousands of survey responses, the evaluation had statistical power to detect modest effects and disaggregate by task and occupation.
  • Attention to human oversight and quality: The report repeatedly stresses the need for editorial judgment and user validation of outputs — an important counterbalance to over‑optimistic automation narratives.
  • Inclusion of accessibility outcomes: Reporting on neurodivergent staff experiences enriches the conversation about AI as an accessibility and inclusion technology rather than only a productivity engine.

Risks, limits, and areas that deserve scrutiny​

Measurement and selection bias​

Even with a comparison group and controls, the licence allocation strategy (volunteers and nominations) can leave selection bias. AI‑enthusiasts may have different baseline workflows, and self‑reported time savings can be influenced by novelty and desire to justify early adoption. The DWP models attempt to control for "AI‑keenness," but residual confounding is plausible; independent replication with random allocation would provide firmer causal claims.
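
One generic robustness check for this kind of non-random allocation is inverse-propensity weighting. The sketch below is not the DWP’s published method; it simply shows how observed covariates such as AI-keenness can be used to rebalance users against non-users:

```python
# Sketch of an inverse-propensity-weighting (IPW) robustness check for
# non-random licence allocation. This is a generic technique, not the DWP's
# published method; columns and effect sizes are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4_000
df = pd.DataFrame({
    "ai_keenness": rng.normal(size=n),
    "grade": rng.integers(1, 7, n),
})
# Keener staff are more likely to volunteer for a licence (selection bias).
p_treat = 1 / (1 + np.exp(-(0.8 * df["ai_keenness"] - 0.2)))
mask = rng.random(n) < p_treat
df["minutes_saved"] = 19 * mask + 4 * df["ai_keenness"] + rng.normal(0, 15, n)

# Model the probability of holding a licence from observed covariates...
X = df[["ai_keenness", "grade"]]
propensity = LogisticRegression().fit(X, mask).predict_proba(X)[:, 1]
# ...then weight each person by the inverse probability of the group they
# actually ended up in, rebalancing users and non-users on observables.
w = np.where(mask, 1 / propensity, 1 / (1 - propensity))

y = df["minutes_saved"].to_numpy()
naive = y[mask].mean() - y[~mask].mean()
ipw = (np.average(y[mask], weights=w[mask])
       - np.average(y[~mask], weights=w[~mask]))
print(f"Naive difference:    {naive:.1f} minutes")
print(f"IPW-adjusted effect: {ipw:.1f} minutes")  # closer to the true 19
```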

Task heterogeneity​

Time savings were uneven across tasks and professions. Drafting and summarisation consistently show gains; scheduling, slide generation, and tasks that require nuanced policy judgment can show mixed or negative outcomes. Organisations must avoid the temptation to average across all roles and assume uniform benefit. The DWP’s disaggregated numbers underscore this heterogeneity.

Quality risks and the “automation bias”​

Copilot can produce plausible‑sounding output that is incorrect or lacks necessary nuance: the DWP evaluation and other pilots emphasise the need for editorial checks. Overreliance risks automation bias — staff may accept AI outputs uncritically, especially under time pressure — and that is particularly hazardous in policy or legal contexts. Continuous quality assurance and human‑in‑the‑loop workflows are essential.

Data governance, security and compliance​

Using a Copilot variant that integrates with organisational data raises real governance questions: who can access what data, what retention and logging policies are in place, and how are outputs treated under records management and FOI regimes? The DWP report describes a policy framework and compliance measures, but scaling beyond central teams will increase the surface area for misclassification, leakage or misuse unless strict controls, auditing and technical safeguards are enforced.

Equity and workforce implications​

Efficiency gains concentrated in some professions or grades can widen internal inequalities. If managers reassign “saved” minutes into extra outputs rather than genuine capacity for development, net wellbeing gains may be limited. Furthermore, long‑term deployment scenarios must address retraining, role redesign, and collective bargaining considerations. The DWP evidence of increased fulfilment is encouraging, but it should not be read as an automatic argument for headcount reduction.

Accessibility trade‑offs​

While neurodivergent staff reported benefits, the evaluation flags that AI outputs must be checked for accessibility compliance. Tools that rephrase or summarise must preserve plain‑language clarity and screen‑reader compatibility; otherwise, accessibility gains for some users may introduce barriers for others.

Practical implications for IT decision‑makers and leaders​

Implementing a Copilot‑style assistive AI across a large public body or enterprise should be treated as a full organisational change program, not simply a software rollout. Below are pragmatic steps and guardrails derived from the DWP evidence and other public‑sector pilots.

1. Pilot design and evaluation​

  • Start with targeted, role‑based pilots that include a matched control group or randomised allocation to robustly measure impact.
  • Predefine measurable KPIs (time on task, quality assessment, rework rates, end‑user satisfaction) and collect baseline data; a difference‑in‑differences sketch follows this list.
  • Use qualitative interviews to surface behavioural and wellbeing effects not visible in time metrics.
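
The value of baseline data is easiest to see in a difference-in-differences readout, sketched here with illustrative numbers:

```python
# Difference-in-differences sketch showing why baseline data matters.
# All numbers are illustrative assumptions.
baseline = {"users": 55.0, "control": 54.0}  # avg minutes on a task, pre-pilot
endline  = {"users": 36.0, "control": 50.0}  # avg minutes on a task, in-pilot

user_change    = endline["users"] - baseline["users"]        # -19.0
control_change = endline["control"] - baseline["control"]    # -4.0

# Subtracting the control group's drift strips out organisation-wide changes
# (new templates, seasonal workload) that have nothing to do with Copilot.
did_estimate = user_change - control_change
print(f"Difference-in-differences estimate: {did_estimate:.1f} minutes per task")
```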

2. Governance and data protection​

  • Establish strict data classification rules for what may be surfaced to Copilot and what must be excluded.
  • Implement audit logging and retention policies for AI interactions and outputs.
  • Coordinate with records managers and legal teams to ensure outputs are treated appropriately under FOI, data protection, and records rules.

3. Training, onboarding, and change management​

  • Combine short technical runbooks with role‑specific use cases and red‑team exercises to illustrate failure modes.
  • Encourage peer learning and communities of practice; the DWP trial found many users relied on self‑directed learning and peer support.
  • Build explicit human‑in‑the‑loop checkpoints into workflows where accuracy and judgement are critical.

4. Accessibility and inclusion​

  • Evaluate impacts on neurodivergent staff and incorporate assistive workflows that benefit targeted cohorts.
  • Require accessibility verification on AI‑generated text intended for public consumption.

5. Monitoring and KPIs after rollout​

  • Continuously measure actual time‑on‑task, rework, error rates, and user satisfaction rather than only relying on self‑report.
  • Monitor distribution of gains across departments to spot unequal benefits and adjust licence allocation accordingly.

Cost‑benefit and procurement considerations​

The DWP trial focused on the licensed enterprise Copilot. Procurement teams should weigh:
  • Licence and integration costs against estimated time savings (19 minutes/day is modest at the individual level but aggregates across teams);
  • Ongoing costs for training, governance, and auditing; and
  • Opportunity costs of not modernising workflows.
Conservative modelling should incorporate heterogeneity: assume smaller gains in policy/legal teams and larger gains in communications and knowledge management. Quantify both direct time savings and indirect value from improved output quality and employee retention. Public bodies must also consider vendor lock‑in and interoperability with existing document management systems when negotiating enterprise contracts.
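
A headcount-weighted sketch of that heterogeneity follows, with an illustrative role mix and assumed per-role savings:

```python
# Headcount-weighted savings sketch: apply role-specific estimates rather
# than the 19-minute average uniformly. Role mix and per-role minutes are
# illustrative assumptions.
role_mix = {
    # role: (headcount, assumed minutes saved per day)
    "communications":       (150, 25),
    "knowledge management": (100, 24),
    "policy":               (300, 12),
    "legal":                (50, 5),
}

total_staff = sum(head for head, _ in role_mix.values())
weighted = sum(head * mins for head, mins in role_mix.values()) / total_staff

for role, (head, mins) in role_mix.items():
    print(f"  {role:<22} {head:>4} staff x {mins:>2} min/day")
print(f"Headcount-weighted saving: {weighted:.1f} minutes/day")  # ~16.7 here
```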

Ethical, legal, and regulatory questions to resolve​

  • Who is legally responsible for AI‑generated content that is inaccurate or discriminatory? Organisations must set clear accountability for published outputs that use AI assistance.
  • How do public bodies ensure transparency when citizens receive advice or decisions influenced by generative AI? Clear disclosure policies and oversight mechanisms are necessary.
  • What auditing standards will regulators demand for AI models that access sensitive citizen data? Compliance regimes for public services will likely harden as regulators issue sector‑specific guidance.
The DWP report recognises these issues and frames Copilot as an assistive tool requiring human oversight rather than a fully automated decision‑maker. But scaling will bring greater regulatory scrutiny and the need for transparent governance.

What the DWP trial does — and does not — prove​

The evaluation provides credible evidence that a licensed, enterprise Copilot can deliver measurable day‑to‑day efficiencies and perceived improvements in work quality and fulfilment for many corporate civil‑service staff. Its strengths are methodological: a substantial sample, a comparison group, and econometric controls that reduce but do not eliminate confounding.
What it does not prove is that Copilot is a universal productivity multiplier across all roles, nor that headline time savings automatically translate into net organisational output gains without accompanying governance, role redesign, and careful quality control. The experience of other trials — some reporting larger average savings and others finding mixed outcomes — demonstrates that context, use case, and measurement matter.

Recommendations for technology leaders considering Copilot‑style deployments​

  • Treat Copilot as a productivity multiplier that requires new processes, not as a drop‑in time saver.
  • Pilot with randomised or carefully matched control groups to establish robust evidence before wide rollout.
  • Invest in governance, logging, and records management from day one.
  • Prioritise use cases that consistently show benefits — summarisation, first‑draft writing, and information retrieval — and be cautious with nuanced policy, legal or operational decision tasks.
  • Leverage observed inclusivity benefits: consider targeted licences for staff who could gain accessibility and neurodiversity advantages, and measure those outcomes explicitly.

Final assessment​

DWP’s Microsoft 365 Copilot trial is one of the most methodologically transparent public‑sector evaluations to date. By reporting a 19‑minute daily saving alongside nuanced findings on quality, job fulfilment, and accessibility, the report offers pragmatic evidence that generative AI can augment routine office work when deployed with care.
That said, the benefits are conditional: they depend on the tasks chosen, the quality of onboarding and governance, and the continued insistence on human editorial control. Organisations seeking similar gains should follow the evidence: pilot thoughtfully, measure robustly, and design governance and training before scaling. The DWP trial points to promising productivity and inclusion dividends — but it also underlines that success requires people‑centred implementation rather than technology‑first optimism.

Source: Computing UK 19 Minutes a Day saved by CoPilot, DWP trial finds
 
