Microsoft Research has delivered one of the clearest, most data-driven snapshots yet of how generative AI is starting to reshape the labour market: by analyzing 200,000 anonymized US conversations with Bing Copilot, the team produced an AI applicability score for occupations that quantifies where Copilot’s capabilities overlap with real work activities — and then ranked the 40 occupations most and least exposed to today’s generative-AI capabilities. The study’s headline finding is sharp and unsettling for many knowledge workers: language-, information- and communication-heavy roles show the highest overlap with current GenAI capabilities, while physically grounded, machinery‑focused and care occupations remain the most insulated for now. (arxiv.org) (microsoft.com)

[Infographic: "AI Applicability by Occupation: Knowledge Work Most Exposed to GenAI"; a two-column Bing Copilot graphic with device icons over a world map.]

Background / Overview​

Microsoft’s public research report, titled Working with AI: Measuring the Occupational Implications of Generative AI, is a preprint (arXiv) and a Microsoft Research publication that summarizes an empirical study of real Copilot use in the wild. The authors — Kiran Tomlinson, Sonia Jaffe, Will Wang, Scott Counts and Siddharth Suri — analyzed conversations sampled from a nine‑month window of Bing Copilot use in the United States (January 1, 2024 — September 30, 2024). The dataset comprises roughly 200k conversations split into two representative samples: a 100k “uniform” sample and a 100k sample enriched for explicit user feedback (thumbs up / thumbs down). (arxiv.org) (microsoft.com)
Rather than predicting what AI might do in theory, Microsoft anchored its analysis in actual user behaviour. Conversations were automatically mapped to the U.S. Department of Labor’s O*NET taxonomy at the Intermediate Work Activity (IWA) level (332 IWAs), enabling direct linkage between Copilot activities and the set of work activities that compose real occupations. The team then combined three key measurements — coverage (how often Copilot is used for an IWA), completion (task success as measured by user feedback and classifier signals), and scope (the share of an activity’s tasks affected) — to compute an AI applicability score for each occupation. Occupations with higher scores are those where Copilot is already being used frequently, successfully, and across a meaningful share of the job’s activities. (arxiv.org)
Independent coverage of the study (press outlets and tech reporting) replicated the core message: knowledge work is where generative AI is demonstrating the most immediate traction, while hands-on manual roles are least affected by current LLM systems. That consensus appears across major tech press and business outlets summarizing Microsoft’s paper. (geekwire.com, investopedia.com)

How the study measures AI relevance — a technical primer​

Data sources and sampling​

  • Copilot-Uniform: ~100k conversations sampled uniformly across U.S. users over nine months (Jan 1 – Sep 30, 2024).
  • Copilot-Thumbs: ~100k conversations sampled for presence of explicit thumbs‑up / thumbs‑down feedback to measure task success.
    Both datasets were anonymized and privacy-scrubbed under Microsoft IRB oversight. (arxiv.org)

Classification approach​

  • The study uses a GPT‑4o‑based classifier pipeline to label each conversation with all matching Intermediate Work Activities (IWAs). Classifiers were validated against human annotators to estimate reliability.
  • Classification was performed separately for the user goal (what the user was trying to accomplish) and the AI action (what the AI actually performed). That separation is critical: it distinguishes augmentation (AI helps a human complete a task) from automation (AI performs a task directly). (arxiv.org)
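The goal/action separation described above can be sketched as a minimal data model. This is an illustrative reconstruction, not the paper's actual pipeline: the class, method names and IWA strings are invented for clarity, and the real study labels conversations with a GPT-4o classifier rather than hand-built records.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationLabels:
    """Illustrative record of the dual labelling the study describes:
    one set of IWAs for what the user wanted, one for what the AI did."""
    conversation_id: str
    user_goal_iwas: set[str] = field(default_factory=set)   # what the user was trying to accomplish
    ai_action_iwas: set[str] = field(default_factory=set)   # what the AI actually performed

    def automation_signal(self) -> set[str]:
        # IWAs the AI performed directly, regardless of the user's stated goal
        return self.ai_action_iwas

    def augmentation_signal(self) -> set[str]:
        # IWAs where the AI assisted a human goal but did not itself perform the activity
        return self.user_goal_iwas - self.ai_action_iwas


# Example: a user preparing customer communication (goal) where the AI drafts the text (action)
labels = ConversationLabels(
    conversation_id="c-001",
    user_goal_iwas={"Communicate with customers"},
    ai_action_iwas={"Prepare written materials"},
)
print(labels.augmentation_signal())  # {'Communicate with customers'}
print(labels.automation_signal())    # {'Prepare written materials'}
```

Keeping the two label sets separate is what lets an analysis report augmentation and automation as distinct signals rather than a single "AI touched this task" flag.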

The AI applicability score (what it captures)​

The score aggregates:
  • Coverage: Fraction of an occupation’s IWAs that appear frequently enough in Copilot data to matter.
  • Completion: How often the AI’s outputs are judged successful (thumbs or classifier).
  • Scope: Whether AI’s use covers a moderate or large share of the work activity.
    Combined, these produce a single occupation‑level measure that indicates how much of a job’s work activities already align with Copilot’s capabilities. The score does not claim to predict headcounts, wage impacts, or long‑term displacement — only where the technical overlap presently exists. (arxiv.org)
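As a rough illustration of how the three signals could roll up into one occupation-level number, consider the toy aggregation below. The paper's exact formula differs; the thresholds, field names and weighting here are assumptions made for the sketch, not Microsoft's method.

```python
def applicability_score(iwas, coverage_threshold=0.1, completion_threshold=0.5):
    """Toy occupation-level AI applicability score.

    `iwas` is a list of dicts, one per O*NET Intermediate Work Activity, with:
      usage_share  - how often Copilot conversations touch this IWA (coverage signal)
      success_rate - fraction of those uses judged successful (completion signal)
      task_scope   - share of the IWA's tasks the AI affects, in [0, 1] (scope signal)

    An IWA contributes only if it is used often enough and completed
    successfully enough; both thresholds are illustrative, not the paper's.
    """
    if not iwas:
        return 0.0
    qualifying = [
        iwa for iwa in iwas
        if iwa["usage_share"] >= coverage_threshold
        and iwa["success_rate"] >= completion_threshold
    ]
    # Average scope over all of the occupation's IWAs, with
    # non-qualifying IWAs contributing zero.
    return sum(iwa["task_scope"] for iwa in qualifying) / len(iwas)


# A translator-like occupation: most activities are covered, successful, broad in scope
translator = [
    {"usage_share": 0.60, "success_rate": 0.8, "task_scope": 0.7},
    {"usage_share": 0.40, "success_rate": 0.7, "task_scope": 0.6},
    {"usage_share": 0.05, "success_rate": 0.9, "task_scope": 0.5},  # too rarely used to count
]
print(round(applicability_score(translator), 3))  # 0.433
```

The thresholding step mirrors the intuition in the text: an activity only counts toward an occupation's score when the AI is used for it frequently *and* succeeds, and the final number scales with how much of the job those qualifying activities represent.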

Key findings: who’s at the front lines and who’s on the margins​

High‑applicability occupations — the top 40​

The occupations with the highest AI applicability scores skew toward language, analysis, clerical, sales and customer-communication roles. At the very top of the ranking (highest scores) are:
  • Interpreters and Translators
  • Historians
  • Passenger Attendants
  • Sales Representatives (Services)
  • Writers and Authors
  • Customer Service Representatives
  • CNC Tool Programmers
  • Telephone Operators
  • Ticket Agents and Travel Clerks
  • Broadcast Announcers and Radio DJs
    The full top‑40 list includes many roles centered on producing, transforming or delivering information — writing, editing, translating, market research, technical writing, and clerical tasks. These are activities where LLMs excel at summarizing, drafting, translating and retrieving domain knowledge. (arxiv.org)
Industry reporting picked up the same pattern, noting that sales and marketing tasks, routine content generation and scripted customer interactions are already being handled by Copilot at scale. These outlets highlighted the paper’s practical relevance for workforce planning and upskilling. (geekwire.com, investopedia.com)

Low‑applicability occupations — the bottom 40​

Jobs at the other end of the spectrum are dominated by manual, physically intensive and on‑site roles:
  • Phlebotomists, Nursing Assistants, Hazardous Materials Removal Workers, Tire Repairers and Changers, Dishwashers, Roofers, Pile Driver Operators, Dredge Operators, Water Treatment Plant Operators, and similar roles appear among the least affected.
    These occupations rely on physical dexterity, embodied skill, in‑person judgement or complex, safety‑critical manual operations — areas where current LLMs offer little direct replacement value. (arxiv.org)

Why this matters: immediate, measurable overlap — and real business action​

Two points make Microsoft’s study consequential beyond academic debate:
  • It measures real usage, not hypothetical automation potential. Many prior studies produced exposure estimates by mapping task descriptions to LLM capabilities in principle. Microsoft measures where real users are already asking Copilot to help, and whether those requests complete successfully. That transition from theoretical mapping to observed behaviour is important for short‑term labour planning. (arxiv.org)
  • Organizations are already realizing material savings from AI deployments. Microsoft itself reported large internal cost savings from AI in customer‑service operations; multiple outlets cite the company’s statement that AI‑driven contact‑center efficiencies saved more than $500 million in the prior year. Those operational savings are not hypothetical and — when paired with workforce restructuring — are part of the real downstream dynamics labour markets face. (Readers should note that savings figures are company disclosures reported in industry press and aggregated news coverage.) (reuters.com, itpro.com)

Strengths of the Microsoft study​

  • Real‑world behavioural data: Using 200k Copilot conversations anchors the analysis in live user behaviour rather than modelled exposure. That is a major methodological advance for near‑term impact assessment. (arxiv.org)
  • Task‑level granularity with O*NET mapping: Using IWAs makes the mapping between AI activity and occupations more robust than assigning single tasks to conversations. This reduces over‑assignment noise and captures cross‑occupational activities. (arxiv.org)
  • Separation of user goal vs AI action: Distinguishing what users ask for from what AI does allows the study to separately estimate augmentation vs automation signals — a critical nuance for policy and management responses. (arxiv.org)
  • Multiple measures of success: Combining explicit user feedback (thumbs) with automated completion classifiers helps to control for feedback bias and to provide a measured, multi‑signal sense of when the AI “really” delivered. (arxiv.org)

Limits, caveats and risks — what the study does not (and cannot) show​

No single dataset tells the whole story. Microsoft’s paper is rigorous, but it also comes with important boundaries that should temper interpretations.
  • Platform and user sample bias: The analysis is entirely conditioned on Bing Copilot usage in the United States in early‑to‑mid 2024. Copilot’s connection to Bing search likely increases the prevalence of information gathering tasks in the sample. This means the findings capture Copilot’s user base and use‑cases, not the universe of AI systems or global usage patterns. Extrapolating beyond Copilot or beyond the US should be done cautiously. (arxiv.org)
  • Activity overlap ≠ headcount prediction: The AI applicability score measures technical overlap between current AI actions and job activities. It does not — and cannot — predict whether companies will reduce staff, reassign work, or create new roles. Past automation episodes (e.g., ATMs and bank tellers) show that productivity gains sometimes expand employment in unexpected ways. The study explicitly warns against equating high applicability with inevitable job loss. (arxiv.org)
  • Downstream business decisions are unobserved: Microsoft’s dataset shows how users use AI, not how firms change hiring, compensation or organizational structures in response. Those downstream effects are the central unknown that determines whether AI will be predominantly augmentative or displacing. (arxiv.org)
  • Modal and technical blind spots: The study focuses on LLM text interactions and does not account for non‑text AI (vision, robotics, on‑device automation) that can affect heavy‑equipment and manufacturing roles. Similarly, improvements in embodied AI and robotics could change the exposure of manual jobs in ways not captured by Copilot text chats. (arxiv.org)
  • Error modes and trust: Generative models still hallucinate, misrepresent facts, and exhibit biases. Where AI produces high volumes of output, the human cost of oversight, checking and remediation may rise — a friction not measured by simple thumbs‑up signals. Overreliance on AI outputs without rigorous verification is a real operational hazard in high‑stakes domains. (arxiv.org)

What the results imply for workers, employers and policymakers​

For workers​

  • Prioritize tasks that remain hard to automate: interpersonal negotiation, deep domain expertise, client relations, complex project management, embodied craft and leadership remain comparatively resilient.
  • Reskill deliberately: invest in prompt literacy, AI oversight, data interpretation, domain specialties and cross‑disciplinary skills that combine human judgement with AI speed.
  • Shift to higher‑value activities: use AI for boilerplate and routine work; your human value will increasingly come from synthesis, ethics, persuasion and strategic judgement. (arxiv.org)

For employers​

  • Audit roles and task decomposition: target automation where it safely reduces routine load and redeploy human talent to higher‑value activities.
  • Design reskilling pathways: plan for internal mobility, on‑the‑job AI upskilling and role redesign to capture productivity gains without sudden workforce shocks.
  • Govern AI deployment: set rules for verification, data privacy, human oversight and explainability before relying on GenAI for customer‑facing or regulated tasks. (arxiv.org)

For policymakers​

  • Monitor labour market signals: job posting trends, wage shifts and occupational transitions should be tracked to spot early dislocations and to design targeted training programs.
  • Support transition: public investment in retraining, portable benefits and bridging programs can reduce harm from rapid organizational adoption.
  • Regulate by risk: require stronger verification, transparency and audit controls where AI decisions affect safety, legal rights or economic livelihood. Multiple recent academic and field experiments show productivity benefits from Copilot‑style tools — but they do not eliminate the need for oversight. (arxiv.org)

Critical analysis: what the paper gets right — and what remains worrying​

Notable strengths​

  • The study pushes the field forward by grounding exposure claims in actual, large‑scale behavioural data rather than abstract task‑mapping alone. This makes it especially useful for short‑term operational planning and for understanding how employees choose to use AI today. The separation of user goal vs AI action is a methodological innovation that improves interpretability. (arxiv.org)
  • Corroboration with prior exposure estimates is strong at broad levels: Microsoft reports a high correlation with earlier human‑rated exposure metrics (Eloundou et al.), which suggests the result is not an outlier but part of a convergent understanding that knowledge work is heavily exposed to LLM capabilities. (arxiv.org)

Risks and potential blind spots​

  • Because the study is Copilot‑based, industry and platform concentration effects matter: if one or two vendors control tooling deeply integrated into enterprise workflows, the distribution of benefits and harms may be concentrated as well. This raises competition and labor power questions that the study does not address directly. (arxiv.org)
  • Speed of adoption vs safety: real cost‑savings (Microsoft’s reported contact‑center $500M figure) show that companies are already reaping measurable financial gains. Rapid adoption without careful human‑in‑the‑loop safeguards could accelerate workforce displacement and increase inequality. The public evidence of cost savings paired with layoffs in some sectors underscores how quickly operational choices can translate into personnel decisions. (reuters.com)
  • Uneven regional and sectoral impacts: the study is US‑centric and text‑centric. Countries or sectors that depend more on physical labour or that have different regulatory environments may see very different outcomes. Policy responses will therefore need to be local and sector‑sensitive. (arxiv.org)

Practical checklist for organizations deploying GenAI now​

  • Map tasks, not just roles: decompose roles into IWAs and prioritize automation where it is safe and verifiable.
  • Pilot with measurement: track time‑saved, error rates, verification costs and downstream impacts on staffing before scaling.
  • Institute human verification thresholds for high‑risk outputs and assign accountability.
  • Invest in internal re‑training and clear career paths for displaced workers.
  • Publish transparency reports on AI use in customer interactions and employment changes to build trust. (arxiv.org)

Closing assessment​

Microsoft’s Copilot‑based study provides the most concrete early map yet of where generative AI is already helpful — and where it is not. By using real user interactions, validated classifiers and O*NET mapping, the paper moves the conversation from abstract speculation to operationally useful signals: language‑ and information‑centric work shows the highest present‑day overlap with LLM capabilities, while manual, embodied and equipment‑centric work remains the most protected. That finding is consistent across independent reporting and academic work.
Yet the most consequential unknowns are not technical: they are organizational and political. The study intentionally stops short of forecasting headcount changes or wage impacts because those outcomes depend on firm choices, public policy and labour responses. Measurable cost savings and adoption (including corporate disclosures of large contact‑center savings) show that decisions will be made quickly — and often privately — unless there is stronger stakeholder oversight.
The practical takeaway is clear and urgent: AI will increasingly be an augmenting force for many jobs — but the net benefit to workers will depend on adaptation, reskilling and governance. Successful AI adoption in organisations will be defined by the combination of technical augmentation and humane, forward‑looking workforce strategy.

Note: the analysis above draws on Microsoft Research’s public report and arXiv preprint (the AI applicability methodology, data window, classification approach and Top/Bottom 40 occupation lists), complemented by independent industry coverage and contemporaneous reporting on corporate AI savings. Where corporate cost‑saving numbers (for example, reported contact‑center savings of approximately $500 million) appear, those figures are drawn from company disclosures reported in major business press and market reporting; they reflect company statements and aggregate reporting rather than causal, peer‑reviewed labour‑market studies. (arxiv.org, microsoft.com, geekwire.com, reuters.com)

Source: Cloud Wars AI and the Future of Work: Microsoft Identifies Jobs Most Vulnerable to GenAI
 
