Microsoft's Copilot has quietly widened its model menu: users can now run Anthropic’s Claude Sonnet 4 and Claude Opus 4.1 inside Copilot’s Researcher agent and as selectable models in Copilot Studio, giving organizations a straightforward way to compare OpenAI, Anthropic, and Microsoft’s own models inside the same Copilot experience.
Background / Overview
Microsoft’s Copilot has evolved from a single-model assistant into a multi-model orchestration layer. Originally built around OpenAI’s powerful reasoning models, Copilot now supports a mix-and-match approach that lets organizations choose the underlying model for specific agents and tasks. The two newest additions — Claude Sonnet 4 and Claude Opus 4.1 — are presented as external models in Copilot, meaning they run under Anthropic’s terms and are not hosted on Microsoft’s internal model servers.
This change is significant on three levels. First, it reflects Microsoft’s broader strategy of vendor diversification and risk management. Second, it gives enterprise users richer choices to optimize cost, accuracy, or safety for different workflows. Third, it raises practical governance, compliance, and integration questions that IT leaders must address before rolling these models into production.
What changed: where and how to try the Claude models
Copilot Researcher: a quick switch
Researcher is Copilot’s multistep agent designed for deep, research-style tasks — synthesizing documents, developing plans, and answering complex queries. For Microsoft 365 Copilot subscribers, a new option appears in the Researcher UI: a Try Claude button in the top-right corner. Clicking it routes the Researcher agent’s reasoning to Claude Opus 4.1 instead of the default OpenAI reasoning models.
This is intended as a low-friction way to compare outputs and reasoning behavior without reconfiguring agents or building new workflows.
Copilot Studio: pick-and-mix agent design
Copilot Studio — Microsoft’s platform for building and customizing Copilot agents — now lists Claude Sonnet 4 and Claude Opus 4.1 as available model choices. When creating or editing an agent, the Studio Details pane includes an Agent’s model field. Clicking the ellipsis next to that field opens a model selection window where admins and developers can choose:
- the default OpenAI model (for example, GPT-4o),
- another OpenAI model,
- or one of the newly listed Claude models.
Important operational note
Both Claude models are flagged as external, which signals that they are operated under Anthropic’s terms of service and not hosted directly by Microsoft. That distinction has practical implications for data flows, contractual compliance, and legal jurisdiction.
Why this matters: strategic and practical implications
Vendor diversification and resilience
Microsoft’s move to add Anthropic models to Copilot is part of a deliberate diversification strategy. Relying on a single model supplier concentrates technical and business risk — from pricing changes to cloud-provider shifts — so enabling multiple high-quality models reduces single-vendor exposure and increases negotiation leverage.
- Business continuity: If one provider experiences outages or policy changes, teams can fail over to another model.
- Cost optimization: Different models and pricing tiers can be chosen for different workloads to manage spend.
- Capability hedging: Models differ in strengths — some excel at reasoning and structured outputs, others at tone or instruction-following. Multi-model deployment lets organizations pick the best tool for each job.
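The continuity point above can be sketched in a few lines of routing logic. Everything here — the provider names, the `call_model` stub, the error type — is hypothetical; Copilot handles routing internally, and this only illustrates the failover idea:

```python
# Hypothetical failover router across model providers. call_model is a
# stand-in for a real provider API call; "down-provider" simulates an outage.

class ProviderError(Exception):
    pass

def call_model(provider: str, prompt: str) -> str:
    """Placeholder for a real provider API call."""
    if provider == "down-provider":
        raise ProviderError(f"{provider} unavailable")
    return f"[{provider}] answer to: {prompt}"

def route_with_failover(providers: list[str], prompt: str) -> str:
    """Try each provider in preference order; fall back on failure."""
    last_err = None
    for provider in providers:
        try:
            return call_model(provider, prompt)
        except ProviderError as err:
            last_err = err  # in practice: log, alert, then try the next provider
    raise RuntimeError(f"all providers failed: {last_err}")

result = route_with_failover(["down-provider", "backup-provider"], "summarize Q3")
```

The preference list is where capability hedging lives: different workloads can carry different orderings.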
Interoperability and developer productivity
Copilot Studio’s model-agnostic design simplifies experimentation. Developers and product teams can A/B model behavior inside the same agent framework, iterate quickly, and surface which model performs best for a business process without heavy integration work.
This reduces time-to-value for AI-enabled workflows and lowers the friction of multicloud or multi-model strategies.
Trust, safety, and enterprise governance
While flexibility improves capability, it also raises governance complexity. Every model comes with different behavior, safety mechanisms, and contractual terms. Organizations must evaluate:
- Data handling: External models may receive or process user data under vendor-specific terms; this can affect compliance with data protection laws and internal policies.
- Auditability and logging: Ensuring consistent logging, provenance, and audit trails across model providers is essential for enterprise oversight.
- Safety and alignment: Different models can produce divergent or inconsistent recommendations; risk-sensitive applications require careful testing and guardrails.
Technical and operational considerations
Data residency and egress
Because the Claude models are marked external, requests to those models will typically exit Microsoft’s internal model hosting and flow to Anthropic endpoints. That introduces:
- potential data egress and cross-border transfer considerations,
- the need to confirm whether sensitive or regulated data is permitted to flow to external endpoints,
- contract and policy checks to ensure compliance with industry regulations such as financial or healthcare rules.
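One way to operationalize these checks is an egress gate that consults data classification before a request may leave for an external endpoint. The labels and policy below are illustrative placeholders, not Microsoft’s or Anthropic’s actual controls:

```python
# Illustrative egress gate: refuse to send restricted data classes to
# external model endpoints. Labels and policy are hypothetical examples.

ALLOWED_FOR_EXTERNAL = {"public", "internal"}  # example policy only

def may_egress(classification: str) -> bool:
    """True only if policy permits sending this data class externally."""
    return classification.lower() in ALLOWED_FOR_EXTERNAL

def send_to_model(prompt: str, classification: str, external: bool) -> str:
    """Route a request, blocking external calls for restricted data."""
    if external and not may_egress(classification):
        return "BLOCKED: data class not approved for external endpoints"
    target = "external endpoint" if external else "internal endpoint"
    return f"sent to {target}"
```

A real deployment would hook this into the platform’s DLP and labeling machinery rather than a hand-rolled function, but the decision point is the same.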
Latency and reliability
External model calls may add latency or variance compared with models hosted within a cloud region close to the customer’s services. Organizations should test round-trip times and error modes under expected loads to ensure SLAs are acceptable.
Cost and billing transparency
Different providers have different pricing models and metering methods. Using external models can complicate cost allocation:
- estimate cost per API call for each model,
- account for any platform fees in Copilot Studio or Azure,
- monitor usage and set quotas to avoid unexpected spend.
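A minimal cost meter makes these points concrete. The per-1K-token prices below are made-up placeholders — real rates vary by provider, model, and tier:

```python
# Sketch of per-model cost tracking with a monthly quota.
# Prices are hypothetical placeholders, not real provider rates.

PRICE_PER_1K_TOKENS = {"model-a": 0.01, "model-b": 0.03}  # example USD rates

class CostMeter:
    def __init__(self, monthly_quota_usd: float):
        self.quota = monthly_quota_usd
        self.spent = 0.0

    def record(self, model: str, tokens: int) -> float:
        """Meter one call and return its cost."""
        cost = PRICE_PER_1K_TOKENS[model] * tokens / 1000
        self.spent += cost
        return cost

    def over_quota(self) -> bool:
        return self.spent > self.quota

meter = CostMeter(monthly_quota_usd=100.0)
meter.record("model-a", 50_000)
meter.record("model-b", 200_000)
```

In practice the quota check would gate or alert before a call is made, not merely report afterward.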
Model outputs, hallucination, and consistency
Each model family has unique behaviors: response length, factuality tendencies, instruction-following fidelity, and handling of ambiguous queries. When multiple models are used across a single product, inconsistencies in output style or accuracy can confuse end users.
- Establish clear acceptance criteria for model outputs.
- Create content normalization layers if consistent tone and formatting are required.
- Use synthetic tests and real-world sample datasets to benchmark hallucination rates and factual errors.
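A benchmarking harness for the last point can be very small: run the same labeled questions through each candidate and compare grounded-answer rates. The `fake_model_*` functions below are stand-ins for real model calls:

```python
# Tiny benchmark harness: score two stand-in "models" against a labeled
# test set. Replace the fakes with real model calls in practice.

def fake_model_a(q: str) -> str:
    return {"capital of France?": "Paris"}.get(q, "unsure")

def fake_model_b(q: str) -> str:
    return {"capital of France?": "Lyon"}.get(q, "unsure")  # simulated error

LABELED = [("capital of France?", "Paris")]

def accuracy(model, dataset) -> float:
    """Fraction of questions where the model's answer matches the label."""
    correct = sum(1 for q, gold in dataset if model(q) == gold)
    return correct / len(dataset)
```

Real factuality scoring usually needs fuzzier matching (entailment checks, human review) than exact string equality, but the side-by-side structure is the same.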
Security, privacy, and compliance checklist
Before enabling Claude models in production, IT and security teams should run through the following checklist:
- Confirm contractual terms and acceptable use policies for Anthropic when used via Copilot.
- Map data flows: identify what data leaves Microsoft-owned systems and whether it is permitted by privacy policies and regulations.
- Apply data classification filters: ensure that confidential or regulated data is blocked or anonymized before being sent to external models.
- Enable robust logging and retention to capture model request/response pairs where permitted for audits and debugging.
- Implement DLP (Data Loss Prevention) rules to prevent leakage of sensitive information into model prompts.
- Conduct a security review of the integration, including penetration testing for potential exfiltration paths.
- Validate that vendor SLAs, incident response, and liability terms meet organizational risk tolerances.
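As a sketch of the DLP item in the checklist, a prompt-side filter can redact patterns that look like sensitive identifiers before anything leaves the tenant. The two patterns below are illustrative, not a complete DLP policy:

```python
import re

# Illustrative prompt-side DLP: redact identifier-like patterns before a
# prompt is sent to an external model. Patterns are examples only.

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(prompt: str) -> str:
    """Replace each matched pattern with a labeled redaction marker."""
    for name, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED:{name}]", prompt)
    return prompt
```

Production DLP would rely on the platform’s classification labels and validated detectors rather than a handful of regexes, but the interception point — before egress — is the key design choice.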
Governance: who decides which model to use?
Multi-model flexibility needs governance rules to avoid chaos. A practical governance model includes:
- Policy owners: Assign a cross-functional team (IT security, legal, product, and business owners) to define model-selection policies.
- Model profiles: Define profiles by risk level (low, medium, high) that map types of data and tasks to approved models.
- Approval workflow: Require Change Advisory Board (or equivalent) approval for moving a model into production for sensitive workflows.
- Continuous monitoring: Implement model telemetry and quality checks to detect regressions, drift, or anomalous behavior.
- User education: Communicate to knowledge workers which model is active and what the practical differences are.
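The model-profiles idea above can be expressed as a simple mapping from risk tier to approved models. Tiers and model identifiers here are illustrative, not an endorsement of any particular assignment:

```python
# Illustrative risk-tier model profiles: which models a workload at each
# risk level may use. Tier assignments are hypothetical examples.

PROFILES = {
    "low":    {"claude-sonnet-4", "claude-opus-4.1", "gpt-4o"},
    "medium": {"claude-opus-4.1", "gpt-4o"},
    "high":   {"gpt-4o"},  # e.g. only the internally hosted default
}

def approved(model: str, risk: str) -> bool:
    """True if policy allows this model for workloads at this risk tier."""
    return model in PROFILES.get(risk, set())
```

Encoding the policy as data rather than scattered conditionals makes it auditable and easy for the policy owners to review and version.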
Practical guide: how to test Claude models in your environment
Quick compatibility checklist
- Confirm you have a Microsoft 365 Copilot subscription and appropriate admin permissions.
- Ensure Copilot Studio access for teams building custom agents.
- Clarify data classification and enable any required DLP filters before sending real data.
Step-by-step: try Opus 4.1 in Researcher
- Open Copilot and invoke the Researcher agent for a complex, multi-step task (for example, a literature synthesis or cross-document analysis).
- Look to the top-right corner of the Researcher UI for the Try Claude button.
- Click Try Claude to route Researcher’s reasoning to Claude Opus 4.1.
- Run identical prompts with the default OpenAI model and with Opus 4.1, capturing responses for comparative analysis.
- Evaluate outputs using predetermined acceptance criteria: factual accuracy, completeness, bias indicators, and actionable recommendations.
Step-by-step: select Claude models in Copilot Studio
- Open Copilot Studio and select or create an agent.
- In the agent’s Details section, find Agent’s model.
- Click the ellipsis next to Agent’s model to open the model selection window.
- Use the drop-down to choose among available models — including Claude Sonnet 4 or Claude Opus 4.1 — or the default OpenAI model.
- Save the agent and run unit tests and integration tests that reflect intended production workloads.
- Use Studio’s versioning and rollout controls to stage the model release to a pilot group before wide deployment.
Measuring success: what to test and monitor
- Accuracy and factuality: Use labeled test sets to measure how often the model produces grounded, correct information.
- Response quality: Rate outputs for usefulness, readability, and completeness.
- Latency: Monitor average and tail latencies for model calls under expected loads.
- Cost per task: Track expenses associated with each model for comparable workloads.
- Safety metrics: Record instances where outputs violate safety or content policies.
- User satisfaction: Collect feedback from pilot users to assess whether model outputs meet business needs.
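For the latency metric above, average and tail values can be computed from recorded call durations. This is a plain sketch of the calculation, not a production monitoring pipeline:

```python
# Compute average and p95 (tail) latency from call durations in milliseconds.

def latency_stats(samples_ms: list[float]) -> dict:
    """Return mean and nearest-rank 95th-percentile latency."""
    ordered = sorted(samples_ms)
    # nearest-rank p95: the value 95% of the way through the sorted list
    p95_index = min(int(0.95 * len(ordered)), len(ordered) - 1)
    return {
        "avg_ms": sum(ordered) / len(ordered),
        "p95_ms": ordered[p95_index],
    }

stats = latency_stats([120, 130, 110, 500, 125, 115, 140, 135, 128, 122])
```

Note how a single slow external call dominates the tail while barely moving the average — which is why the article recommends monitoring both.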
Strengths: what organizations gain
- Choice and flexibility: Teams can select the model that best fits a use case rather than one-size-fits-all.
- Competitive leverage: Microsoft strengthens its enterprise value proposition by offering more than a single supplier.
- Faster experimentation: Copilot Studio’s mixed-model capabilities accelerate evaluation and iteration.
- Resiliency: Multi-provider deployments reduce exposure to supplier outages or sudden commercial changes.
- Potential for safer outputs: Different models come with different safety architectures; for some tasks, a Claude model may provide preferable guardrails or behavior.
Risks and open questions
- Data governance complexity: External model use complicates data residency and contractual compliance, especially for regulated industries.
- Model consistency: Divergent output styles across models can create UX and operational friction.
- Hidden costs: Mixing models may increase overall spend if not actively monitored and optimized.
- Vendor dependence through orchestration: Although multiple vendors reduce single-vendor risk, the orchestration platform itself (Copilot/Copilot Studio) becomes a single point of control — its policies, UI changes, or pricing can still materially affect operations.
- Legal and contractual ambiguity: External model flags mean vendor-specific terms apply; organizations must confirm liability, IP ownership, and permitted uses.
- Unverifiable model claims: Marketing claims about “frontier” models or superior reasoning should be validated with head-to-head testing; vendor descriptions of safety and training data provenance are often high-level and sometimes unverifiable.
Recommended rollout plan for enterprise adoption
- Discovery and scoping: Inventory potential Copilot use cases and classify them by sensitivity and regulatory risk.
- Pilot design: Select a diverse set of pilot use cases (documentation summarization, customer support drafts, internal research) to test various capabilities.
- Governance baseline: Define policies for permitted data, logging, retention, and incident response when using external models.
- Technical integration tests: Validate latency, error handling, and fallbacks for each model under test.
- Security review: Confirm DLP, encryption in transit, and endpoint controls are in place before sending any sensitive data.
- User feedback loop: Run iterative pilots with real users and gather structured feedback on outputs and behavior.
- Scale with controls: Gradually expand usage using quotas, monitoring, and automated policy enforcement, and keep a clear rollback path.
- Ongoing review and model re-evaluation: Maintain a quarterly review to reassess model choices, costs, and emergent risks.
Real-world scenarios and recommendations
- For low-risk, high-volume tasks (e.g., meeting summaries or routine drafting), experimenting with external models may yield cost or quality advantages; enforce anonymization and DLP.
- For regulated workflows (finance, healthcare), keep processing within approved providers or restrict external models to non-sensitive preview tasks until contractual and compliance gaps are closed.
- For developer or research use, exploit Copilot Studio’s mix-and-match features to optimize model selection by subtask and minimize vendor lock-in.
- Where consistent output is critical (legal wording, official communications), standardize on a single vetted model or place a normalization layer after model output.
The bigger picture: what this means for Microsoft, Anthropic, and the AI ecosystem
Microsoft’s integration of Claude models into Copilot is a tactical step with strategic implications. It deepens enterprise ties with Anthropic while signaling to the market that Copilot is a multi-model platform rather than a single-vendor gateway. For Anthropic, availability inside Copilot exposes Claude models to millions of enterprise users in a familiar productivity context. For OpenAI, it raises competitive pressure to maintain technical differentiation and enterprise-grade controls.
For IT leaders, the change is a reminder that the AI stack is entering a phase of rapid composability: platforms will mix and match models, tooling, and hosting arrangements to meet diverse customer needs. That flexibility is powerful, but it increases the onus on organizations to manage complexity, maintain control over sensitive data, and design robust governance.
Conclusion
The arrival of Claude Sonnet 4 and Claude Opus 4.1 inside Microsoft Copilot — both via a single-click Researcher option and as selectable models in Copilot Studio — turns Copilot into a more explicitly multi-model platform. That’s good news for organizations that want flexibility, experimentation, and reduced supplier concentration. It also creates a set of new questions about data flows, compliance, cost, and consistent user experience that IT, security, and legal teams must address.
Adopting multiple foundation models can be a competitive advantage when paired with rigorous governance, clear selection criteria, and disciplined monitoring. Organizations that treat this as a controlled transformation — running side-by-side pilots, validating accuracy, and enforcing data controls — will be best positioned to capture the benefits while minimizing the risks. The era of one-model-fits-all in enterprise productivity is ending; practical, policy-driven multi-model adoption is the pragmatic next step.
Source: ZDNET Microsoft Copilot now offers Claude models - how to try them