Microsoft has quietly given Copilot on Windows a practical new ability: Vision can now accept typed inputs and return typed outputs. You can share one or more app windows or a desktop region and type questions about what Copilot sees, with replies appearing in the same Copilot chat pane rather than only as spoken coaching.
Copilot’s Vision capability is Microsoft’s effort to make the PC assistant truly multimodal — able to see what’s on your screen, hear your voice, and act (in limited, permissioned ways) across apps. Early Vision previews were oriented around voice-first, coached interactions: you shared a window, spoke your question, and Copilot narrated analysis and offered step‑by‑step guidance, sometimes with visual “Highlights” that pointed to UI elements. That voice-first model solved many hands‑free scenarios but left gaps for quiet environments, meetings, or users who prefer or need typed interactions.

The October 28, 2025 Windows Insider announcement formalizes the next step: text‑in/text‑out for Vision, delivered as an update to the Copilot app via the Microsoft Store. Microsoft frames this change as a staged preview for Windows Insiders while it gathers feedback.
What precisely changed — the features at a glance
- Text‑in / text‑out Vision: Initiate a Vision session and type questions about the shared app or screen; Copilot replies in text inside the same chat pane instead of speaking aloud.
- Seamless modality switching: At any time you can press the microphone icon and convert a typed Vision session into a voice session, preserving conversation context so the thread continues uninterrupted.
- Permissioned, session‑bound sharing: Vision only sees what the user explicitly selects to share for that session; UI feedback (a glow) visibly marks the shared window.
- Staged rollout via Microsoft Store: The capability is included in Copilot app package version 1.25103.107 and later, and is being rolled out to Windows Insider channels in waves. Not every Insider will see it immediately.
- Some preview limitations: The initial text Vision release does not support the Highlights overlays that visually point to UI elements — Microsoft is exploring how to integrate visual cues into typed flows without degrading clarity or privacy.
How to start a text Vision session (step‑by‑step)
- Update the Copilot app from the Microsoft Store and confirm the app package is 1.25103.107 or later.
- Open the Copilot composer (from the Copilot app or taskbar quick view).
- Click the glasses (Vision) icon in the composer.
- Toggle off "Start with voice" to begin in text mode.
- Select the app window or desktop region to share — the selected area will glow to indicate what Copilot can see.
- Type your question into the chat composer; Copilot will analyze the shared view and reply in text. Press the mic icon at any time to switch to voice.
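Because the package minimum recurs throughout this preview, the version check in the first step is worth getting right: naive string comparison would rank "1.9" above "1.25103.107". A minimal Python sketch of the comparison logic (how you obtain the installed version string, for example from PowerShell's `Get-AppxPackage` output, depends on your environment):

```python
MIN_VISION_VERSION = "1.25103.107"  # minimum Copilot app package for text Vision, per the Insider post

def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted package version like '1.25103.107' into comparable integers."""
    return tuple(int(part) for part in version.split("."))

def supports_text_vision(installed: str) -> bool:
    """True when the installed Copilot package meets the text Vision minimum."""
    return parse_version(installed) >= parse_version(MIN_VISION_VERSION)
```

Comparing integer tuples rather than raw strings ensures that, say, build 1.25104.0 correctly ranks above 1.25103.107.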
Why this update matters — practical benefits
- Discretion and context without sound: Text Vision removes the awkwardness of speaking aloud in meeting rooms, open offices, or public spaces while still allowing Copilot to use visual context. That preserves the benefits of screen‑aware assistance when voice is inappropriate.
- Accessibility parity: Users who are deaf, hard of hearing, or who have speech impairments — or who simply prefer typing — can now get the same screen‑aware help as voice users, improving inclusion across abilities.
- Searchable, persistent record: Typed conversations leave a textual trail you can copy, search, or export. For troubleshooting or documentation tasks, that record is more actionable than ephemeral spoken guidance.
- Flexible workflows: Start in quiet text mode, then flip to voice for hands‑free follow‑ups — a workflow that was previously awkward or impossible with a voice‑only Vision preview.
- Broader adoption scenarios: The change widens Vision’s usable contexts — classrooms, library environments, shared desks, and compliance‑sensitive settings where audible responses are restricted.
Technical verification — what’s supported and what runs where
Microsoft’s public documentation and preview notes make several technical claims that shape expectations:
- Versioning and rollout: The Windows Insider blog post states the text Vision feature is included in Copilot app version 1.25103.107 and higher, and is rolling out to Insiders via the Microsoft Store. Testers should check the Copilot app’s About panel or Microsoft Store history to confirm the package on their device.
- Session scope and consent: Vision is explicitly session‑bound and opt‑in: you must choose which window(s) or desktop region Copilot can see for the duration of the session. The UI shows a glow around shared windows as confirmation.
- Cloud + on‑device hybrid: Many Vision capabilities (OCR, contextual summarization, and deep reasoning) are cloud‑backed on typical Windows 11 PCs. Microsoft describes a Copilot+ hardware tier with dedicated neural processing units (NPUs) to enable lower‑latency, more private on‑device inference for certain workloads. Expect different performance and privacy trade‑offs depending on whether your device is Copilot+ certified.
- Limitations in the preview: The current text Vision preview intentionally omits visual Highlights and some previously demonstrated overlays, while other Vision features remain gated or staged for gradual rollout.
Important caveats and real risks
The feature expands capability, but it also raises measurable risks for users and IT teams. These are important to understand before wide adoption.
1) Privacy and telemetry tradeoffs
Microsoft’s guidance emphasizes that Vision is opt‑in and session‑scoped, but that does not eliminate all exposure risk. While images and raw audio captured during sessions are not used to train models (per Microsoft’s statements in support documentation), conversation transcripts and assistant responses are logged for safety monitoring and (in some cases) persisted in Copilot history until removed. That means sensitive on‑screen data can be captured in a chat transcript unless users or admins take steps to delete or restrict history. Treat Vision like a controlled screen‑share — what you show can be recorded in conversation context.
2) Missing visual cues in text mode
The initial text Vision release omits the Highlights overlays that make voice Vision more intuitive for pointing out UI elements. Without visual highlights, certain “show me how” scenarios (e.g., guiding you to a nested settings toggle) are less straightforward in text, and users may need to rely on more verbose instructions. Microsoft says it’s exploring ways to integrate visual cues with typed conversations; that’s an unresolved UX problem in this preview.
3) Enterprise policy & managed identities
Some enterprise tenants and managed identities (for example, certain Entra ID configurations) may be excluded or limited from Copilot Vision capabilities while Microsoft and customers sort out policy, compliance, and contractual restrictions. Organizations should not assume parity with consumer rollouts; test in your environment and consult vendor guidance before enabling on managed endpoints.
4) Cloud dependency, latency and resource differences
On non‑Copilot+ hardware, heavy reasoning and OCR are performed in the cloud, which introduces network latency and dependency on Microsoft cloud services. That affects responsiveness for interactive Vision sessions and raises additional enterprise concerns about data traversing networks. Copilot+ devices promise lower latency and more on‑device inference, but those benefits require compatible hardware and vendor certification.
5) Accuracy, hallucination, and security exposure
Any visual understanding layer can misread or misinterpret what is on screen. OCR mistakes, mis‑identified UI elements, or hallucinated summaries can lead to incorrect guidance. Worse, if a user shows a malicious page or sensitive credentials, Copilot may echo or summarize that content in the chat — creating potential data‑exfiltration or social‑engineering vectors. Admins should treat Vision sessions as high‑sensitivity events and provide clear user training and DLP controls.
Cross‑checks and independent validation
The Windows Insider team’s official announcement is the primary verification for the feature and its rollout mechanics. Independent outlets tracked the same elements — the staged rollout, the minimum app package, the modality switch behavior, and the omission of Highlights in the initial text mode — which provides corroboration beyond Microsoft’s own post. Reporting from reputable outlets confirmed the package version and that this is rolling to Insiders rather than being a global GA release yet.

A community of testers and Windows‑focused reporting hubs has also documented practical behaviors and caveats observed in Insider builds, which helps validate the user flow described in Microsoft’s blog. Those independent tests reported the visible glow around shared windows, the toggle to disable "Start with voice," and the ability to switch to voice mid‑session — matching the official documentation.
Where claims are company‑sourced (for example, usage statistics asserting higher engagement with voice), they should be treated as directional until external analytics confirm them. Early marketing metrics are useful but not the same as independent usage studies.
Practical guidance: what end users should do now
- Update and verify: Install the latest Copilot app from the Microsoft Store and confirm your app package shows 1.25103.107 or later before testing the text Vision path.
- Treat Vision like screen sharing: Avoid showing passwords, authentication screens, or any PII you wouldn’t share in a video call. Even if images aren’t used to train models, transcripts can persist.
- Use the mic toggle wisely: If you start in text mode and later use voice, be aware that audio capture and transmission rules differ from typed text — ensure microphone and privacy settings are intentionally set.
- Keep a tidy history: If a typed Vision session includes sensitive content, delete that conversation from Copilot history if your policy requires it. Understand retention defaults in your account settings.
- Test on representative devices: Because cloud vs on‑device behavior can vary, test Vision workflows on the kinds of machines your team uses — including older Windows 11 devices and any Copilot+ candidate hardware — to measure latency and correctness.
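The history‑hygiene advice above can be partly automated when a transcript is exported or archived. A hypothetical Python sketch that masks likely‑sensitive strings before a transcript leaves the machine; the pattern set is an illustrative assumption, not a Copilot API, and real deployments should lean on proper DLP tooling:

```python
import re

# Illustrative patterns only; a production redaction pass should be driven by real DLP rules.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "api_key": re.compile(r"\b(?:sk|key)[-_][A-Za-z0-9]{16,}\b"),
}

def redact_transcript(text: str) -> str:
    """Mask likely-sensitive strings in an exported Vision chat transcript."""
    for label, pattern in SENSITIVE_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text
```

Redaction is a backstop, not a substitute for the first rule above: the safest transcript is one that never contained credentials or PII.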
Practical guidance: what IT teams and administrators should do
- Create deployment guardrails: Use Group Policy, MDM, or Entra controls to restrict or allow Copilot Vision on managed endpoints until DLP, compliance, and user training are complete.
- Run a pilot & record telemetry: Measure what gets shared in Vision sessions, where transcripts are stored, and whether exports flow into corporate storage. Build a runbook for deletion, audit, and incident response tied to Copilot histories.
- Update security policies: Incorporate Vision into existing DLP and endpoint policies; treat Copilot session initiation as a privileged event that may require stricter supervision on high‑value hosts.
- Communicate acceptable use: Publish clear user guidance about what not to share with Copilot Vision (credentials, regulated data types, proprietary code snippets) and how to delete chat histories when appropriate.
- Evaluate hardware segmentation: If low latency and on‑device inference are priorities, consider Copilot+ certified PCs for power users, and validate the on‑device/off‑device mapping for the features you intend to use.
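The pilot‑and‑runbook steps above imply some way to flag sessions for follow‑up. A sketch under assumed data shapes (the `VisionSession` record and the watchlist are hypothetical; real inputs would come from your MDM or logging pipeline):

```python
from dataclasses import dataclass

@dataclass
class VisionSession:
    """Hypothetical pilot-telemetry record for one Copilot Vision session."""
    user: str
    shared_app: str
    transcript_retained: bool

# Example watchlist of high-value apps; tune per environment.
HIGH_VALUE_APPS = {"Outlook", "Terminal"}

def audit_worklist(sessions: list[VisionSession]) -> list[str]:
    """Users whose retained Vision transcripts involved a watchlisted app."""
    return sorted({s.user for s in sessions
                   if s.transcript_retained and s.shared_app in HIGH_VALUE_APPS})
```

A worklist like this feeds the deletion/audit runbook: each flagged user's retained transcripts get reviewed, and removed where policy requires.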
Developer and OEM implications
- Opportunity for tighter app integration: Office apps and other productivity tools gain value when Copilot can reason about full-document context (not just the visible viewport). Developers should explore how to expose richer metadata to the Copilot APIs (where available) to avoid relying solely on screen OCR.
- UX research payoffs: The missing Highlights overlay in text Vision highlights a broader UX problem: how to point in an interface without audio or invasive overlays. Designers will need to experiment with mixed cues — annotated screenshots, inline step numbering, or ephemeral pointers that respect privacy and clarity.
- Hardware segmentation matters: OEMs that ship Copilot+ PCs have a marketing and technical incentive to certify NPUs and deliver lower-latency Vision experiences; this will create a tiered Windows AI ecosystem with differentiated user experiences.
Unanswered questions and what to watch next
- When will Highlights be restored to text mode? Microsoft says it’s exploring options but has not committed to a timeline; the presence or absence of Highlights materially affects usability.
- How will enterprise retention defaults be documented and enforced? Admins need clear SLA‑grade documentation on where transcripts and exports are stored and how to control retention; current guidance suggests variability across scenarios.
- What privacy assurances will be codified contractually for commercial tenants? Some enterprise accounts are excluded from select Vision features today, and Microsoft and customers will need clearer contractual language for sensitive industries.
- How reliably will Vision parse complex UI frameworks, custom controls, or DRM‑protected content? Practical accuracy across the long tail of enterprise and legacy apps remains to be tested.
Final analysis — balance of opportunity and caution
The arrival of text‑in/text‑out for Copilot Vision is a concrete, sensible evolution that makes the feature far more usable across real‑world contexts. It remedies a glaring practical limitation of a voice‑first Vision preview and signals Microsoft’s intent to make multimodal interaction the default model for PC assistants. For general consumers and power users, this should increase adoption and usefulness; for accessibility advocates, it’s a clear win.

That said, the feature is explicitly previewed and staged: key UX elements (Highlights) are missing for now, enterprise policy coverage remains incomplete, and the privacy/retention tradeoffs require careful handling by admins and users alike. Copilot Vision is functionally a controlled screen‑share with an AI attached, and it should be treated with the same operational respect as any screen‑sharing or remote‑assistance tool.
The smart path for IT teams and cautious users is deliberate piloting: validate on representative hardware, update policies, train users not to share sensitive content, and watch Microsoft’s follow‑up updates that restore visual cues and clarify enterprise controls. Independent reporting corroborates the rollout mechanics and UX behaviors described by Microsoft, but many operational and legal questions remain for regulated environments.
Microsoft’s rollout of Vision with text‑in/text‑out is an incremental but meaningful step toward a PC experience where the assistant both sees and listens — and where typing is treated as a first‑class way to interact with visual context. The result should be a more flexible, accessible Copilot, provided users and organizations apply the right guardrails as the preview broadens beyond Insiders.
Source: Inshorts Microsoft Copilot gets Vision with text-in, text-out for Windows