Microsoft has begun rolling out a modest but consequential update to the Copilot app on Windows that brings a text-in / text-out path to Copilot Vision for Windows Insiders — meaning you can now share an app or screen with Copilot and type questions about what it sees, with Copilot answering in the same chat window. This change converts a Vision experience that until now was primarily voice-driven into a true multimodal path where typed input, not only speech, is a first-class way to interact with visual context. The update is being distributed as a staged Insider preview through the Microsoft Store.
Background and overview
Microsoft has been evolving Copilot on Windows from a conversational sidebar into a system-level assistant that can listen, see, and (in experiments) act. Over the last year the Copilot app received a native XAML front end, voice wake‑word support, cross‑app connectors, and a growing set of Vision features that let Copilot inspect windows, OCR text, and export results into Office files. The new text-in/text-out Vision rollout is the logical next step: it acknowledges that voice is not always the preferred modality and gives Insiders the ability to type about shared screen content instead.

Key points Microsoft surfaced in the Insider announcement:
- Text-in / text-out Vision: A typed conversation mode for Vision sessions so users can type questions about a shared app or screen and receive text replies in the same chat window.
- How to start: Click the glasses icon in Copilot’s composer, toggle off the “Start with voice” setting, and select the app or screen to share; a visible glow will indicate what is being shared.
- Modality switch: Press the microphone icon during the session to move from text Vision to voice Vision and continue the conversation by speaking.
- Limitations in this preview: Some previously demoed capabilities — notably the “Highlights” visual pointers that draw overlays to show UI elements — are not available in this early text Vision release.
- Rollout: Microsoft is delivering the update as a staged Insider preview via the Microsoft Store; the blog post referenced package version 1.25103.107 or higher as the minimum for this release, although staged distribution means availability will vary by channel and device.
What changed — a practical breakdown
What you can do now (text Vision capabilities)
- Share any app window or your desktop with Copilot and type questions about the shared content.
- Receive text answers directly in the Copilot chat pane without switching to voice.
- Switch mid-session to voice by pressing the mic icon; Copilot will continue the same conversation using voice.
What remains unchanged or limited in this preview
- Visual Highlights are limited: the capability that visually points to UI elements (the “Highlights” overlays) is not supported with text-in Vision in this release; Microsoft says it’s evaluating how to bring visual cues into the text flow. Treat this as a usability limitation for tasks that depend on clear visual pointers.
- Session scope and permission model: Vision remains session‑bound and permissioned — Copilot only sees what you explicitly share. This is consistent across the Vision feature set.
- Availability: The update is staged; not all Insiders will get it immediately. Expect regional or channel gating and server-side feature flags.
How it works: UI and interaction flow
- Open the Copilot composer inside the Copilot app.
- Click the glasses icon to start a Vision session.
- Toggle off the “Start with voice” setting to use text Vision.
- Choose the app window or screen region you want to share — the selected window will display a glow to confirm it’s being shared.
- Type your questions in the chat composer; Copilot will analyze the shared visual contents (OCR, UI recognition, context) and reply in text in the same window.
- If you want to switch to voice, press the mic button — the session becomes a voice Vision session without losing prior context.
- Stop sharing by pressing Stop or the X inside the composer.
Why this matters for users and IT
For everyday users
- Privacy and etiquette: Typing Vision is a clear win for people working in shared spaces, meetings, or public areas where voice is undesirable or disruptive.
- Accessibility: Users with hearing or speech impairments or those who cannot use voice easily now have parity in Vision capability via typed input.
- Precision: Typed queries can be more deliberate and precise — useful for technical asks like “extract the table of expenses and convert to CSV” or “find HEX codes for the color swatches in the shared design window.”
For IT admins and enterprises
- Adoption path: Text Vision lowers a barrier to experimentation — it’s easier to pilot with staff who prefer typed interaction and reduces the risks around accidental audio capture.
- Policy and governance: The permissioned, session-bound nature of Vision is a positive default, but enterprise deployments should still audit Copilot settings, microphone permissions, and the Copilot app distribution channels. Administrators should confirm how conversation history, transcripts, and shared content are stored under corporate policy. Independent reporting shows Microsoft retains conversation history unless users explicitly delete it, and some transcripts may persist in account-level logs — this deserves review in sensitive environments.
Privacy, security, and data flow — what to watch
Copilot Vision’s design choices attempt to balance utility with user control, but the real-world privacy implications depend on defaults and implementation details:
- Session-bound sharing and consent: Vision requires you to choose what to share; it does not run continuously. That is the clearest privacy safeguard.
- Local spotting vs cloud processing: For voice features, Microsoft uses a local wake‑word spotter with a short, transient in‑memory buffer that is not written to disk; heavier speech transcription and reasoning typically execute in the cloud unless you have Copilot+ hardware that runs richer models locally. For text Vision, the reasoning step will usually go to Microsoft’s cloud LLMs. Organizations should verify whether content is processed in region and review contractual terms for data residency and retention.
- Persistence of artifacts: Copilot’s conversation history, exported files, and any transcriptions may persist in the user’s account or on-device storage depending on settings. Users should make conscious choices about Conversation History and where Copilot saves exported artifacts (OneDrive vs local). Enterprises need to map Copilot’s data flows to governance frameworks.
- Accidental sharing risk: The visual glow that appears around a shared window reduces the risk of accidental exposure, but screen sharing mistakes remain possible. Treat Vision like any screen-share tool: verify the correct window, close confidential content in other apps, and use policies that restrict Copilot features on shared or publicly accessible endpoints.
Reliability, edge cases, and known limitations
- Highlights not present in text Vision: If your intended task relies on visual pointers that literally point on-screen to controls and fields, the current text Vision preview will not meet that need — the visual Highlights capability remains voice‑oriented for this release. Microsoft said it’s exploring how to integrate highlights with text Vision in future updates.
- OCR and UI recognition accuracy: OCR for printed text is mature; UI-element recognition across varied third‑party apps is still uneven. Expect inconsistent results in less‑common apps or highly customized enterprise UIs.
- Full-document context: For Office apps like Word, Excel, and PowerPoint, Copilot Vision has previously claimed the ability to reason about full-document context when those files are shared. This capability can be powerful but raises questions about how far “full context” extends and whether hidden data (comments, metadata) is considered. Validate behavior in your environment before using Copilot for sensitive documents.
- Latency and bandwidth: Because heavy reasoning still often runs in the cloud, Vision answers will vary in speed depending on network conditions. Machines flagged as Copilot+ with dedicated NPUs may get lower-latency experiences.
Cross-referencing and verification — what we checked
To verify the update and contextual claims, reporting and Microsoft’s own Insider posts were reviewed. Microsoft’s Windows Insider blog has repeatedly described staged Copilot Vision features arriving to Insiders via the Microsoft Store (examples from April, May, July, and October Insider posts documenting Vision, Highlights, desktop sharing, and settings integration). Independent press outlets (The Verge, Windows Central, and other major technology outlets) have also covered Copilot’s steady evolution — including document export, connectors to Gmail/Outlook, and the expansion of Vision and voice features — corroborating the staged rollout model and the direction of these features.

Important verification notes:
- The announcement text Microsoft distributed to Insiders (the material describing the text Vision flow and the glasses-icon interaction) was included in the preview materials we reviewed.
- The specific package number cited in the preview (1.25103.107 and higher) appears in the Insider message, but independent tracking of Copilot package versions across the Microsoft Store and third‑party telemetry did not show a widely-corroborated public listing for that exact version string at the time of this article. Treat the exact package number as the vendor-supplied minimum in the announcement; Insiders should confirm the Copilot package version on their devices after the Microsoft Store update appears. This level of version verification typically requires direct access to Microsoft Store release metadata or to an Insider device showing the installed package.
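As a quick local check, the installed package version can be read with PowerShell’s Get-AppxPackage cmdlet and compared against the minimum Microsoft cited. The sketch below is a minimal Python example under stated assumptions: it runs on a Windows machine, and the *Copilot* wildcard for the Appx package name is an assumption that may match differently across channels.

```python
# Minimal sketch: read the installed Copilot package version via PowerShell's
# Get-AppxPackage and compare it to the minimum cited in the Insider post.
# Assumption: the "*Copilot*" package-name wildcard matches the Copilot app.
import subprocess

CITED_MINIMUM = (1, 25103, 107)  # vendor-supplied minimum from the announcement

def installed_copilot_version() -> tuple[int, ...] | None:
    """Return the first matching Copilot package version, or None if absent."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command",
         "(Get-AppxPackage *Copilot* | Select-Object -First 1).Version"],
        capture_output=True, text=True,
    )
    raw = result.stdout.strip()
    if not raw:
        return None
    return tuple(int(part) for part in raw.split("."))

if __name__ == "__main__":
    version = installed_copilot_version()
    if version is None:
        print("No Copilot package found; the staged update may not have arrived yet.")
    elif version[:3] >= CITED_MINIMUM:
        print("Installed", ".".join(map(str, version)), "meets the cited minimum.")
    else:
        print("Installed", ".".join(map(str, version)), "is below the cited minimum.")
```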
Benefits and strengths
- Multimodal parity: Offering a text path for Vision brings parity between audio and typed inputs. That’s important for accessibility, shared working environments, and users who prefer typed interaction.
- Lower friction for adoption: Typing removes social friction, increasing the likelihood Insiders will experiment with Vision in environments where voice would be disruptive.
- Smooth modality switching: The ability to jump from typed conversation to voice mid-session without losing context is a thoughtful, modern UX pattern that enables richer workflows.
- Session-level controls: The permissioned, session-bound model is sensible: users explicitly select windows to share and must stop the session when finished.
- Integration continuity: Text Vision preserves integration with other Copilot features like export to Office and connectors (once enabled), making it a useful step in actual productivity workflows rather than a toy.
Risks and open questions
- Data residency and retention: Where exactly visual content and typed queries are processed and stored (region, retention period, backups) matters for regulated industries. The preview materials do not provide enterprise-grade retention guarantees; organizations should validate via contractual controls.
- Accidental exposure: Sharing an entire desktop or the wrong window can leak sensitive data. The glow indicator is helpful but not foolproof; user training and policy controls remain necessary.
- Third-party UI fidelity: Copilot’s ability to parse and act on non‑Microsoft UI elements will vary; automation and agentic features must be tested against critical apps.
- Version-specific entitlements: Because Copilot features are delivered via package updates and gated server flags, enterprises may find inconsistent availability across fleets — important to consider for pilot programs.
- Opaque telemetry: Microsoft references internal telemetry for engagement claims (for example, higher engagement for voice users). External researchers should evaluate real-world adoption and false-positive trigger rates for wake‑word and Vision sessions.
Practical guidance for Insiders and IT pilots
- Confirm your Copilot app package version after the Microsoft Store update; check the Copilot app About page and Windows Update/Store history. If you see package 1.25103.107 or higher (as Microsoft referenced in the preview), the text Vision features should be rolling out to your channel. If you do not see the update, be patient — Microsoft stages these releases. For scripted checks, see the sketch in the verification section above and the fleet sketch after this list.
- Use a test profile or lab machine to validate behavior before enabling Copilot Vision broadly — especially testing:
- What content is included in exports and whether metadata is preserved.
- Where exported files are saved (OneDrive vs local).
- How conversation history and transcripts are stored and how to delete them.
- Lock down Copilot features via policy on shared endpoints:
- Disable wake‑word listening where shared PCs are common.
- Restrict Vision sharing on terminals that handle regulated data until governance is in place.
- Train users on safe visual sharing:
- Always confirm the correct window is selected.
- Close unrelated documents that might contain sensitive information.
- Prefer app-scoped sharing over full-desktop sharing when possible.
- Collect feedback and telemetry during pilot:
- Use the Copilot app’s in-product feedback flow (profile → Give feedback) to report behavior or gaps.
- Capture representative screenshots and logs (redacted) to share with Microsoft if you encounter inaccuracies or privacy questions.
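For larger pilots, the same check can be scripted across test machines. The following is a hypothetical sketch, assuming PowerShell remoting (WinRM) is enabled on the pilot fleet and the account running it has rights on the targets; the hostnames and the *Copilot* package wildcard are placeholders to adapt.

```python
# Hypothetical sketch for pilot fleets: record each test machine's installed
# Copilot package version to a CSV inventory. Assumptions: WinRM/PowerShell
# remoting is enabled, the caller has admin rights on the targets, and the
# hostnames below are placeholders.
import csv
import subprocess

PILOT_MACHINES = ["pilot-pc-01", "pilot-pc-02"]  # placeholder hostnames

def remote_copilot_version(host: str) -> str:
    """Query one machine for its installed Copilot package version."""
    command = (
        f"Invoke-Command -ComputerName {host} -ScriptBlock "
        "{ (Get-AppxPackage *Copilot* | Select-Object -First 1).Version }"
    )
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True,
    )
    return result.stdout.strip() or "not installed / unreachable"

if __name__ == "__main__":
    with open("copilot_pilot_inventory.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["machine", "copilot_version"])
        for host in PILOT_MACHINES:
            version = remote_copilot_version(host)
            writer.writerow([host, version])
            print(host + ": " + version)
```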
The bigger picture — product strategy and market implications
Text-in Vision is not simply a convenience feature; it signals Microsoft’s intent to make Copilot a genuinely multimodal interface for Windows workflows. By lowering the friction to adopt Vision (typing instead of speaking), Microsoft broadens the potential user base and reduces barriers for enterprise pilots.

At the platform level, the move also ties back to Copilot+ hardware: Microsoft differentiates experiences between cloud-first devices and Copilot+ PCs with NPUs capable of local inference. Over time, richer local models could enable faster, privacy-friendlier Vision experiences, but today most heavy reasoning still goes to the cloud for typical devices. Beyond Windows, the trend reflects the industry’s push to normalize typed and voice interfaces with visual context, marrying the best of search, OCR, and modern LLM interfaces.
Final assessment
The arrival of text-in/text-out for Copilot Vision in the Windows Insider channel is a pragmatic, user-focused improvement that expands where and how people can use on‑screen AI assistance. It improves accessibility, reduces social friction, and creates a more flexible multimodal assistant. The staged rollout and explicit session permissions are positive design choices. At the same time, important enterprise concerns — retention policies, data residency, and accidental exposure — remain unresolved in the preview materials and require careful piloting and contractual review.

Insiders should try the new mode to evaluate real-world reliability and user workflows, IT teams should pilot under test conditions and validate compliance boundaries, and Microsoft will need to accelerate transparency around data flows and enterprise configuration options before text Vision moves from Insider preview to general availability.
For now, the addition is a sensible step toward a Windows where typing, talking, and showing are equally powerful ways to get help — but organizations and privacy-conscious users must test and govern the feature before relying on it in production or regulated contexts.
Source: Microsoft Windows Insider Blog, “Copilot on Windows: Vision with text input begins rolling out to Windows Insiders”
