Copilot Vision for Windows Insiders Adds Text-In, Text-Out Multimodal Sharing

Microsoft has begun rolling out a modest but consequential update to the Copilot app on Windows that brings a text-in / text-out path to Copilot Vision for Windows Insiders — meaning you can now share an app or screen with Copilot and type questions about what it sees, with Copilot answering in the same chat window. This change turns a Vision experience that until now was primarily voice-driven into a true multimodal path in which typed input, not only speech, is a first-class way to interact with visual context; the update is being distributed as a staged Insider preview through the Microsoft Store.

Background and overview

Microsoft has been evolving Copilot on Windows from a conversational sidebar into a system-level assistant that can listen, see, and (in experiments) act. Over the last year the Copilot app received a native XAML front end, voice wake‑word support, cross‑app connectors, and a growing set of Vision features that let Copilot inspect windows, OCR text, and export results into Office files. The new text-in/text-out Vision rollout is the logical next step: it acknowledges that voice is not always the preferred modality and gives Insiders the ability to type about shared screen content instead.
Key points Microsoft surfaced in the Insider announcement:
  • Text-in / text-out Vision: A typed conversation mode for Vision sessions so users can type questions about a shared app or screen and receive text replies in the same chat window.
  • How to start: Click the glasses icon in Copilot’s composer, toggle off the “Start with voice” setting, and select the app or screen to share; a visible glow will indicate what is being shared.
  • Modal switch: Press the microphone icon during the session to transition from text Vision to voice Vision, and resume the conversation by speaking.
  • Limitations in this preview: Some previously demoed capabilities — notably the “Highlights” visual pointers that draw overlays to show UI elements — are not available in this early text Vision release.
  • Rollout: Microsoft is delivering the update as a staged Insider preview via the Microsoft Store; the blog message referenced package version 1.25103.107 and higher as the minimum for this release, although staged distribution means availability will vary by channel and device.
This is a continuation of the staged approach Microsoft has used for Copilot features: new capabilities frequently land in Windows Insider flights and in the Copilot app package updates, then broaden to mainstream Windows 11 users after feedback and telemetry collection. The same pattern has been used for highlights, two‑app Vision sharing, and the “Hey, Copilot” wake‑word rollout.

What changed — a practical breakdown​

What you can do now (text Vision capabilities)​

  • Share any app window or your desktop with Copilot and type questions about the shared content.
  • Receive text answers directly in the Copilot chat pane without switching to voice.
  • Switch mid-session to voice by pressing the mic icon; Copilot will continue the same conversation using voice.

What remains unchanged or limited in this preview​

  • Visual Highlights are limited: the capability that visually points to UI elements (the “Highlights” overlays) is not supported with text-in Vision in this release; Microsoft says it’s evaluating how to bring visual cues into the text flow. Treat this as a usability limitation for tasks that depend on clear visual pointers.
  • Session scope and permission model: Vision remains session‑bound and permissioned — Copilot only sees what you explicitly share. This is consistent across the Vision feature set.
  • Availability: The update is staged; not all Insiders will get it immediately. Expect regional or channel gating and server-side feature flags.

How it works: UI and interaction flow​

  • Open the Copilot composer inside the Copilot app.
  • Click the glasses icon to start a Vision session.
  • Toggle off the “Start with voice” setting to use text Vision.
  • Choose the app window or screen region you want to share — the selected window will display a glow to confirm it’s being shared.
  • Type your questions in the chat composer; Copilot will analyze the shared visual contents (OCR, UI recognition, context) and reply in text in the same window.
  • If you want to switch to voice, press the mic button — the session becomes a voice Vision session without losing prior context.
  • Stop sharing by pressing Stop or the X inside the composer.
This flow is intentionally simple: users who prefer quiet or are in public environments now have a typed alternative to the earlier voice-first Vision path that coached users aloud. The modal transition (text → voice) is also an important usability touch: it creates a fluid multimodal conversation rather than siloed voice-only or type-only experiences.
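The text-to-voice hand-off described above can be thought of as a small state machine: the conversation transcript persists while only the input modality changes. The sketch below is purely illustrative of that behavior; all class and method names are hypothetical and do not reflect Microsoft's actual implementation.

```python
# Illustrative model of a Vision session whose transcript survives a
# text -> voice modality switch. Hypothetical names, not Copilot's API.

class VisionSession:
    def __init__(self, shared_window: str):
        self.shared_window = shared_window  # what the user chose to share
        self.mode = "text"                  # "Start with voice" toggled off
        self.transcript = []                # persists across mode switches

    def ask(self, question: str) -> None:
        self.transcript.append((self.mode, question))

    def press_mic(self) -> None:
        # Switching to voice keeps the prior conversation context intact.
        self.mode = "voice"

session = VisionSession("design-review.pdf")
session.ask("What font is used in the header?")
session.press_mic()                  # text -> voice, same session
session.ask("And the body text?")
assert len(session.transcript) == 2  # earlier turns were not discarded
```

The point of the model is simply that switching modality mutates the input mode, not the conversation state, which is why Copilot can "continue the same conversation using voice."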

Why this matters for users and IT​

For everyday users​

  • Privacy and etiquette: Typing Vision is a clear win for people working in shared spaces, meetings, or public areas where voice is undesirable or disruptive.
  • Accessibility: Users with hearing or speech impairments or those who cannot use voice easily now have parity in Vision capability via typed input.
  • Precision: Typed queries can be more deliberate and precise — useful for technical asks like “extract the table of expenses and convert to CSV” or “find HEX codes for the color swatches in the shared design window.”
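A typed request like "extract the table of expenses and convert to CSV" will usually come back as text in the chat pane, often as a pipe-delimited Markdown table. Assuming that output shape (which may vary), a short post-processing sketch could turn the reply into CSV for a spreadsheet:

```python
import csv
import io

def markdown_table_to_csv(md: str) -> str:
    """Convert a pipe-delimited Markdown table, as a chat reply might
    contain, into CSV. Assumes that reply shape; real output may vary."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip().strip("|")
        if set(line.replace("|", "").strip()) <= set("-: "):
            continue  # skip the header separator row, e.g. ---|---
        rows.append([cell.strip() for cell in line.split("|")])
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

reply = """
| Item   | Cost |
|--------|------|
| Hotel  | 120  |
| Flight | 300  |
"""
print(markdown_table_to_csv(reply))
```

This kind of copy-paste-and-convert step is exactly where typed answers beat spoken ones: the reply is already machine-readable text.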

For IT admins and enterprises​

  • Adoption path: Text Vision lowers a barrier to experimentation — it’s easier to pilot with staff who prefer typed interaction and reduces the risks around accidental audio capture.
  • Policy and governance: The permissioned, session-bound nature of Vision is a positive default, but enterprise deployments should still audit Copilot settings, microphone permissions, and the Copilot app distribution channels. Administrators should confirm how conversation history, transcripts, and shared content are stored under corporate policy. Independent reporting shows Microsoft retains conversation history unless users explicitly delete it, and some transcripts may persist in account-level logs — this deserves review in sensitive environments.

Privacy, security, and data flow — what to watch​

Copilot Vision’s design choices attempt to balance utility with user control, but the real-world privacy implications depend on defaults and implementation details:
  • Session-bound sharing and consent: Vision requires you to choose what to share; it does not run continuously. That is the clearest privacy safeguard.
  • Local spotting vs cloud processing: For voice features, Microsoft uses a local wake‑word spotter with a short, transient in‑memory buffer that is not written to disk; heavier speech transcription and reasoning typically execute in the cloud unless you have Copilot+ hardware that runs richer models locally. For text Vision, the reasoning step will usually go to Microsoft’s cloud LLMs. Organizations should verify whether content is processed in region and review contractual terms for data residency and retention.
  • Persistence of artifacts: Copilot’s conversation history, exported files, and any transcriptions may persist in the user’s account or on-device storage depending on settings. Users should make conscious choices about Conversation History and where Copilot saves exported artifacts (OneDrive vs local). Enterprises need to map Copilot’s data flows to governance frameworks.
  • Accidental sharing risk: The visual glow that appears around a shared window reduces the risk of accidental exposure, but screen sharing mistakes remain possible. Treat Vision like any screen-share tool: verify the correct window, close confidential content in other apps, and use policies that restrict Copilot features on shared or publicly accessible endpoints.
Caveat: Microsoft’s public documentation and preview notes emphasize opt‑in behavior and session-scoped access, but independent audits of cloud retention policies and exact telemetry collection practices remain limited. Administrators should treat vendor claims as the starting point and require contractual evidence for higher-risk deployments.
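The "short, transient in-memory buffer" pattern described above for local wake-word spotting can be illustrated with a fixed-capacity ring buffer: new audio overwrites the oldest samples and nothing is ever flushed to disk. This is a generic sketch of the pattern, not Microsoft's code.

```python
from collections import deque

class TransientAudioBuffer:
    """Bounded in-memory buffer: the oldest samples are overwritten as new
    ones arrive, and nothing is written to disk. Generic illustration of a
    local wake-word spotting buffer, not Microsoft's implementation."""

    def __init__(self, max_samples: int):
        self._buf = deque(maxlen=max_samples)  # bounded; oldest drops first

    def push(self, sample: int) -> None:
        self._buf.append(sample)

    def snapshot(self) -> list:
        # Only a wake-word match would hand this window onward for
        # heavier (typically cloud-side) transcription and reasoning.
        return list(self._buf)

buf = TransientAudioBuffer(max_samples=4)
for s in range(10):
    buf.push(s)
print(buf.snapshot())  # only the most recent 4 samples remain
```

The design choice a bounded buffer encodes is the privacy property the vendor claims: audio older than the window cannot be recovered because it was never retained.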

Reliability, edge cases, and known limitations​

  • Highlights not present in text Vision: If your intended task relies on visual pointers that literally point on-screen to controls and fields, the current text Vision preview will not meet that need — the visual Highlights capability remains voice‑oriented for this release. Microsoft said it’s exploring how to integrate highlights with text Vision in future updates.
  • OCR and UI recognition accuracy: OCR has long matured for printed text; UI element recognition across varied third‑party apps is still uneven. Expect inconsistent results in less‑common apps or highly customized enterprise UIs.
  • Full-document context: For Office apps like Word, Excel and PowerPoint, Copilot Vision has previously claimed the ability to reason about full-document context when those files are shared. This capability can be powerful but raises questions about how far “full context” extends and whether hidden data (comments, metadata) is considered. Validate behavior in your environment before using Copilot for sensitive documents.
  • Latency and bandwidth: Because heavy reasoning still often runs in the cloud, Vision answers will vary in speed depending on network conditions. Machines flagged as Copilot+ with dedicated NPUs may get lower-latency experiences.

Cross-referencing and verification — what we checked​

To verify the update and contextual claims, reporting and Microsoft’s own Insider posts were reviewed. Microsoft’s Windows Insider blog has repeatedly described staged Copilot Vision features arriving to Insiders via the Microsoft Store (examples from April, May, July and October Insider posts documenting Vision, highlights, desktop sharing, and settings integration). Independent press outlets (The Verge, Windows Central, and major technology outlets) have also covered Copilot’s steady evolution — including document export, connectors to Gmail/Outlook, and the expansion of Vision and voice features — corroborating the staged rollout model and the direction of these features.
Important verification notes:
  • The Insider blog’s announcement text describing the text Vision flow and the glasses-icon interaction was included in the preview materials we reviewed.
  • The specific package number cited in the preview (1.25103.107 and higher) appears in the Insider message, but independent tracking of Copilot package versions across the Microsoft Store and third‑party telemetry did not show a widely corroborated public listing for that exact version string at the time of this article. Treat the number as the vendor-supplied minimum from the announcement; Insiders should confirm the installed Copilot package version on their devices once the Microsoft Store update appears, since this level of verification typically requires direct access to Microsoft Store release metadata or to an Insider device showing the installed package.

Benefits and strengths​

  • Multimodal parity: Offering a text path for Vision brings parity between audio and typed inputs. That’s important for accessibility, shared working environments, and users who prefer typed interaction.
  • Lower friction for adoption: Typing removes social friction, increasing the likelihood Insiders will experiment with Vision in environments where voice would be disruptive.
  • Smooth modality switching: The ability to jump from typed conversation to voice mid-session without losing context is a thoughtful, modern UX pattern that enables richer workflows.
  • Session-level controls: The permissioned, session-bound model is sensible: users explicitly select windows to share and must stop the session when finished.
  • Integration continuity: Text Vision preserves integration with other Copilot features like export to Office and connectors (once enabled), making it a useful step in actual productivity workflows rather than a toy.

Risks and open questions​

  • Data residency and retention: Where exactly visual content and typed queries are processed and stored (region, retention period, backups) matters for regulated industries. The preview materials do not provide enterprise-grade retention guarantees; organizations should validate via contractual controls.
  • Accidental exposure: Sharing an entire desktop or the wrong window can leak sensitive data. The glow indicator is helpful but not foolproof; user training and policy controls remain necessary.
  • Third-party UI fidelity: Copilot’s ability to parse and act on non‑Microsoft UI elements will vary; automation and agentic features must be tested against critical apps.
  • Version-specific entitlements: Because Copilot features are delivered via package updates and gated server flags, enterprises may find inconsistent availability across fleets — important to consider for pilot programs.
  • Opaque telemetry: Microsoft references internal telemetry for engagement claims (for example, higher engagement for voice users). External researchers should evaluate real-world adoption and false-positive trigger rates for wake‑word and Vision sessions.

Practical guidance for Insiders and IT pilots​

  • Confirm your Copilot app package version after the Microsoft Store update; check the Copilot app About page and Windows Update/Store history. If you see package 1.25103.107 or higher (as Microsoft referenced in the preview), the text Vision features should be rolling out to your channel. If you do not see the update, be patient — Microsoft stages these releases.
  • Use a test profile or lab machine to validate behavior before enabling Copilot Vision broadly — especially testing:
      • What content is included in exports and whether metadata is preserved.
      • Where exported files are saved (OneDrive vs local).
      • How conversation history and transcripts are stored and how to delete them.
  • Lock down Copilot features via policy on shared endpoints:
      • Disable wake‑word listening where shared PCs are common.
      • Restrict Vision sharing on terminals that handle regulated data until governance is in place.
  • Train users on safe visual sharing:
      • Always confirm the correct window is selected.
      • Close unrelated documents that might contain sensitive information.
      • Prefer app-scoped sharing over full-desktop sharing when possible.
  • Collect feedback and telemetry during pilot:
      • Use the Copilot app’s in-product feedback flow (profile → Give feedback) to report behavior or gaps.
      • Capture representative screenshots and logs (redacted) to share with Microsoft if you encounter inaccuracies or privacy questions.
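When confirming the package version, compare it numerically rather than lexically: a naive string comparison would rank "1.9" above "1.25103.107". A minimal sketch, assuming dotted-integer version strings like the one Microsoft cited (a dedicated version library would also work):

```python
def meets_minimum(installed: str, minimum: str = "1.25103.107") -> bool:
    """Compare dotted-integer version strings numerically, not lexically."""
    def parse(v: str) -> list:
        return [int(part) for part in v.split(".")]

    a, b = parse(installed), parse(minimum)
    # Pad the shorter list with zeros so "1.25103" compares as "1.25103.0".
    width = max(len(a), len(b))
    a += [0] * (width - len(a))
    b += [0] * (width - len(b))
    return a >= b

print(meets_minimum("1.25103.107"))  # True: exactly the cited minimum
print(meets_minimum("1.9.0"))        # False: 9 < 25103 numerically
```

The version string itself comes from the Copilot app's About page or the Microsoft Store update history; the function only performs the comparison.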

The bigger picture — product strategy and market implications​

Text-in Vision is not simply a convenience feature; it signals Microsoft’s intent to make Copilot a genuinely multimodal interface for Windows workflows. By lowering the friction to adopt Vision (typing instead of speaking) Microsoft broadens the potential user base and reduces barriers for enterprise pilots.
At the platform level, the move also ties back to Copilot+ hardware: Microsoft differentiates experiences between cloud-first devices and Copilot+ PCs with NPUs capable of local inference. Over time, richer local models could enable faster, privacy-friendlier Vision experiences, but today most heavy reasoning still goes to the cloud for typical devices. Beyond Windows, the trend reflects the industry’s push to normalize typed and voice interfaces with visual context, marrying the best of search, OCR, and modern LLM interfaces.

Final assessment​

The arrival of text-in/text-out for Copilot Vision in the Windows Insider channel is a pragmatic, user-focused improvement that expands where and how people can use on‑screen AI assistance. It improves accessibility, reduces social friction, and creates a more flexible multimodal assistant. The staged rollout and explicit session permissions are positive design choices. At the same time, important enterprise concerns — retention policies, data residency, and accidental exposure — remain unresolved in the preview materials and require careful piloting and contractual review.
Insiders should try the new mode to evaluate real-world reliability and user workflows, IT teams should pilot under test conditions and validate compliance boundaries, and Microsoft will need to accelerate transparency around data flows and enterprise configuration options before text Vision moves from Insider preview to general availability.
For now, the addition is a sensible step toward a Windows where typing, talking, and showing are equally powerful ways to get help — but organizations and privacy-conscious users must test and govern the feature before relying on it in production or regulated contexts.


Source: Microsoft Windows Insider Blog, “Copilot on Windows: Vision with text input begins rolling out to Windows Insiders”
 

Microsoft has quietly expanded Copilot Vision on Windows to accept typed input as well as voice, letting Windows Insiders share an app or their desktop with Copilot and ask questions by typing while receiving responses in text inside the same Copilot chat window.

Background

Microsoft’s Copilot effort has steadily moved beyond a sidebar chatbot into a system-level assistant that can listen, see, and — in guarded previews — act. Copilot Vision debuted earlier in Insider channels as a voice-first capability for analyzing app windows and documents, offering OCR, guidance, and visual highlights. The new text-in/text-out addition makes Vision truly multimodal: typing becomes a first-class way to interact with the visual context Copilot receives.
This change is being delivered as a staged preview to Windows Insiders via the Microsoft Store in the Copilot app package noted by Microsoft (version 1.25103.107 and higher), so availability will vary by channel and device while Microsoft collects feedback.

What’s new: Vision with text‑in, text‑out​

  • Typed conversations for Vision: You can now start a Vision session and compose typed prompts about the content of a shared app, window, or desktop. Copilot replies in text within the same conversation pane rather than speaking aloud.
  • Modality switching: A single session can switch modes. Pressing the microphone icon converts a text Vision session into a voice session and preserves conversational context.
  • Simple UX flow: Start Copilot, click the glasses icon in the composer, toggle off the “Start with voice” option, select the app or screen to share (the selected window shows a visible glow), then type your questions. Stop sharing via Stop or X in the composer.
  • Current limitations: The initial text Vision preview does not include the Highlights overlays that visually point out UI elements, a feature Microsoft introduced earlier for voice Vision. Microsoft says it is iterating on how visual cues should integrate with typed conversations.

Step‑by‑step: how to try text Vision today (Insider preview)​

  • Update the Copilot app from the Microsoft Store and confirm the Copilot app version is at or above 1.25103.107 (if available on your device).
  • Open the Copilot composer (Copilot app or taskbar Quick view).
  • Click the glasses icon to start a Vision session.
  • Toggle Start with voice off to enable text-in/text-out.
  • Select an app window or your desktop; confirm the visual glow around the shared area.
  • Type a question in the composer; Copilot will analyze the shared content (OCR, UI parsing, context) and respond in text.
  • To resume spoken interaction, press the microphone icon and continue talking; the session will carry context forward.
  • End sharing by pressing Stop or X in the composer.

Technical and rollout verification​

Microsoft’s Windows Insider blog post announcing the feature is explicit about the UX flow, the package version minimum, and the staged rollout across Insider channels. The same details are reflected in other Copilot release notes and reputable reporting that covered the October preview wave of Copilot updates. These independent sources corroborate the package version and feature behavior described above.
A few specifics to verify before wide adoption:
  • The listed Copilot app package (1.25103.107) is the version Microsoft referenced for this staged preview; Insiders should check the Copilot app’s About panel or Microsoft Store history to confirm exact package numbers on their device. This is the authoritative way to confirm whether the staged update has reached a particular PC.
  • The rollout is server‑side gated and regional/channel‑dependent; not all Insiders will see the feature immediately. Treat absence of the option as expected behavior during a staged preview rather than a device failure.
  • Microsoft warns that Highlights — the visual overlays that point to UI elements — are not currently supported in the typed path; that remains an intentional limitation of this preview. If your use case depends on precise visual pointers rather than text descriptions, expect a feature gap for now.
If any claim (for instance, precise telemetry numbers or enterprise retention guarantees) is mentioned in third‑party writeups but not documented by Microsoft, treat those numbers cautiously and ask for confirmation from official release notes or Microsoft Support before relying on them in procurement or compliance decisions.

Why this matters — user benefits and real‑world scenarios​

  • Reduced social friction: Text Vision removes the need to speak aloud, making Vision usable in quiet settings like meetings, open offices, or shared spaces. Typing is a practical alternative that broadens the contexts where Vision is helpful.
  • Accessibility parity: Users who cannot or prefer not to use voice have a full path into Vision. This is meaningful for accessibility and personal preference, and it increases overall adoption potential.
  • Multitasking-friendly: Typed interactions are easier to skim, copy, and paste into notes or ticketing systems. Receiving text answers in the chat window simplifies follow-up actions like exporting into Office files.
  • Seamless modality switching: The ability to pivot from typed queries to spoken interaction mid-session is a modern UX pattern that supports fluid workflows — start typing at your desk, then switch to voice while you walk.
Example scenarios where text Vision helps immediately:
  • Summarizing a long PDF that’s open in a browser tab while in a quiet coworking space.
  • Extracting a table from an image or a screenshot and asking Copilot to convert it into a spreadsheet fragment.
  • Step‑by‑step guidance for complex settings screens where voice would be disruptive.

Enterprise implications: governance, privacy, and deployment​

This feature is promising for productivity but raises several governance questions that IT teams must address before broad enablement.
  • Permission model and session scope: Copilot Vision is session‑bound and requires explicit user selection of windows or desktop regions to share. That design limits accidental continuous capture, but it does not eliminate the risk of sharing the wrong window or sensitive content. Training and policy remain essential.
  • Data residency and retention: Microsoft’s preview posts describe session behavior but do not provide enterprise-grade retention or legal guarantees in the blog announcement itself. Organizations with regulatory obligations should validate data routing, retention, and deletion policies through contractual channels or Microsoft’s commercial documentation before allowing Vision on regulated endpoints. Treat claims about local processing or retention as implementation details that must be verified.
  • DLP and endpoint controls: Until Copilot Vision is covered explicitly by DLP (data loss prevention) or Intune policies in an organization’s management plane, endpoints may require additional configuration: restrict Vision use on devices handling regulated information, or enforce usage in monitored pilot groups only. Microsoft has historically shipped Copilot features via staged Store updates and server flags; this fragmentation affects enterprise planning and support.
  • Auditability: Administrators should ask how Vision sessions are logged, what conversation metadata is retained, and how transcripts can be exported or deleted. Without strong auditing hooks, adoption in sensitive environments will be risky. If Microsoft hasn’t documented audit and retention controls for this preview, flag that as a deployment blocker for regulated systems.
  • Heterogeneous availability: Because Copilot features can be tied to device entitlements (for example, richer on‑device models on Copilot+ PCs), expect the user experience to vary across an enterprise fleet; plan pilot groups and hardware baselines accordingly.

Risks, limitations, and what Microsoft still needs to prove​

  • Missing Highlights in the typed flow: For workflows where pointing to specific UI elements matters — e.g., guided training or troubleshooting — the lack of visual overlays in text Vision is a real limitation. Microsoft has left this capability out of the initial typed preview to validate the typing flow first; expect it to return later but do not assume parity yet.
  • Potential for accidental data exposure: Sharing a desktop or the wrong window can reveal sensitive information. The glow indicator helps, but human error is common; enterprises should treat Vision like any screen‑sharing feature and build training and policy around it.
  • Cloud dependency and latency: Many Vision capabilities rely on cloud processing unless running on hardware that supports robust on‑device inference. For latency‑sensitive or air‑gapped deployments, validate whether Copilot workloads can be configured to meet requirements. Microsoft differentiates Copilot+ hardware for lower latency/local inference; this produces an experience gap across devices.
  • Opaque telemetry claims: Microsoft sometimes references internal telemetry (for example, higher voice engagement), but independent verification of such metrics is rare. Treat metrics quoted in marketing materials as directional rather than definitive until external analysis is available.
  • Feature fragmentation and entitlements: Copilot features have a history of being gated by region, channel, or license. Expect the text Vision experience to mature gradually; do not plan critical workflows on it until it reaches general availability and enterprise‑grade controls are documented.
Where claims are not fully documented in Microsoft’s blog (for example, precise retention windows or default transcript deletion behavior in enterprise tenants), those points are flagged as unverifiable in this preview and should be validated with Microsoft support or contractual terms.

Practical guidance for Insiders and IT pilots​

  • For Windows Insiders: Install or update the Copilot app via the Microsoft Store, check the app About page for package version 1.25103.107 or higher, and test text Vision on non‑sensitive content. Provide feedback through the Copilot app’s Give feedback flow to help Microsoft refine UX and privacy controls.
  • For IT teams planning pilots:
      • Start with a small, managed pilot group and restrict Vision to non‑sensitive endpoints.
      • Validate DLP coverage and audit logs for Copilot activity; if logs are lacking, delay broad enablement.
      • Confirm the Microsoft 365/Copilot licensing entitlements that affect export and connector features (some capabilities are gated by Copilot or M365 licensing).
      • Document training materials that show the correct window selection flow and highlight the risk of sharing entire desktops.
  • Feedback checklist for pilot reporting:
      • Accuracy of OCR and UI parsing across your critical apps.
      • Instances where missing Highlights reduced usability.
      • Latency and reliability across typical network conditions.
      • Any unexpected data residency or retention behaviors.
      • Compatibility issues with corporate endpoint protection or DLP.
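Pilot reports often need logs scrubbed before they leave the organization. A simple regex pass like the sketch below can catch obvious identifiers, though the patterns here are illustrative and deliberately incomplete; vetted DLP tooling should be preferred for anything regulated.

```python
import re

# Illustrative patterns only; production redaction should rely on vetted
# DLP tooling rather than ad-hoc regexes like these.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "WINPATH": re.compile(r"[A-Za-z]:\\[^\s\"']+"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with labeled placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

log = r"user jane.doe@contoso.com shared C:\Finance\Q3-forecast.xlsx"
print(redact(log))  # emails and Windows paths replaced with placeholders
```

Keeping the placeholder labels ("[EMAIL REDACTED]") rather than deleting matches outright preserves enough context for Microsoft to reproduce an issue without receiving the sensitive values themselves.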

The strategic view: why Microsoft is doing this​

Text Vision is more than a convenience tweak. By making Vision usable without voice, Microsoft lowers adoption friction and broadens Copilot’s reach into everyday productivity scenarios — an important step if Copilot is to become a default interaction layer on Windows. This aligns with Microsoft’s broader strategy to make voice, vision, and agentic actions first‑class inputs in Windows while differentiating richer, lower‑latency experiences on Copilot+ hardware. The move also positions Copilot as a bridge between ad‑hoc on‑screen assistance and structured productivity flows (exporting to Office, connectors to inboxes and drives).
That said, strategic intent does not remove operational responsibilities. Microsoft must continue to close the feature parity gap (Highlights in typed flows), provide straightforward admin controls, and publish clear enterprise documentation about data handling before corporations should enable Vision at scale.

Conclusion​

The addition of text-in, text-out to Copilot Vision is a pragmatic and overdue evolution that turns a voice‑dominant experiment into a truly multimodal Windows assistant. For end users, it unlocks Vision in quiet, shared, or accessibility‑sensitive contexts. For IT leaders, it presents clear productivity potential but also real governance and privacy challenges that require measured pilots, DLP validation, and clear auditability before enterprise rollout.
Windows Insiders can try the feature now by updating the Copilot app (watch for package 1.25103.107 and higher) and toggling Start with voice off in the glasses composer; if the option is not yet visible, Microsoft’s staged rollout means patience will likely be the only requirement. Test on non‑sensitive content, collect feedback, and track Microsoft’s subsequent updates for Highlights parity and enterprise controls before expanding deployment.

Source: Windows Report, “Copilot Vision on Windows Now Supports Text Input”
 
