Copilot Vision Adds Text Input for Windows Insiders Preview

Dark UI mockup of Copilot showing a scanned document and OCR information panels.
Microsoft has quietly extended Copilot Vision on Windows to support a typed input mode, letting you share app windows or your desktop with Copilot and type questions about what it sees — a capability now rolling out to Windows Insiders as a staged preview that broadens where and how Vision can be used.

Background / Overview​

Copilot on Windows has been evolving quickly from a chat sidebar into a system-level assistant that can listen, see, and (in controlled previews) perform multi-step actions. Microsoft introduced Vision features earlier this year that let Copilot analyze shared app windows and camera feeds to extract text (OCR), identify UI elements, and provide guided assistance (the “Highlights” feature). Those early Vision sessions were primarily voice-first: you’d click the glasses icon, share a window, and speak your question while Copilot responded aloud. Microsoft positioned Vision as a permissioned, session-bound capability — Copilot only sees what you explicitly share — and has been rolling Vision and related features through the Windows Insider program.
The text-in/text-out addition removes voice as the sole entry point for Vision sessions. That matters because many real-world situations make voice impractical: open-plan offices, meetings, video calls, or simply moments when users prefer silent interactions. Typing preserves the context benefits of Vision — that is, the assistant has a live view of an app or document — while avoiding the need for microphone input or audible responses.

What changed: Text-in / Text-out for Copilot Vision​

  • Copilot Vision now supports a typed conversation path: you can initiate a Vision session, type questions about a shared app or screen, and receive text replies in the Copilot chat pane rather than spoken responses.
  • The modal flow remains flexible: you can switch mid-session from text mode to voice mode by pressing the microphone icon, preserving conversation context when you change input preference.
  • Some visual features demonstrated previously — notably the Highlights overlays that point to UI elements — are not yet available in the initial text-mode preview. Microsoft says it’s iterating on how visual cues integrate with a typed conversation.
  • Microsoft is delivering the update as a staged preview through the Microsoft Store to Windows Insiders; availability is being gated by channel and server-side feature flags. Several community reports and Insider guidance note that the rollout will appear gradually across devices and channels.
These changes convert Vision from a voice-first experiment into a genuinely multimodal assistant where typing, speaking, and visual context are all first‑class inputs.

The specific user flow (practical summary)​

  1. Open the Copilot composer in the Copilot app or the Copilot quick view.
  2. Click the glasses icon to start a Vision session.
  3. Toggle off the “Start with Voice” (or similar) setting to choose the text-in/text-out path.
  4. Select the app window or desktop region to share — the UI highlights shared windows with a visible glow so you can confirm the selection.
  5. Type questions in the Copilot chat composer; Copilot will analyze the shared visual content (OCR, UI parsing, contextual reasoning) and reply in text.
Note: users can revert to voice at any time by pressing the mic icon; the session remains intact and Copilot continues the conversation using the selected modality.

Technical verification and rollout details​

Microsoft’s official Insider updates and Copilot release notes document the incremental rollout of Vision features (highlights, desktop share, multi-app share) through the Microsoft Store for Insider channels. The Windows Insider blog and the Copilot release notes are the primary sources for features like Highlights, Desktop Share, and the step‑by‑step guidance on starting Vision sessions.
A community report and press coverage — including aggregated Insider notes — indicate the text-in/text-out update is being shipped as a staged preview. Some outlets and community logs reference Copilot app package identifiers tied to staged releases; however, package numbers reported in third‑party summaries (for example, a referenced package number used by some Insider previews) should be treated cautiously unless confirmed in Microsoft release notes or the Microsoft Store package metadata. The staged approach means the feature may appear for some Insiders sooner than others, and Microsoft can activate or disable server-side flags during the preview.
Caveat: one specific package version number (reported by some community summaries) is not directly documented in Microsoft’s public release notes at the time of reporting; treat precise package numbers as likely indicative but subject to verification by checking the Copilot app’s About page or Microsoft Store history on the device.
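For hands-on verification, a minimal sketch (assuming Windows PowerShell is available on the test machine) can query the installed Copilot package metadata via Get-AppxPackage; the wildcard is deliberately broad because package names vary between builds:

```python
import subprocess

# Query installed Copilot-related packages and print their identifiers.
# Assumes Windows PowerShell (powershell.exe) is on PATH, as on stock Windows 11.
result = subprocess.run(
    [
        "powershell.exe", "-NoProfile", "-Command",
        "Get-AppxPackage -Name *Copilot* | Select-Object Name, Version, PackageFullName | Format-List",
    ],
    capture_output=True,
    text=True,
    check=False,
)
print(result.stdout or result.stderr)
```

Running this on each pilot machine yields the Name, Version, and PackageFullName values to compare against whatever identifiers community summaries report.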

UX implications: how text Vision feels and where it shines​

Text-in Vision improves accessibility and situational flexibility:
  • For users in shared offices, coffee shops, or meetings, text input avoids the social friction of speaking aloud.
  • For users with limited or no microphone hardware, typed input gives full access to Vision’s context-aware assistance.
  • For users who prefer readable transcripts for record-keeping, text responses are easier to copy, export, and paste into documents.
Practical benefits:
  • Faster, silent Q&A about on-screen content: extract an invoice table from a PDF and ask Copilot to convert it into Excel rows — all via typed prompts.
  • Better accessibility for users with auditory processing or attention differences who rely on screen text rather than spoken audio.
Limitations to note:
  • Visual Highlights (the overlays that physically point to UI elements) are not present in the initial text-mode release. That reduces usefulness for tasks that depend on precise, step‑by‑step pointing. Microsoft says it will iterate on integrating visual cues with a text conversation.
  • The assistant’s interpretations still rely on OCR and UI parsing, which are subject to the usual failure modes (poorly formatted scans, dynamic UIs, overlapping overlays). Users should verify critical outputs before acting on them; a small verification sketch follows this list.
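As a concrete example of that verification step, here is a minimal, illustrative Python sketch. The pipe-delimited reply text, column order, and output filename are hypothetical stand-ins for whatever Copilot actually returns in your session; the point is to reconcile line items against the stated total before anything reaches a spreadsheet:

```python
import csv
from decimal import Decimal, InvalidOperation

# Hypothetical reply text: rows Copilot might type back after reading a shared invoice PDF.
copilot_reply = """\
Widget A | 2 | 19.99 | 39.98
Widget B | 1 | 5.00  | 5.00
Total    |   |       | 44.98
"""

def parse_reply(reply):
    """Split a pipe-delimited reply into line-item rows and the stated total."""
    rows, stated_total = [], None
    for line in reply.strip().splitlines():
        cells = [c.strip() for c in line.split("|")]
        if cells[0].lower() == "total":
            stated_total = Decimal(cells[-1])
        else:
            rows.append(cells)
    return rows, stated_total

rows, stated_total = parse_reply(copilot_reply)
try:
    computed = sum(Decimal(r[-1]) for r in rows)
except InvalidOperation:
    raise SystemExit("A line-item amount did not parse as a number; re-check the OCR output.")

if stated_total is None or computed != stated_total:
    raise SystemExit(f"Totals disagree (items sum to {computed}, invoice says {stated_total}); verify before use.")

# Only write the spreadsheet-ready CSV once the numbers reconcile.
with open("invoice_rows.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["Item", "Qty", "Unit price", "Amount"])
    writer.writerows(rows)
print(f"Wrote {len(rows)} verified rows to invoice_rows.csv")
```

If a cell fails to parse or the totals disagree, the script stops, which is exactly the kind of OCR slip worth catching before the data lands in Excel.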

Privacy, data handling, and security — what IT teams must know​

Copilot Vision’s design choices try to balance utility with explicit consent and session scoping, but there are several operational and policy items IT teams should verify before broad deployment.
  • Session-bound sharing: Vision runs only after explicit user action to share a window or desktop, and the UI signals which window is being shared. This design reduces accidental, continuous capture.
  • Data retention: Microsoft’s support guidance indicates Vision captures visual inputs only while sessions are active and that session images are not logged or stored long-term in the same way chat transcripts are — though transcripts and assistant responses can be persisted in conversation history unless a user deletes them. Administrators should validate current retention semantics in their tenant and test the experience to confirm how exports and history behave.
  • Cloud vs local processing: Many Vision features rely on cloud reasoning. Microsoft has introduced a Copilot+ hardware tier with NPUs for on-device inference, but most devices will still route heavier reasoning to the cloud, raising questions about data flow and residency for regulated workloads. Validate whether your environment’s compliance posture permits cloud handling of screen content.
  • DLP and governance: Enterprises should pilot Vision with strict policies: disable or limit Vision on shared or regulated endpoints, enforce app-scoped sharing over full-desktop sharing, and ensure DLP policies or Conditional Access controls prevent accidental sharing of sensitive content. Microsoft’s rollout notes recommend lab testing and gradual expansion for production environments.
Security checklist for pilots:
  • Confirm the Copilot app and Microsoft Store package version on test machines.
  • Test what information is included in exported files and how metadata is preserved; a simple pattern-scanning sketch follows this checklist.
  • Validate deletion flows for transcripts and session data.
  • Restrict wake-word and Vision features on shared terminals via policy until governance is in place.
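For the exported-files item above, a rough pre-flight scan might look like the following sketch; the regular expressions are illustrative placeholders rather than real DLP rules, and the directory argument and `.txt` glob are assumptions about how exports are staged:

```python
import re
import sys
from pathlib import Path

# Illustrative patterns only; replace or extend them with your organisation's real DLP rules.
PATTERNS = {
    "email address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "card-like number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "SSN-like number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_export(path):
    """Report which sensitive-looking patterns appear in one exported file, without echoing the matches."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return [
        f"{path.name}: {count} x {label}"
        for label, pattern in PATTERNS.items()
        if (count := len(pattern.findall(text)))
    ]

if __name__ == "__main__":
    export_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    findings = [line for p in export_dir.glob("**/*.txt") for line in scan_export(p)]
    print("\n".join(findings) if findings else "No obvious sensitive patterns found.")
```

Pair a scan like this with a manual metadata review, since simple pattern matching misses context-dependent sensitive content.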

Enterprise adoption: practical rollout recommendations​

  1. Start with a constrained pilot: choose a cross-functional group (IT support, documentation writers, a few product teams) and enable Vision text-mode only on dedicated test machines. Monitor feedback and telemetry.
  2. Validate DLP and logging: confirm that Copilot’s sharing UI doesn’t leak incidental data and that exports go to approved storage locations (OneDrive for Business, managed network shares, etc.).
  3. Update policies and training: create a short training module on “how to share safely” and add a checklist for users before initiating a Vision session (close unrelated documents, confirm selected window).
  4. Integrate feedback loops: collect structured Insider feedback and incident reports so the organization can adapt deployment plans as Microsoft iterates on the feature.
For organizations with strict data residency or regulatory constraints, restrict Vision until Microsoft publishes clearer enterprise controls that meet compliance requirements.

UX and developer considerations​

  • Modal switching: the ability to swap from typed input to voice mid-session is a UX win. It preserves context and avoids losing work when a user needs to shift modes.
  • Highlights gap: removing Highlights from the initial text-mode preview likely reflects Microsoft’s desire to validate the typed flow and permission model before enabling overlays that require precise UI access. That conservative approach reduces the chance of regressions but delays parity between voice and text experiences.
  • Developer opportunity: app and extension developers should consider how their UIs appear in a Vision session (contrast, accessibility, semantic labels). Good UI semantics (clear labels, accessible names) will help Vision’s UI parsing and OCR produce better results; a quick spot-check sketch follows below.
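As one way to spot-check that last point, the sketch below uses the third-party pywinauto package (an assumption; any UI Automation client would do) to list controls in a running window that expose no accessible name. The window title pattern is a hypothetical placeholder:

```python
from pywinauto import Desktop  # third-party: pip install pywinauto

# Flag controls that expose no accessible name via UI Automation; unnamed controls give
# OCR/UI-parsing assistants and screen readers little to work with.
window = Desktop(backend="uia").window(title_re=".*YourApp.*")  # hypothetical title pattern
for ctrl in window.descendants():
    info = ctrl.element_info
    if not info.name:
        print(f"Unnamed {info.control_type} control at {info.rectangle}")
```

Controls that show up as unnamed here are the ones most likely to trip up both Vision’s screen parsing and assistive technologies.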

Risks, limitations, and verification notes​

  • Model hallucination and OCR errors remain real risks. Vision’s interpretations are only as good as the captured visual data; users must validate extracted tables, transcriptions, and suggested UI steps before acting on them.
  • Staged rollouts and server-side flags mean functionality and package identifiers can shift rapidly. Reported Copilot app package numbers in community summaries — while useful as indicators — should be verified locally via the Copilot app About page or Microsoft Store metadata before relying on them for troubleshooting or deployment.
  • Enterprise governance features (DLP integration, centralized admin controls) lag the consumer preview cadence. IT organizations must pilot carefully and demand clearer admin controls from Microsoft before broad adoption.

How this fits into Microsoft’s broader Copilot roadmap​

The text-in/text-out Vision expansion is part of a broader series of investments Microsoft is making across Voice, Vision, and Actions:
  • Voice: opt‑in wake-word (“Hey, Copilot”) and persistent voice sessions on unlocked PCs.
  • Vision: screen-aware assistance, Highlights, multi-window and desktop share, and now text-mode Vision for quieter environments.
  • Actions / Agents: experimental agentic features (Manus and Copilot Actions) that can execute chained tasks with explicit permissions.
Microsoft’s product strategy also includes a Copilot+ hardware tier that pairs CPUs, GPUs, and NPUs to enable more on‑device inference (the company has signaled performance guidance for NPUs measured in TOPS). That hardware differentiation is intended to reduce latency and improve privacy for latency-sensitive workloads; however, most current Copilot Vision processing still goes through cloud models on typical hardware. Organizations should weigh device refresh costs against privacy and latency requirements as they plan pilots.

Quick start checklist for Insiders and enthusiasts​

  • Confirm the Copilot app is updated; check the app’s About page and Microsoft Store history for the Copilot package installed on your device.
  • Open the Copilot composer, click the glasses icon, toggle off “Start with Voice,” select an app or desktop region, and begin typing questions to Copilot.
  • If Highlights are essential for your workflow, expect those overlays to arrive later in parity with the voice path.
  • Use non-sensitive test content for initial evaluation and file exports. Validate retention, export locations, and transcript deletion behavior.

Conclusion​

Adding typed input to Copilot Vision is a pragmatic and meaningful expansion: it makes screen‑aware AI assistance usable in quiet environments, by users without microphones, and in situations where an on-screen transcript is preferable to spoken output. The initial preview thoughtfully preserves session consent and offers fluid switching between text and voice, but it also exposes the trade-offs Microsoft faces — particularly missing visual overlays in the first text‑mode release, cloud-based processing for many devices, and a need for stronger enterprise governance.
For Windows Insiders and early adopters, the right approach is methodical: verify package versions locally, pilot with contained datasets, and focus on workflows where the typed modality clearly reduces friction. For IT and security teams, the prudent posture is cautious, governed testing: confirm DLP and retention behavior, validate exports, and withhold broad enablement until Microsoft offers more granular controls suited to regulated environments.
The text‑in/text‑out path converts Copilot Vision from a voice-first novelty into a flexible, multimodal tool that can better match how people actually work — provided Microsoft continues to close the remaining usability, privacy, and enterprise‑control gaps during the preview.

Source: Neowin, “You can now use Copilot Vision on Windows with text input”
 
