Copilot Vision Text-In, Text-Out: Windows Insider Preview Enables Typed Conversations

Microsoft has begun rolling out a staged Insider preview that finally gives Copilot Vision a typed conversation path — you can now share an app or your screen with Copilot, type questions about what it sees, and receive text replies inside the same Copilot chat window (the update is delivered via the Microsoft Store as Copilot app version 1.25103.107 and higher).

A person types at a desk while the monitor shows Copilot chat explaining text recognition.

Background

Microsoft has steadily moved Copilot from a sidebar curiosity into a system-level assistant in Windows that can listen, see, and (in controlled previews) act on user intent. The most recent wave of updates bundles three interlocking ideas: Voice (an opt‑in wake word and persistent voice sessions), Vision (permissioned screen or camera sharing with OCR and UI awareness), and Actions (experimental, permissioned agent workflows). Many of those capabilities first appear in Windows Insider previews and in incremental Copilot app package updates distributed through the Microsoft Store.
Copilot Vision has already supported voice-first, coached interactions where Copilot would speak the analysis aloud. The new text‑in/text‑out mode addresses scenarios where speaking aloud is inappropriate or inconvenient — meetings, public spaces, or simply when users prefer typing. Microsoft describes Vision sessions as session‑bound and permissioned: Copilot only sees what you explicitly share for that session.

What Microsoft shipped in this update

The headline: Text-in, text-out for Copilot Vision

  • Text-in / text-out Vision: You can now start a Vision session where you type questions about a shared app or screen, and Copilot replies in text in the same conversation pane. This transitions Vision from a voice-first experiment into a fully multimodal capability with typed input as a first-class citizen.
  • How to start: Open the Copilot composer, click the glasses icon, toggle off Start with voice, and pick the app or screen to share. The chosen window shows a visible glow to indicate it is being shared. Type a question and Copilot replies in text. Press the microphone button to switch the session to voice Vision mid‑conversation. Stop sharing by pressing Stop or the X in the composer.
  • Package and rollout: The feature is included in Copilot app package 1.25103.107 and higher, and Microsoft is staging the rollout to Windows Insider channels via the Microsoft Store, so availability will vary by channel and device. A quick way to check your installed version is sketched below.
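For testers who want to confirm the build before hunting for the glasses icon, the installed package version can be read from a script. A minimal sketch follows, assuming the Store package is named Microsoft.Copilot (the package name is not confirmed in the announcement) and that PowerShell is on the PATH:

```python
# Minimal sketch: compare the installed Copilot app version against the
# 1.25103.107 floor cited in the announcement. The package name
# "Microsoft.Copilot" is an assumption; verify it on your own machine.
import subprocess

REQUIRED = (1, 25103, 107)

def installed_copilot_version():
    """Return the installed version as a tuple of ints, or None if absent."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command",
         "(Get-AppxPackage -Name 'Microsoft.Copilot').Version"],
        capture_output=True, text=True,
    )
    version_text = result.stdout.strip()
    if not version_text:
        return None  # package not installed, or the name assumption is wrong
    return tuple(int(part) for part in version_text.split("."))

version = installed_copilot_version()
if version is None:
    print("Copilot app package not found.")
elif version >= REQUIRED:
    print("Build meets the preview floor; availability still depends on staged flags.")
else:
    print("Older build; check the Microsoft Store for updates.")
```

Even on an eligible build, the server-side gating described above means the typed Vision flow may not appear immediately.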

Practical limitations in this preview

  • Visual Highlights are limited: Some previously demonstrated capabilities — notably the visual Highlights that point to UI elements on the screen — are not supported in this early text Vision release. Microsoft says it is continuing to iterate on how visual cues integrate with text-based conversations.
  • Staged availability: Because Microsoft is rolling this out gradually, not every Insider will see the update immediately; server-side feature flags and regional gating may apply.

How it works — step‑by‑step user flow

  • Open the Copilot app or the Copilot composer in Windows.
  • Click the glasses icon in the composer.
  • Toggle off Start with voice to enter the text Vision path.
  • Select the app window or screen you want to share; the selection will glow to confirm sharing.
  • Type your questions in the chat composer; Copilot analyzes the shared content (OCR, UI parsing, contextual reasoning) and replies in text. A toy sketch of the OCR step appears after this flow.
  • If you prefer voice at any point, press the mic icon to transition the existing session into voice Vision without losing the conversation context.
This flow is intentionally simple: the UI signals what’s being shared, keeps sharing scoped to an explicit user action, and supports fluid transitions between typed and spoken conversation modalities.
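To make the OCR step above more concrete, here is a toy illustration of screen OCR. It is not Copilot’s implementation; it assumes the third-party Pillow and pytesseract packages plus a local Tesseract install, and only demonstrates the class of processing a Vision session performs on shared pixels:

```python
# Toy illustration of screen OCR (not Copilot's internals): capture a
# region of the screen and extract its text. Assumes the third-party
# packages Pillow and pytesseract, plus a local Tesseract installation.
from PIL import ImageGrab
import pytesseract

# Capture the top-left 800x600 region of the primary display.
screenshot = ImageGrab.grab(bbox=(0, 0, 800, 600))

# Raw text extraction; a real assistant layers UI parsing and
# contextual reasoning on top of output like this.
extracted_text = pytesseract.image_to_string(screenshot)
print(extracted_text)
```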

Why this matters: practical benefits

Accessibility and etiquette

  • Accessibility parity: Text input ensures that users with hearing or speech limitations — or those who use assistive input methods — can access Vision features without relying on voice. This brings typed access to on‑screen analysis to parity with the voice experience.
  • Public/private etiquette: In shared spaces, meetings, or noisy environments, typing avoids the social friction and privacy concerns of speaking aloud. The text Vision path provides the same context-aware help without audio output.

Productivity and workflow smoothing

  • Fewer context switches: Instead of copying text from a PDF, taking screenshots, or toggling between windows, Copilot Vision lets you share a window and get immediate extraction, summaries, or UI guidance — saving time on repetitive tasks.
  • Multimodal continuity: The ability to switch mid-session between typed and voice input creates a fluid, conversation-like interaction that fits different tasks and settings. This can help with iterative problem solving (for example, refine a summary, then ask Copilot to export it to Word).

Critical analysis: strengths and opportunities

Strengths

  • Modality flexibility: Making text a first‑class input for Vision is a natural and overdue enhancement. It respects real-world variation in how people work and increases Copilot’s utility across contexts.
  • Clear permission model: The session‑bound sharing and visible glow around shared windows are good UX signals that reduce accidental exposure; explicit consent for each session is a strong default.
  • Integration surface: Copilot Vision’s ability to export extracted content to Office apps and integrate with File Explorer or the taskbar creates meaningful productivity touchpoints — not just answers but actionable artifacts.

Opportunities for Microsoft

  • Bring Highlights to text Vision: Visual pointers that indicate UI elements (Highlights) are powerful for troubleshooting and guided tasks. Restoring or reimagining Highlights within a text-centered conversation would significantly improve usability for many scenarios.
  • Enterprise controls and DLP integration: Enterprises will want precise policy controls and Data Loss Prevention (DLP) integration so shared Vision sessions cannot exfiltrate sensitive information. Tight integration with Microsoft Purview / DLP policies should be prioritized.

Risks, unknowns, and governance considerations

Privacy and audio handling

Microsoft’s overall voice design uses a small on‑device spotter that listens for the wake phrase and keeps a short transient audio buffer (reported in preview documentation to be around 10 seconds) that’s discarded unless a session starts. After activation, heavier processing generally moves to cloud services unless the device is certified as Copilot+ with local NPUs. These architectural choices reduce continuous recording risk, but they do not eliminate cloud exposure for non‑Copilot+ devices. A toy sketch of this transient-buffer pattern appears after the list below.
This design is sensible from a usability and latency point of view, but it raises real questions about:
  • Where audio is routed and retained after a session.
  • How customers can audit or delete session data.
  • Whether transient buffers can be exploited by malicious actors or apps on shared devices.
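The buffering pattern described above (hold only the last few seconds of audio, discard continuously, and hand the buffer off only when the spotter fires) is straightforward to illustrate. The following is a toy sketch of that general design, not Microsoft’s implementation; the sample rate, chunk size, and handler names are invented for illustration:

```python
# Toy sketch of a transient audio ring buffer: keep roughly the last
# 10 seconds of audio in memory, dropping older chunks automatically,
# and release the buffer to a session only when the wake word fires.
# This illustrates the general pattern, not Microsoft's implementation.
from collections import deque

SAMPLE_RATE = 16_000   # samples per second (assumed for illustration)
CHUNK_SAMPLES = 320    # samples per chunk, i.e. 20 ms at 16 kHz
BUFFER_SECONDS = 10    # the figure reported in preview notes

MAX_CHUNKS = (SAMPLE_RATE * BUFFER_SECONDS) // CHUNK_SAMPLES
ring = deque(maxlen=MAX_CHUNKS)  # old chunks fall off automatically

def start_session(audio: bytes) -> None:
    """Hypothetical downstream handler for an activated session."""
    print(f"session started with {len(audio)} buffered bytes")

def on_audio_chunk(chunk: bytes, wake_word_detected: bool) -> None:
    """Feed audio into the transient buffer; activate on the wake word."""
    ring.append(chunk)
    if wake_word_detected:
        session_audio = b"".join(ring)  # only what the buffer retained
        ring.clear()                    # nothing persists past activation
        start_session(session_audio)

# Simulated feed: silence, then a chunk that trips the spotter.
for _ in range(100):
    on_audio_chunk(b"\x00" * CHUNK_SAMPLES * 2, wake_word_detected=False)
on_audio_chunk(b"\x00" * CHUNK_SAMPLES * 2, wake_word_detected=True)
```

The point of the bounded buffer is that audio older than the window is gone before anyone asks for it, which is the property the preview documentation emphasizes.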

Model reliability and hallucination risk

Vision features combine OCR, UI parsing, and LLM reasoning. That stack can produce convincing but incorrect extractions or suggestions — hallucinations — especially when UI elements are ambiguous, text is low‑quality, or context is incomplete. Users should explicitly verify critical outputs (invoices, legal text, configuration changes) before acting on them. The product is useful, but fallibility remains a core limitation.
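One lightweight mitigation is mechanical cross-checking wherever extracted data has internal structure. As an illustration (the field names below are hypothetical, not a real Copilot output format), an extracted invoice total can be re-derived from the extracted line items before anyone acts on it:

```python
# Hedged illustration of "verify before acting": re-derive an extracted
# invoice total from the extracted line items. Field names are
# hypothetical, not a real Copilot Vision output schema.
def total_is_consistent(extraction: dict, tolerance: float = 0.01) -> bool:
    """Return True if the stated total matches the sum of line items."""
    line_sum = sum(item["amount"] for item in extraction["line_items"])
    return abs(line_sum - extraction["total"]) <= tolerance

extracted = {  # what a Vision-style extraction might look like
    "line_items": [{"amount": 120.00}, {"amount": 35.50}],
    "total": 155.50,
}

if not total_is_consistent(extracted):
    print("Mismatch: route to a human before paying or filing.")
else:
    print("Total agrees with line items; still spot-check critical fields.")
```

Checks like this will not catch every hallucination, but they cheaply flag the internally inconsistent ones.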

Surface expansion and attack vectors

Adding screen-sharing plus typed/voice input expands the assistant’s privileges and the potential attack surface:
  • Malicious apps could try to trick users into sharing sensitive windows.
  • Agentic workflows (Copilot Actions) increase risk if agents are allowed to click and act across apps without strong guardrails. Microsoft frames Actions as experimental and permissioned, but enterprise control and auditing will be essential before wide adoption.

Company-sourced claims that need independent verification

Microsoft has cited internal telemetry showing higher engagement for voice interactions (roughly twice as many engagements as text in early company metrics). That figure is company-provided and should be treated as directional until independently verified by broader telemetry studies. Similarly, any specific performance claims tied to Copilot+ hardware should be validated against independent benchmarks.

Technical verification of key claims (what’s been confirmed)

  • The Copilot app update providing text-in/text-out Vision is being previewed to Windows Insiders now and is included in version 1.25103.107 and higher. This is distributed via the Microsoft Store as a staged rollout.
  • The UI flow to start text Vision uses the glasses icon in the Copilot composer, a Start with voice toggle to switch modalities, a visible glow that marks the shared window, and a mic button that transitions an ongoing text Vision session to voice. These elements are documented in the Insider announcement and confirmed by early hands-on reports.
  • The Highlights capability that visually points to UI elements is not available in this initial text Vision release; Microsoft plans to evaluate how to restore or rework those visual cues. This is explicitly called out in the preview notes.
  • Microsoft’s voice architecture uses an on‑device spotter and a transient in‑memory audio buffer (~10 seconds reported in preview notes) before forwarding audio for cloud processing; Copilot+ devices can offload more processing locally. This hybrid model is repeatedly referenced in preview documentation and technical reporting.
  • Microsoft has introduced the concept of Copilot+ PCs, a hardware tier that pairs CPU/GPU with dedicated NPUs for lower-latency and more private on-device inference; published previews and reporting suggest a practical baseline in the neighborhood of 40+ TOPS for many advanced local experiences, though the exact requirement depends on workload and vendor implementation. Treat specific TOPS thresholds as approximate pending vendor documentation.

Recommendations for Insiders and IT admins

For Windows Insiders (hands-on testers)

  • Try the flow with non-sensitive windows first: test how Copilot extracts tables, summarizes documents, and answers UI questions before using it on production materials.
  • Report feedback directly through the Copilot app using Profile → Give feedback; Microsoft is explicitly soliciting Insider feedback to refine the experience.
  • Evaluate the practical value of text Vision versus voice in your workflows — use the mic transition feature to compare modalities during the same session.

For IT administrators and security teams

  • Treat Vision sessions as a potential DLP risk: inventory which endpoints and users should have Vision enabled, and apply policies (or block settings) where sensitive content might be exposed. Integrate Copilot usage into existing DLP and monitoring workflows; a conceptual sketch of such a gate follows this list.
  • Consider disabling the wake‑word on shared or kiosk devices; keep the wake‑word opt‑in for personal devices only. Confirm microphone access and audit logs for Copilot sessions where compliance matters.
  • Pilot Copilot Actions only in controlled rings and require explicit approvals for any agentic automation that touches enterprise systems. Agents increase the need for operational logging and rollback controls.
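As a conceptual sketch of the posture in the first bullet above (no such enforcement hook exists in Copilot today, and every name below is hypothetical), an endpoint agent could gate share requests with a deny-by-keyword, allow-by-app rule:

```python
# Conceptual sketch only: Copilot exposes no such hook today. This
# illustrates the kind of allowlist/denylist policy an endpoint agent
# could apply before permitting a window into a Vision session.
BLOCKED_TITLE_KEYWORDS = ("payroll", "confidential", "patient")
ALLOWED_PROCESSES = {"notepad.exe", "msedge.exe"}

def sharing_permitted(process_name: str, window_title: str) -> bool:
    """Deny sensitive-looking titles, then allow only approved apps."""
    title = window_title.lower()
    if any(keyword in title for keyword in BLOCKED_TITLE_KEYWORDS):
        return False
    return process_name.lower() in ALLOWED_PROCESSES

print(sharing_permitted("notepad.exe", "Q3 payroll - Notepad"))   # False
print(sharing_permitted("msedge.exe", "Weather - Microsoft Edge"))  # True
```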

How Microsoft can improve the experience (priority items)

  • Restore or reimagine Highlights for text Vision so the assistant can point users to exact UI elements even when responses are delivered as text. Visual cues and inline text annotations should work together to reduce ambiguity.
  • Provide enterprise-grade policy controls to block or quarantine Copilot Vision sessions on managed endpoints, and expose comprehensive auditing (who shared what, when, and whether content was forwarded to cloud services).
  • Publish clear and granular data retention and deletion policies for Vision and voice sessions, along with exportable session logs for compliance teams. This transparency will be crucial for corporate uptake.
  • Release developer and admin documentation that clarifies Copilot+ hardware requirements and the practical performance/latency benefits on certified devices; avoid vague TOPS ranges that are hard to translate into procurement decisions.

The rollout context: staged testing and product strategy

Microsoft’s pattern for Copilot features is incremental — ship to Insiders, gather feedback and telemetry, revise the product, then broaden availability. Text Vision’s initial absence of Highlights suggests Microsoft wants to validate the typing flows and permission model before tackling more complex visual overlays. This conservative approach reduces the chance of mass regressions or privacy mistakes during a broad consumer rollout, but it also means enterprises and mainstream users should expect a phased schedule for feature parity between voice and text experiences.
Positioning Copilot as a system-level assistant and pairing it with a Copilot+ hardware tier is a strategic choice: it gives Microsoft a story to sell to OEMs and customers who value low-latency and privacy-oriented inference. However, tying the richest experiences to newer hardware risks fragmenting the experience across the Windows installed base. Expect Microsoft to balance that by keeping baseline features available via cloud processing while reserving latency‑sensitive or offline capabilities for Copilot+ devices.

Final assessment

This update is an important, pragmatic step: it recognizes that voice cannot and should not be the only way Copilot interacts with visual context. The text-in / text-out mode makes Copilot Vision usable in a far wider set of real‑world situations — from open offices and meetings to accessibility use-cases. The ability to switch modalities mid-session demonstrates thoughtful UX design and reflects a mature understanding of how people actually work.
Yet the preview also exposes the work that remains. Missing Highlights, reliance on cloud processing for many workloads, and the need for robust enterprise governance and DLP integrations are real limitations for organizations. The hybrid voice architecture and Copilot+ hardware tier show Microsoft is thinking through privacy and performance tradeoffs, but specific claims (engagement uplift, TOPS thresholds) warrant independent verification and clearer documentation.
For Windows Insiders, this is a practical feature to test: try it on non-sensitive content, compare typed and voice sessions, and send feedback through the Copilot app. For IT leaders, the right posture is cautious pragmatism: pilot in constrained groups, verify DLP and auditing coverage, and wait for clearer admin controls before broad deployment.
Microsoft’s staged preview will determine whether text Vision is a polished, reliable productivity tool or a promising but premature convenience. The initial signals are positive: the UI is explicit about consent, the feature solves a clear user need, and Microsoft is soliciting Insider feedback. What remains is measured execution — restoring missing visual aids, hardening enterprise controls, and proving reliability at scale.

Copilot Vision with typed input is rolling out to Insiders now; testers should look for Copilot app updates in the Microsoft Store and expect the features to appear gradually across Insider channels as Microsoft refines the experience.

Source: Microsoft Windows Insiders Blog, “Copilot on Windows: Vision with text input begins rolling out to Windows Insiders”
 
