A recent hands‑on experiment that tried to replace Microsoft Copilot’s web‑page summarization with a fully local stack — Ollama running local models and the Page Assist browser sidebar — ended with a clear, practical verdict: Copilot still delivers the faster, more polished experience for everyday web summarization, while the local alternative shows huge promise but remains a work in progress. The test highlights two competing trends in the AI‑assisted browsing space: the convenience and conversational polish of cloud‑hosted assistants, and the privacy, control, and experimentation advantages of local LLM workflows.

Background / Overview​

In the last two years the AI tooling landscape has split into two complementary approaches:
  • Cloud‑first, tightly integrated assistants such as Microsoft Copilot (embedded in Edge and Windows) that leverage large, centrally hosted models plus product‑level integration to provide contextual, conversational experiences.
  • Local LLM stacks — tools like Ollama, LM Studio and browser front ends such as Page Assist — that let users run models on their own machine, use custom embeddings for retrieval, and keep data on‑device.
The Windows Central experiment plugged Page Assist (a browser sidebar and web UI for local LLMs) into an Ollama backend, added an embedding model for retrieval, and tried to replicate Copilot’s most valuable everyday feature: summarizing the web pages you’re actively reading, in a fast, actionable way. The result: Page Assist + Ollama worked, but did not match Copilot for speed, conversational quality, or contextual follow‑ups — the areas where most professionals measure productivity gains.

The stack tested: what was used and why​

Components used (names and versions treated as labels)​

  • Page Assist — a browser extension and web UI that exposes a sidebar chat which can interact with the current web page and a local model provider. It supports Ollama, LM Studio, and OpenAI‑compatible endpoints. Page Assist provides a context‑aware sidebar that can fetch page content, run OCR on images, and manage models. (pageassist.xyz, docs.pageassist.xyz)
  • Ollama — a popular local LLM manager that runs models on the host machine and exposes a local API (default localhost:11434) that Page Assist can detect and use. Ollama hosts model artifacts and provides a GUI and CLI to run/pull models.
  • nomic‑embed‑text — a retrieval embedding model recommended for RAG (retrieval‑augmented generation) workflows; Page Assist can use this via Ollama to build embeddings for page content. Nomic’s Embed Text family is specifically tuned for retrieval tasks and has local inference options and V1/V2 iterations. (ollama.com, home.nomic.ai)
  • Local LLMs — the test used a variety of models (small to large). Notably, the OpenAI gpt‑oss family (local, open‑weight releases such as gpt‑oss:20b) is now widely available through Ollama and Hugging Face; such models can be used for local chat and reasoning. Larger models gave better summaries, but also required more compute. (ollama.com, huggingface.co)
Page Assist serves as the front end that “talks” to your Ollama models, and the experiment required adding an embedding model into Ollama (nomic‑embed‑text:latest) so Page Assist could build a simple RAG index for the page content. The extension’s sidebar can be opened via the context menu or a keyboard shortcut, letting users ask questions about the open web page. Page Assist also exposes settings for managing models and will auto‑detect a local Ollama instance if it’s on the default endpoint.
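For readers curious what that auto‑detection amounts to in practice, the sketch below checks Ollama’s default local endpoint and lists the pulled models, which is the kind of call a sidebar front end can make before offering a model picker. It assumes only what is described above (Ollama serving on localhost:11434); the exact checks Page Assist performs are not documented in the article.

```python
# Minimal sketch: confirm a local Ollama instance is reachable on its default
# endpoint and list the models it has pulled. Illustrative only; this is not
# Page Assist's actual detection code.
import requests

OLLAMA = "http://localhost:11434"

def list_local_models() -> list[str]:
    """Return the names of models the local Ollama instance has available."""
    resp = requests.get(f"{OLLAMA}/api/tags", timeout=5)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

if __name__ == "__main__":
    try:
        print("Ollama is up. Models:", list_local_models())
    except requests.ConnectionError:
        print("No Ollama instance detected on", OLLAMA)
```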

How retrieval‑augmented generation (RAG) was used — short primer​

Retrieval‑augmented generation is the glue that lets smaller LLMs act like they "know" a website’s content without re‑training them. The workflow in this test was:
  • Page Assist extracts the web page content (text and images, with OCR where needed).
  • The content is chunked and sent to the embedding model (nomic‑embed‑text) running in Ollama.
  • Embeddings are stored and used to retrieve relevant chunks for any incoming user query.
  • The local LLM is given those retrieved chunks as context and asked to produce a summary.
This is a common way to make on‑device models behave like a context‑aware assistant, and it is exactly what Page Assist implements in its sidebar flow. The embedding model and the retrieval layer are essential; without them the model must rely on a small sliding window of context and will be far less reliable at summarizing long pages. (registry.ollama.com, docs.nomic.ai)
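To make the flow concrete, here is a minimal sketch of the chunk, embed, retrieve, and summarize steps, talking directly to Ollama’s local API. The chunk size, similarity metric, model names, and prompt wording are illustrative choices for the sketch, not Page Assist’s actual implementation.

```python
# Minimal chunk -> embed -> retrieve -> summarize sketch against a local Ollama
# instance. Chunking, the number of retrieved chunks, and the prompt are
# illustrative assumptions, not the article's exact configuration.
import math
import requests

OLLAMA = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"   # pulled into Ollama beforehand
CHAT_MODEL = "gpt-oss:20b"         # any sufficiently capable local model

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": EMBED_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def summarize_page(page_text: str, question: str, chunk_chars: int = 1200) -> str:
    # 1) Chunk the extracted page text.
    chunks = [page_text[i:i + chunk_chars] for i in range(0, len(page_text), chunk_chars)]
    # 2) Embed each chunk and the query.
    chunk_vecs = [embed(c) for c in chunks]
    q_vec = embed(question)
    # 3) Keep the most relevant chunks as retrieved context.
    top = sorted(zip(chunks, chunk_vecs), key=lambda cv: cosine(q_vec, cv[1]), reverse=True)[:4]
    context = "\n\n".join(c for c, _ in top)
    # 4) Ask the local LLM to answer using only the retrieved context.
    prompt = f"Using only the context below, {question}\n\nContext:\n{context}"
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": CHAT_MODEL, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

# Example: summarize_page(extracted_text, "summarize this page in five bullet points")
```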

The results — what worked and what didn’t​

What worked well with the local stack​

  • Page Assist reliably detected Ollama and allowed direct model management (pulling models, switching providers) from within the extension — a significant usability gain over manual CLI workflows. The integration makes local LLM experiments accessible to non‑terminal users. (docs.pageassist.xyz, github.com)
  • Embeddings + RAG produced coherent summaries when paired with larger local models. With the right retrieval chunks, a 20B model could distill and paraphrase article sections into compact summaries.
  • Image/vision support was present when models with vision capabilities were used; Page Assist’s sidebar can run OCR and surface image‑related responses from vision‑enabled LLMs. This allows page image analysis or extracting text from screenshots.
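As an illustration of how image‑grounded answers can be obtained locally, the sketch below sends a base64‑encoded image to a vision‑capable model through Ollama’s chat API. The model name (llava) is an assumption for the example; the article does not specify which vision model was tested or how Page Assist wires up its OCR.

```python
# Illustrative only: one way to get image-grounded answers from a local,
# vision-capable model through Ollama's chat API. The model name is an
# assumption, not a detail from the article.
import base64
import requests

OLLAMA = "http://localhost:11434"

def describe_image(path: str, question: str, model: str = "llava") -> str:
    with open(path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")
    r = requests.post(f"{OLLAMA}/api/chat", json={
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": question, "images": [img_b64]}],
    })
    r.raise_for_status()
    return r.json()["message"]["content"]

# Example: describe_image("screenshot.png", "What text appears in this image?")
```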

Where the local stack fell short​

  • Context persistence (cross‑page bleed): the sidebar kept conversation state across page changes unless the user explicitly started a new chat or told the model to forget the prior page. That meant running the same “summarize this page” prompt on multiple pages often produced repeated, stale output tied to the earlier article. Page Assist’s “temporary chat mode” mitigates some of this behavior, but it’s a UX detail that slows speed‑reading workflows (a simple per‑page context pattern is sketched after this list).
  • Quality and conversational polish: Copilot’s summaries tend to be more conversational, offering clearer breakdowns and follow‑up suggestions — small prompts the assistant surfaces to deepen the user’s angle on a story. Local model outputs were more matter‑of‑fact and sometimes missed the editorial framing that makes a quick summary immediately actionable. Microsoft’s own Copilot documentation highlights the ability to summarize pages and then continue the chat for deeper questions. (microsoft.com, support.microsoft.com)
  • Efficiency and latency: larger local models require significant computing resources. Even where a 20B model was usable, response times on consumer GPUs were noticeably slower than Copilot’s cloud‑hosted latency for the same task, especially when RAG retrieval and OCR were involved. The hardware and memory cost for large context windows can become a bottleneck. (ollama.com, cookbook.openai.com)
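The cross‑page bleed noted above comes down to conversation state that outlives the page. Below is a minimal sketch of the alternative: keep one chat history per URL so a new page always starts clean. Page Assist’s own state handling is not shown in the article, so this only illustrates the general pattern.

```python
# A minimal "fresh context per page" pattern: one message history per URL,
# with an explicit reset that mirrors starting a new chat. Illustrative, not
# Page Assist's implementation.
import requests

OLLAMA = "http://localhost:11434"

class PageChats:
    def __init__(self, model: str = "gpt-oss:20b"):
        self.model = model
        self.histories: dict[str, list[dict]] = {}   # one history per page URL

    def ask(self, url: str, page_text: str, question: str) -> str:
        # A new URL gets a fresh history seeded only with that page's content.
        history = self.histories.setdefault(url, [
            {"role": "system", "content": f"You are summarizing this page:\n{page_text}"}
        ])
        history.append({"role": "user", "content": question})
        r = requests.post(f"{OLLAMA}/api/chat",
                          json={"model": self.model, "messages": history, "stream": False})
        r.raise_for_status()
        answer = r.json()["message"]["content"]
        history.append({"role": "assistant", "content": answer})
        return answer

    def reset(self, url: str) -> None:
        """Drop state for a page, the equivalent of explicitly starting a new chat."""
        self.histories.pop(url, None)
```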

Head‑to‑head: why Copilot still wins for web summarization​

Three concrete advantages give Copilot a practical edge for the average user who needs fast, reliable page summaries:
  • Integrated, page‑aware UX: Copilot’s Edge integration is built to read the current tab, request permission to access browsing context, and then summarize immediately in a conversational pane. That integration is smooth: the assistant is part of the browser workflow rather than a peripheral sidebar that needs manual state management. Microsoft’s documentation lays out the exact workflow and permission model Copilot uses to “ground” responses in page content. (support.microsoft.com, microsoft.com)
  • Polish and follow‑ups: Copilot surfaces follow‑up questions and alternative summary angles — a small but powerful UX feature that often triggers creative or investigative directions that the user hadn’t considered. This makes Copilot not just a summarizer, but a discussion partner for fast browsing. Independent coverage and tutorials call out this conversational advantage as a primary productivity win. (tomsguide.com, lifewire.com)
  • Server‑side compute + model orchestration: Microsoft hosts models that are frequently tuned and benefit from massive scale and multi‑model orchestration. That lets Copilot produce higher fidelity summaries with lower latency than many consumer GPUs can achieve locally, particularly when you factor in retrieval, vision OCR, and long contexts.
That said, Copilot’s advantages are practical, not absolute. A well‑configured local stack can match or exceed Copilot in specific scenarios — for example, when privacy expectations or offline operation are paramount.

Strengths of the local approach (what local LLMs do well)​

  • Data sovereignty and privacy control: With everything running on‑device (models and embeddings), sensitive content never needs to leave the machine. This is a decisive advantage for regulated environments or users who explicitly need to limit third‑party telemetry.
  • Customizability and fine‑tuning: Local models can be fine‑tuned, patched, or chained with domain‑specific retrieval stores to deliver highly tailored summaries that reflect organization‑specific terminology.
  • Experimentation velocity: Tools like Ollama and Page Assist dramatically reduce the setup friction for trying new models and retrieval strategies, enabling rapid iteration for power users and researchers. Pulling models, swapping embedding backends, and running different RAG templates can all happen without cloud costs. (github.com, registry.ollama.com)
  • Offline/edge capability: When internet access is constrained, local stacks remain functional — a major plus for travel, fieldwork, or secure facilities.

Risks, limitations, and practical caveats​

  • Hardware cost and energy: High‑quality local LLMs (20B+ or vision models) require modern GPUs or NPUs and substantial RAM. Users may face trade‑offs between model size and responsiveness. Quantization and sparse MoE techniques help, but hardware remains a gating factor. (ollama.com, huggingface.co)
  • Maintenance and security complexity: Running models locally shifts operational burden to the user: model updates, dependency maintenance, and patching become personal responsibilities. Local inference stacks are not immune from misconfiguration or supply‑chain risks.
  • Context management and UX friction: As the experiment showed, chat state persistence across pages can create confusing outputs. Extensions like Page Assist add options (temporary chat modes, explicit reset), but the default behaviour can trip up fast browsing patterns. That’s a usability problem as much as a technical one.
  • Model safety and hallucination: Local models may not have the same safety fine‑tuning or post‑processing pipelines as cloud providers. For high‑stakes summarization (legal, medical, compliance), cloud‑hosted and productized solutions that include built‑in safety checks may still be preferable.
  • Feature parity and ongoing updates: Copilot and similar cloud assistants evolve continuously with product investments; local stacks depend on the open model ecosystem and community contributions. That means sudden improvements to cloud assistants can outpace local progress in both capabilities and UX polish.

Practical recommendations for readers who want to try the local approach​

  • Start small and pragmatic: Use Page Assist + Ollama with a light embedding model (nomic‑embed‑text:v1.5) and a 7B–20B model depending on hardware. This keeps downloads and RAM manageable while providing useful performance.
  • Use temporary chat mode for speed reading: Page Assist has a temporary chat mode and keyboard shortcuts to reset context — enable these to avoid cross‑page bleed when speed reading multiple articles. Be deliberate about starting a new chat for each page.
  • Profile and quantify latency: Benchmark end‑to‑end latency (extract → embed → retrieve → generate) for your chosen models and contexts; a simple timing sketch follows this list. If the round trip is too slow, either reduce model size or adjust retrieval chunking to minimize work.
  • Invest in a quality embedding model: The RAG layer is the multiplier here. Using a proven embedding like Nomic Embed improves retrieval relevance and reduces hallucination risk when summarizing long documents.
  • Combine local and cloud selectively: If privacy isn’t required for every task, use Copilot/Edge for quick browsing summaries and local LLMs for private or specialized material. Hybrid workflows often provide the best combination of speed and control.
  • Plan model upgrades and safety checks: If you rely on local outputs for anything consequential, add validation steps: cross‑reference summaries with original text, or run a second‑opinion model with different seeds or prompt templates.
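For the latency recommendation, a rough timing harness like the one below is enough to see where the round trip goes: it times the embedding and generation stages against a local Ollama instance, and the extraction and retrieval stages can be wrapped the same way. Stage boundaries and model names are illustrative assumptions.

```python
# Rough end-to-end timing sketch for a local summarization round trip.
# Swap in whatever extraction and retrieval code you actually use.
import time
import requests

OLLAMA = "http://localhost:11434"

def timed(label, fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label:>10}: {time.perf_counter() - start:6.2f}s")
    return result

def embed(text, model="nomic-embed-text"):
    r = requests.post(f"{OLLAMA}/api/embeddings", json={"model": model, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def generate(prompt, model="gpt-oss:20b"):
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

def benchmark(page_text: str):
    chunks = [page_text[i:i + 1200] for i in range(0, len(page_text), 1200)]
    timed("embed", lambda: [embed(c) for c in chunks])
    timed("generate", generate, f"Summarize the following page:\n{page_text[:4000]}")

# Example: benchmark(open("article.txt").read())
```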

Why this comparison matters for Windows users and publishers​

For many Windows users — journalists, researchers, IT pros — the daily productivity benefit is not raw model capability alone; it’s how quickly an assistant can integrate with the workflow and suggest the next question. Copilot’s edge with follow‑up prompts and deep Edge integration is materially useful when triaging dozens of pages per day. That’s why Copilot remains the practical choice for a lot of web‑reading work even as local LLM tooling continues to mature. Microsoft’s documentation and product updates emphasize this integrated, grounded page summarization approach. (microsoft.com, support.microsoft.com)
At the same time, the experiment demonstrates that local stacks are rapidly becoming viable for many tasks. Page Assist and Ollama lower the barrier to entry, and open models like gpt‑oss and Nomic Embed provide high‑quality building blocks that are now accessible to power users and small teams. The gap is narrowing — but it’s narrowing along two axes: model capability and UX polish.

Technical note: models to watch and what they change​

  • gpt‑oss:20b / gpt‑oss:120b — OpenAI’s open‑weight release introduced models designed for on‑device or single‑GPU inference with tool use and configurable reasoning levels. Their availability through Ollama and Hugging Face expands what’s possible locally, but they still require tuning and sufficient hardware to match cloud performance. (ollama.com, huggingface.co)
  • Nomic Embed (v1.5 / v2) — Advances in embedding architectures (including MoE in v2) offer better retrieval quality at smaller compute costs; these models are now production‑grade for many RAG tasks. Use them as the retrieval backbone for local summarizers. (home.nomic.ai, docs.nomic.ai)
  • Page Assist’s evolution — The extension is actively developed and adds features like OCR language selection, temporary chat mode, and integrated model pulls from Hugging Face/Ollama. These incremental platform improvements reduce friction when switching or upgrading models. (addons.mozilla.org, github.com)

Final analysis — a pragmatic verdict​

  • For speed and convenience when reading many web pages daily: Copilot remains the best choice for most users because of its browser integration, conversation‑level summaries, and follow‑up prompts.
  • For privacy, custom workflows, or experimentation: A local stack of Page Assist + Ollama + nomic‑embed + a suitably sized LLM is compelling — especially for teams that can invest in hardware and configuration time.
  • For people balancing both worlds: consider a hybrid approach: use Copilot for day‑to‑day browsing; switch to a local stack for sensitive documents or when needing tailored, retrievable knowledge bases.
The open‑model and local tool ecosystem is moving fast. In the months ahead, expect rapid iteration on the UX and context‑management issues that currently make Page Assist less "frictionless" than Copilot. Meanwhile, Microsoft’s Copilot will continue to improve as well, driven by product polish and cloud scale. For professionals juggling speed, privacy, and cost, the sensible posture is experimentation with guardrails: keep using cloud assistants where they save time and use local LLMs where control and customization matter most. (docs.pageassist.xyz, ollama.com)

Quick checklist — what to try next with a local summarizer​

  • Pull Nomic Embed into Ollama: add nomic‑embed‑text:latest and validate embeddings locally (a quick validation sketch follows this checklist). (registry.ollama.com, registry.ollama.ai)
  • Install Page Assist and enable the sidebar keyboard shortcut; enable temporary chat mode for speed reading. (docs.pageassist.xyz, addons.mozilla.org)
  • Start with a 7–20B model depending on GPU/RAM; benchmark end‑to‑end latency and quality.
  • If you need vision/OCR, test a vision‑enabled model and confirm OCR languages and accuracy.
  • Compare outputs against Copilot for several articles and note where each tool surfaces a unique insight or angle.
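For the first checklist item, a quick way to validate embeddings locally is to confirm that related sentences score noticeably higher cosine similarity than unrelated ones. The sentences and expectations below are illustrative; only the endpoint and model name come from the setup described earlier.

```python
# Quick sanity check after pulling nomic-embed-text into Ollama: related
# sentences should score higher cosine similarity than unrelated ones.
import math
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

a = embed("Copilot summarizes the web page you are reading.")
b = embed("The assistant produces a short summary of the current article.")
c = embed("The recipe calls for two cups of flour and a pinch of salt.")

print("related  :", round(cosine(a, b), 3))   # expect the higher score
print("unrelated:", round(cosine(a, c), 3))   # expect the lower score
```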

The race between cloud assistants and local LLM stacks is not purely technical — it’s a UX and trust contest. Copilot currently wins on convenience and conversational finesse for web summarization; local stacks win on privacy, control, and customization. The decision of which to use is driven by what the task values most: immediate speed and idea generation, or control and bespoke behavior. The good news for users is that both roads are improving quickly — meaning more choice, better tools, and fewer compromises.

Source: Windows Central, “Copilot still smokes local AI in one of my most important use cases”