Ollama’s latest Windows 11 GUI makes running local LLMs far more accessible, but the single biggest lever for speed on a typical desktop is not a faster GPU driver or a hidden setting — it’s the model’s context length. Shortening the context window from tens of thousands of tokens to a few thousand can turn a model that stalls on CPU into one that saturates your GPU and produces answers many times faster, while leaving large-context capability available when you actually need it. The practical how-to is straightforward: use the GUI slider for quick adjustments, or use the Ollama CLI to set and persist a tuned context length (or create multiple saved model variants) so you can switch between “snappy” and “long-memory” modes without reconfiguring every session. This article explains why context length matters, how it affects VRAM and tokens-per-second, the exact Ollama commands and GUI steps to change it, best-practice tuning for common GPUs, and the trade-offs and risks to keep in mind. (github.uint.cloud)
Source: Windows Central I’ve Been Making This Mistake With Ollama Local AI on Windows and It Sucked Away Performance — Here’s How to Fix It
Background / Overview
Local LLM tooling has moved quickly from DIY terminal scripts to polished desktop apps. Ollama’s Windows app removes much of the command-line friction and exposes basic controls — including a context-length slider — in a friendly UI, but the terminal remains the place for precision tuning, saving model presets, and inspecting performance metrics. Running models locally preserves privacy and eliminates cloud latency, but it also exposes you to real hardware limits: context windows that big cloud services handle easily can overwhelm a single desktop GPU or force Ollama to fall back to CPU execution. The same author who tested gpt-oss:20b on a Windows 11 desktop found dramatic performance shifts simply by changing the context setting.

Local-first model execution is attractive, but it comes with a responsibility: to know how to tune model parameters for your hardware. This article gives those practical, verified knobs.
What is context length and why it matters
The technical meaning
- Context length (context window / num_ctx) = the number of tokens the model can “see” when generating each new token. Tokens are sub-word units; longer prompts, larger documents, or prolonged conversations consume more tokens.
- Most transformer-style LLMs use self-attention, which requires computing relationships across tokens. That computation and the supporting memory use grow quickly with sequence length.
Default vs maximum
- Ollama applies a conservative default context length of 2,048 tokens when running models so it doesn't load a huge KV cache by default, but many models publish much larger maximum context capacities (32k, 64k, 128k). You can raise the running context with Ollama’s parameters, though doing so increases VRAM and RAM needs. (github.uint.cloud)
- Some newly published models (including various open-weight releases) advertise support for up to 128k tokens, but expecting that capacity to run locally on a single consumer GPU without careful planning is optimistic. High context capacity is valuable, but it requires commensurate memory. (pcgamer.com)
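To see what a particular model advertises before you touch anything, ollama show prints the model's metadata, including its published context length (the exact output layout varies by Ollama version):

ollama show gpt-oss:20b
# Look for the "context length" entry in the model info: that is the published maximum,
# not the context Ollama will actually use until you raise num_ctx yourself.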
The performance trade-off: tokens/sec, VRAM, and CPU fallback
How increased context affects performance
- Larger context windows increase the amount of memory used for the model’s KV cache (key/value tensors saved during autoregressive generation) and increase attention compute, which slows token-generation rate and raises VRAM usage; a rough sizing sketch follows below.
- If Ollama cannot load a model or its KV cache onto GPU VRAM, it will run mostly on CPU + system RAM, producing much lower tokens-per-second throughput.
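To get a feel for how fast the KV cache grows, you can estimate it as roughly 2 (keys and values) x layers x KV heads x head dimension x bytes per value x context tokens. The PowerShell sketch below uses illustrative architecture numbers, not those of any specific model, and assumes an fp16 cache; quantized KV caches shrink the total further.

$layers = 48; $kvHeads = 8; $headDim = 128; $bytesPerValue = 2   # illustrative values, fp16 cache
foreach ($ctx in 4096, 32768) {
    $kvBytes = 2 * $layers * $kvHeads * $headDim * $bytesPerValue * $ctx
    "{0,6} tokens -> ~{1:N1} GB of KV cache" -f $ctx, ($kvBytes / 1GB)
}
# With these numbers: ~0.75 GB of cache at 4k tokens versus ~6 GB at 32k, before the weights themselves.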
Why GPU usage matters
- GPUs (with the right VRAM and drivers) are typically many times faster for transformer inference than CPUs. When the model is in GPU memory, token generation rates jump dramatically.
- If the model and its KV cache do not fit in VRAM, the system either swaps between CPU and GPU or runs entirely on CPU, both of which hit throughput and latency hard. Ollama’s process listing tells you whether a model is loaded on CPU or GPU. (github.uint.cloud)
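For reference, the listing looks roughly like this (values are illustrative and columns can differ slightly between Ollama versions):

ollama ps
# NAME           ID              SIZE     PROCESSOR    UNTIL
# gpt-oss:20b    f2b8351c629c    14 GB    100% GPU     4 minutes from now
# Any CPU share in the Processor column means part of the model spilled out of VRAM.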
Model-specific realities
- New open-weight models can require substantial VRAM: experiments and reporting show gpt-oss:20b runs on GPUs with ~16GB VRAM, while a 120B variant can demand tens of GBs (or a multi-GPU/80GB card) to run with large contexts. These requirements vary by quantization and engine (GGUF / Q8 / Q4, etc.), so exact numbers depend on the model file and quantization strategy. Treat quoted VRAM minima as approximate and verify for your chosen model. (pcgamer.com, reddit.com)
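A quick sanity check before choosing a context length is the model's on-disk size; ollama list shows it, and that figure is a rough floor for the memory the weights need once loaded, with the KV cache coming on top (sizes below are illustrative):

ollama list
# NAME           ID              SIZE     MODIFIED
# gpt-oss:20b    f2b8351c629c    13 GB    2 days ago
# If the weights alone approach your VRAM, there is little headroom left for a large context.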
How to change Ollama’s context length (GUI and CLI)
GUI — fastest change for casual use
- Open the Ollama desktop app, go to Settings, and look for the context length slider. The GUI exposes fixed interval choices (roughly from 4k up to 128k depending on the app version).
- Move the slider to a lower value for everyday short queries and to a higher value for processing long documents or long-form sessions.
- GUI changes are easiest for quick experiments, but they use fixed steps and don’t let you create multiple persistent custom model variants as cleanly as the CLI does.
CLI — precise control and persistence
The command line gives you stronger control and lets you save a model variant with a fixed context setting for later use.
- Launch the model interactively:
  - Open PowerShell (or Terminal).
  - Run:
    ollama run <modelname>
- From inside the model REPL, set the context:
    /set parameter num_ctx 8192
  That sets the running usable context window to 8,192 tokens. Replace 8192 with your chosen value.
- Optionally save a copy of the model with that parameter baked in:
    /save mymodelname-8k
  Saving creates a model variant you can launch again directly. Be mindful: saving many variants increases storage use. Community reports and repo issues indicate the /save command is available but has had some quirks in past versions; if you encounter errors, updating Ollama and checking modelfile permissions helps. (github.uint.cloud, github.com, reddit.com)
- Use the CLI parameter flag to avoid interactive mode and avoid loading the model twice:
    ollama run mixtral --parameter num_ctx=4096 --parameter temperature=0.1 "Hello!"
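If the --parameter flag isn't available in your installed version, the same one-off override works through Ollama's local REST API, which accepts num_ctx inside the request's options object. A minimal PowerShell sketch (the model name is just an example):

$body = @{
    model   = "gpt-oss:20b"            # example model name
    prompt  = "Hello!"
    stream  = $false
    options = @{ num_ctx = 4096 }      # per-request context override
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post -ContentType "application/json" -Body $body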
How to benchmark and check whether a model uses the GPU
- Run the model with verbose output to get token-rate metrics:
    ollama run gemma3:12b --verbose
- The output after each response includes a performance report. Look for the eval rate (tokens per second) of the latest response as the immediate speed indicator. The Windows Central experiment used this method to compare speeds across context lengths.
- Exit the REPL (/bye) and run:
    ollama ps
- The ollama ps listing shows loaded models and a Processor column such as 100% GPU, 100% CPU, or a split like 48%/52% CPU/GPU. That tells you whether the model was successfully placed in VRAM or was kept in system memory. Ollama’s FAQ documents this command and its output formatting. (github.uint.cloud)
- Cross-check with system tools:
  - On Windows, use Task Manager, NVIDIA’s System Management Interface (nvidia-smi), or GPU monitoring utilities to confirm VRAM usage and GPU load while Ollama is running. If ollama ps reports 100% CPU or low GPU usage, reduce the context length or choose a quantized model that fits in VRAM.
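To compare several context lengths without retyping prompts, the same API can be scripted: a non-streaming response includes eval_count (tokens generated) and eval_duration (time in nanoseconds), from which tokens per second follows. A rough sketch, with the model and prompt as placeholders:

$model = "gpt-oss:20b"                 # substitute the model you are testing
foreach ($ctx in 16384, 8192, 4096) {
    $body = @{
        model   = $model
        prompt  = "Summarise the benefits of running LLMs locally in three sentences."
        stream  = $false
        options = @{ num_ctx = $ctx }
    } | ConvertTo-Json
    $r = Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post -ContentType "application/json" -Body $body
    # eval_count = tokens generated, eval_duration = time spent generating, in nanoseconds
    "{0,6}-token context -> {1:N1} tokens/sec" -f $ctx, ($r.eval_count / ($r.eval_duration / 1e9))
}

Run ollama ps between iterations to see at which context the Processor column tips from GPU toward CPU.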
Practical tuning guide: recommended context windows by hardware class
The numbers below are pragmatic starting points based on common GPU VRAM sizes and community reports. Every model, quantization, and OS driver version can shift these thresholds, so benchmark for your own setup.
- Integrated / CPU-only machines (no compatible GPU)
- Recommended: 2k–4k tokens. Keep context conservative; otherwise generation will be slow and CPU-bound.
- 6–8 GB VRAM (e.g., mainstream RTX 3060-class desktop cards or RTX 4060 laptop GPUs)
- Recommended: 2k–8k tokens depending on model size and quantization.
- Use small or quantized models (8B–13B) and test with --verbose.
- 12–16 GB VRAM (e.g., RTX 4080 non-Max-Q, higher-end laptop 4080, or consumer 5080-style cards with 16 GB)
- Recommended: 4k–16k tokens for 13B–20B models if quantized (Q8_0 / Q4).
- The Windows Central test used an RTX 5080 (16 GB) and found 8k often saturated the GPU; 4k gave the best throughput. Expect similar behavior with other 16 GB cards, but verify.
- 24–48+ GB VRAM (e.g., high-end 4090, workstation cards)
- Recommended: 8k–32k tokens for 20B–70B models (quantized); larger contexts are feasible but benchmark to confirm.
- 80 GB+ VRAM (H100-class or multi-GPU aggregation)
- Recommended: 32k–128k tokens for the largest models where supported. Models that demand 80+ GB (such as 120B-class weights with very large context windows) are feasible only here. (pcgamer.com)
- Model quantization (Q8, Q4, etc.) can dramatically reduce VRAM requirements, allowing larger contexts on the same card. But quantization interacts with model behavior and may reduce raw accuracy or change generation characteristics.
- Model architecture and the engine’s KV cache implementation also affect memory needs; community reports show different models (Llama, Gemma, gpt-oss) and engine versions may behave differently. If in doubt, start low and increase num_ctx until you hit acceptable GPU usage and throughput. (reddit.com)
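On NVIDIA hardware, nvidia-smi's query mode gives a quick read on how much VRAM headroom a given num_ctx leaves; re-run it while a response is generating:

nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
# If memory.used sits near memory.total during generation, the next step up in num_ctx
# is likely to spill part of the model or its KV cache into system RAM.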
Advanced: creating multiple model presets and modelfiles
Why create variants?
- If you frequently alternate between short interactions and long-document work, creating saved model variants lets you launch a tuned copy without reconfiguring parameters each time.
- /save inside the REPL — set num_ctx interactively, then /save modelname-8k. This saves the runtime configuration into a model entry you can launch directly. (Community feedback shows /save exists but has had edge-case bugs; keep Ollama updated if you hit issues.) (reddit.com, github.com)
- Create a Modelfile — copy the model’s modelfile, add a PARAMETER num_ctx 8192 line (or similar), and run ollama create newmodelname --file edit.modelfile. This approach is explicit and reproducible, and it’s the recommended way for production workflows that need version control.
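As a concrete sketch of the Modelfile route (the file and variant names are examples, and the base model must already be pulled):

# Contents of edit.modelfile
FROM gpt-oss:20b
PARAMETER num_ctx 8192

# Register and launch the variant
ollama create gpt-oss-20b-8k -f .\edit.modelfile
ollama run gpt-oss-20b-8k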
Troubleshooting common issues
- Model refuses to use GPU / shows 100% CPU:
  - Reduce num_ctx and retry.
  - Verify GPU driver and CUDA/cuDNN runtime (if using NVIDIA), and ensure Ollama supports your GPU/driver combo. Ollama’s FAQ and troubleshooting docs explain GPU compatibility checks. (github.uint.cloud)
- Slow after increasing context:
  - This is expected. Try halving num_ctx and check tokens/sec with --verbose.
- /save fails or creates odd model names:
  - Some early versions had quirks with /save syntax or invalid names; update Ollama and consider creating a modelfile if /save errors persist. Inspect the modelfile created by ollama show --modelfile modelname and modify manually if needed (see the inspection example after this list). (github.com, reddit.com)
- Model claims a larger max context than Ollama can use:
  - A model’s published max context is an upper bound. Ollama may default to 2048 or another safe running value until you set num_ctx. You can increase num_ctx up to the model’s max, subject to memory constraints. Community discussion confirms that defaults and published capacity can differ. (reddit.com)
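When /save misbehaves, confirm whether the saved variant actually carries the parameter by dumping its modelfile (the variant name is an example):

ollama show --modelfile gpt-oss-20b-8k
# Confirm the output contains a line such as:
#   PARAMETER num_ctx 8192
# If it is missing, rebuild the variant from an edited Modelfile instead of relying on /save.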
Best practices and recommended workflows
- Use the GUI slider for quick experiments and the CLI for reproducible setups.
- For day-to-day Q&A or code-generation, prefer 4k–8k contexts on consumer GPUs — this is fast and usually sufficient.
- For long documents or RAG (retrieval-augmented generation) workloads, use selective retrieval + chunking rather than one huge prompt whenever possible. Chunk the doc and pass the most relevant chunks to the model.
- Create two saved variants per heavy model: a performance preset (smaller num_ctx) and a long-context preset — then switch as needed (a concrete example follows after this list).
- Monitor temperature, fan, and power: long inference runs at high GPU usage generate heat and draw significant power. Don’t assume short interactive sessions will have the same thermal profile as long, high-context runs.
- Keep your Ollama install and GPU drivers up to date; new versions frequently contain performance fixes and parameter options (for example, Ollama added a --parameter CLI flag to improve usability). (github.com, github.uint.cloud)
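The two-preset recommendation above can be as simple as two Modelfiles built from the same base; the names and values below are examples:

# fast.modelfile contains: FROM gpt-oss:20b and PARAMETER num_ctx 4096
# long.modelfile contains: FROM gpt-oss:20b and PARAMETER num_ctx 32768
ollama create gpt-oss-20b-fast -f .\fast.modelfile
ollama create gpt-oss-20b-long -f .\long.modelfile
ollama run gpt-oss-20b-fast        # everyday Q&A and code snippets
ollama run gpt-oss-20b-long        # long documents and extended sessions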
Risks, limitations, and what to watch out for
- Storage explosion: Saving many model variants that each duplicate large weight files can consume huge disk space.
- Thermals and hardware longevity: Sustained high-VRAM usage and GPU saturation increase heat and electrical draw; ensure good cooling and power provisioning.
- Accuracy trade-offs: Drastically lowering context may speed up generation but can degrade the model’s ability to recall earlier conversation turns or long-document context. Shorter contexts are for short tasks — don’t expect them to retain long histories.
- Model licensing and legal constraints: Running open-weight models locally gives freedom, but always check the model license and vendor terms before production use.
- Unofficial advice variance: Community posts and threads are valuable but sometimes inconsistent; always verify with Ollama’s official docs or the model publisher for critical production decisions. (github.uint.cloud, github.com)
Quick checklist: tune Ollama for speed on Windows 11
- Update Ollama and GPU drivers.
- Choose the right model and quantization (smaller/quantized models run faster).
- Start with num_ctx=2048 and benchmark with:
  - ollama run <model> --verbose
  - /bye, then ollama ps
- If GPU usage is low, reduce num_ctx in steps (e.g., 16k → 8k → 4k) and re-benchmark.
- When you find a sweet spot, save a preset (CLI /save or create a modelfile).
- Use retrieval + chunking for large documents rather than a single enormous prompt.
- Monitor thermals; long sessions can heat the GPU considerably.
Conclusion
Ollama brings local LLMs to Windows 11 users with a convenient GUI and a powerful CLI, but raw performance is governed by hardware-aware choices — above all, the context length. Reducing the context window is a highly effective, low-risk way to accelerate token generation and ensure your model runs on GPU instead of CPU, delivering a much more responsive local AI experience. Use the GUI slider for quick experimentation and the CLI (with --parameter num_ctx= or /set parameter num_ctx) for precise, repeatable tuning and to create saved model variants for different workflows. Benchmark with --verbose and ollama ps, and treat model max-context claims as aspirational unless you’ve validated VRAM, quantization, and engine behavior on your machine. Practical tuning transforms local LLMs from slow lab demos into useful everyday tools — but do it intentionally, and back changes with measurements.