Ollama’s latest Windows 11 GUI makes running local LLMs far more accessible, but the single biggest lever for speed on a typical desktop is not a faster GPU driver or a hidden setting — it’s the model’s context length. Shortening the context window from tens of thousands of tokens to a few thousand can turn a model that stalls on CPU into one that saturates your GPU and produces answers many times faster, while leaving large-context capability available when you actually need it. The practical how-to is straightforward: use the GUI slider for quick adjustments, or use the Ollama CLI to set and persist a tuned context length (or create multiple saved model variants) so you can switch between “snappy” and “long-memory” modes without reconfiguring every session. This article explains why context length matters, how it affects VRAM and tokens-per-second, the exact Ollama commands and GUI steps to change it, best-practice tuning for common GPUs, and the trade-offs and risks to keep in mind. (github.uint.cloud)
Source: Windows Central I’ve Been Making This Mistake With Ollama Local AI on Windows and It Sucked Away Performance — Here’s How to Fix It
Background / Overview
Local LLM tooling has moved quickly from DIY terminal scripts to polished desktop apps. Ollama’s Windows app removes much of the command-line friction and exposes basic controls — including a context-length slider — in a friendly UI, but the terminal remains the place for precision tuning, saving model presets, and inspecting performance metrics. Running models locally preserves privacy and eliminates cloud latency, but it also exposes you to real hardware limits: context windows that big cloud services handle easily can overwhelm a single desktop GPU or force Ollama to fall back to CPU execution. The same author who tested gpt-oss:20b on a Windows 11 desktop found dramatic performance shifts simply by changing the context setting.

Local-first model execution is attractive, but it comes with a responsibility: to know how to tune model parameters for your hardware. This article gives those practical, verified knobs.
What is context length and why it matters
The technical meaning
- Context length (context window / num_ctx) = the number of tokens the model can “see” when generating each new token. Tokens are sub-word units; longer prompts, larger documents, or prolonged conversations consume more tokens.
- Most transformer-style LLMs use self-attention, which requires computing relationships across tokens. That computation and the supporting memory use grow quickly with sequence length.
Default vs maximum
- Ollama applies a conservative default context length of 2,048 tokens when running models so it doesn't load a huge KV cache by default, but many models publish much larger maximum context capacities (32k, 64k, 128k). You can raise the running context with Ollama’s parameters, though doing so increases VRAM and RAM needs. (github.uint.cloud)
- Some newly published models (including various open-weight releases) advertise support for up to 128k tokens, but expecting that capacity to run locally on a single consumer GPU without careful planning is optimistic. High context capacity is valuable, but it requires commensurate memory. (pcgamer.com)
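To see what a particular model advertises before you touch anything, ollama show prints the model's metadata, including its published context length (the exact output layout varies by Ollama version):

ollama show gpt-oss:20b
# Look for the "context length" entry in the model info: that is the published maximum,
# not the context Ollama will actually use until you raise num_ctx yourself.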
The performance trade-off: tokens/sec, VRAM, and CPU fallback
How increased context affects performance
- Larger context windows increase the amount of memory used for the model’s KV cache (key/value tensors saved during autoregressive generation) and increase attention compute, which slows token-generation rate and raises VRAM usage; a rough sizing sketch follows below.
- If Ollama cannot load a model or its KV cache onto GPU VRAM, it will run mostly on CPU + system RAM, producing much lower tokens-per-second throughput.
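To get a feel for how fast the KV cache grows, you can estimate it as roughly 2 (keys and values) x layers x KV heads x head dimension x bytes per value x context tokens. The PowerShell sketch below uses illustrative architecture numbers, not those of any specific model, and assumes an fp16 cache; quantized KV caches shrink the total further.

$layers = 48; $kvHeads = 8; $headDim = 128; $bytesPerValue = 2   # illustrative values, fp16 cache
foreach ($ctx in 4096, 32768) {
    $kvBytes = 2 * $layers * $kvHeads * $headDim * $bytesPerValue * $ctx
    "{0,6} tokens -> ~{1:N1} GB of KV cache" -f $ctx, ($kvBytes / 1GB)
}
# With these numbers: ~0.75 GB of cache at 4k tokens versus ~6 GB at 32k, before the weights themselves.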
Why GPU usage matters
- GPUs (with the right VRAM and drivers) are typically many times faster for transformer inference than CPUs. When the model is in GPU memory, token generation rates jump dramatically.
- If the model and its KV cache do not fit in VRAM, the system either swaps between CPU and GPU or runs entirely on CPU, both of which hit throughput and latency hard. Ollama’s process listing tells you whether a model is loaded on CPU or GPU. (github.uint.cloud)
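For reference, the listing looks roughly like this (values are illustrative and columns can differ slightly between Ollama versions):

ollama ps
# NAME           ID              SIZE     PROCESSOR    UNTIL
# gpt-oss:20b    f2b8351c629c    14 GB    100% GPU     4 minutes from now
# Any CPU share in the Processor column means part of the model spilled out of VRAM.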
Model-specific realities
- New open-weight models can require substantial VRAM: experiments and reporting show gpt-oss:20b runs on GPUs with ~16GB VRAM, while a 120B variant can demand tens of GBs (or a multi-GPU/80GB card) to run with large contexts. These requirements vary by quantization and engine (GGUF / Q8 / Q4, etc.), so exact numbers depend on the model file and quantization strategy. Treat quoted VRAM minima as approximate and verify for your chosen model. (pcgamer.com, reddit.com)
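A quick sanity check before choosing a context length is the model's on-disk size; ollama list shows it, and that figure is a rough floor for the memory the weights need once loaded, with the KV cache coming on top (sizes below are illustrative):

ollama list
# NAME           ID              SIZE     MODIFIED
# gpt-oss:20b    f2b8351c629c    13 GB    2 days ago
# If the weights alone approach your VRAM, there is little headroom left for a large context.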
How to change Ollama’s context length (GUI and CLI)
GUI — fastest change for casual use
- Open the Ollama desktop app, go to Settings, and look for the context length slider. The GUI exposes fixed interval choices (roughly from 4k up to 128k depending on the app version).
- Move the slider to a lower value for everyday short queries and to a higher value for processing long documents or long-form sessions.
- GUI changes are easiest for quick experiments, but they use fixed steps and don’t let you create multiple persistent custom model variants as cleanly as the CLI does.
CLI — precise control and persistence
The command line gives you stronger control and lets you save a model variant with a fixed context setting for later use.
- Launch the model interactively:
  - Open PowerShell (or Terminal).
  - Run:
    ollama run <modelname>
- From inside the model REPL, set the context:
    /set parameter num_ctx 8192
  That sets the running usable context window to 8,192 tokens. Replace 8192 with your chosen value.
- Optionally save a copy of the model with that parameter baked in:
    /save mymodelname-8k
  Saving creates a model variant you can launch again directly. Be mindful: saving many variants increases storage use. Community reports and repo issues indicate the /save command is available but has had some quirks in past versions; if you encounter errors, updating Ollama and checking modelfile permissions helps. (github.uint.cloud, github.com, reddit.com)
- Use the CLI parameter flag to avoid interactive mode and avoid loading the model twice:
    ollama run mixtral --parameter num_ctx=4096 --parameter temperature=0.1 "Hello!"
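If the --parameter flag isn't available in your installed version, the same one-off override works through Ollama's local REST API, which accepts num_ctx inside the request's options object. A minimal PowerShell sketch (the model name is just an example):

$body = @{
    model   = "gpt-oss:20b"            # example model name
    prompt  = "Hello!"
    stream  = $false
    options = @{ num_ctx = 4096 }      # per-request context override
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post -ContentType "application/json" -Body $body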
How to benchmark and check whether a model uses the GPU
- Run the model with verbose output to get token-rate metrics:
    ollama run gemma3:12b --verbose
- The output after each response includes a performance report. Look for the eval rate (tokens per second) of the latest response as the immediate speed indicator. The Windows Central experiment used this method to compare speeds across context lengths.
- Exit the REPL (/bye) and run:
    ollama ps
- The ollama ps listing shows loaded models and a Processor column such as 100% GPU, 100% CPU, or a split like 48%/52% CPU/GPU. That tells you whether the model was successfully placed in VRAM or was kept in system memory. Ollama’s FAQ documents this command and its output formatting. (github.uint.cloud)
- Cross-check with system tools:
  - On Windows, use Task Manager, NVIDIA’s System Management Interface (nvidia-smi), or GPU monitoring utilities to confirm VRAM usage and GPU load while Ollama is running. If ollama ps reports 100% CPU or low GPU usage, reduce the context length or choose a quantized model that fits in VRAM.
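To compare several context lengths without retyping prompts, the same API can be scripted: a non-streaming response includes eval_count (tokens generated) and eval_duration (time in nanoseconds), from which tokens per second follows. A rough sketch, with the model and prompt as placeholders:

$model = "gpt-oss:20b"                 # substitute the model you are testing
foreach ($ctx in 16384, 8192, 4096) {
    $body = @{
        model   = $model
        prompt  = "Summarise the benefits of running LLMs locally in three sentences."
        stream  = $false
        options = @{ num_ctx = $ctx }
    } | ConvertTo-Json
    $r = Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post -ContentType "application/json" -Body $body
    # eval_count = tokens generated, eval_duration = time spent generating, in nanoseconds
    "{0,6}-token context -> {1:N1} tokens/sec" -f $ctx, ($r.eval_count / ($r.eval_duration / 1e9))
}

Run ollama ps between iterations to see at which context the Processor column tips from GPU toward CPU.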
Practical tuning guide: recommended context windows by hardware class
The numbers below are pragmatic starting points based on common GPU VRAM sizes and community reports. Every model, quantization, and OS driver version can shift these thresholds, so benchmark for your own setup.
- Integrated / CPU-only machines (no compatible GPU)
- Recommended: 2k–4k tokens. Keep context conservative; otherwise generation will be slow and CPU-bound.
- 6–8 GB VRAM (e.g., mainstream RTX 3060-class desktop cards or RTX 4060 laptop GPUs)
- Recommended: 2k–8k tokens depending on model size and quantization.
- Use small or quantized models (8B–13B) and test with --verbose.
- 12–16 GB VRAM (e.g., RTX 4080 non-Max-Q, higher-end laptop 4080, or consumer 5080-style cards with 16 GB)
- Recommended: 4k–16k tokens for 13B–20B models if quantized (Q8_0 / Q4).
- The Windows Central test used an RTX 5080 (16 GB) and found 8k often saturated the GPU; 4k gave the best throughput. Expect similar behavior with other 16 GB cards, but verify.
- 24–48+ GB VRAM (e.g., high-end 4090, workstation cards)
- Recommended: 8k–32k tokens for 20B–70B models (quantized); larger contexts are feasible but benchmark to confirm.
- 80 GB+ VRAM (H100-class or multi-GPU aggregation)
- Recommended: 32k–128k tokens for the largest models where supported. Models that demand 80+ GB (such as 120B-class weights with very large context windows) are feasible only here. (pcgamer.com)
- Model quantization (Q8, Q4, etc.) can dramatically reduce VRAM requirements, allowing larger contexts on the same card. But quantization interacts with model behavior and may reduce raw accuracy or change generation characteristics.
- Model architecture and the engine’s KV cache implementation also affect memory needs; community reports show different models (Llama, Gemma, gpt-oss) and engine versions may behave differently. If in doubt, start low and increase num_ctx until you hit acceptable GPU usage and throughput. (reddit.com)
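On NVIDIA hardware, nvidia-smi's query mode gives a quick read on how much VRAM headroom a given num_ctx leaves; re-run it while a response is generating:

nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
# If memory.used sits near memory.total during generation, the next step up in num_ctx
# is likely to spill part of the model or its KV cache into system RAM.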
Advanced: creating multiple model presets and modelfiles
Why create variants?
- If you frequently alternate between short interactions and long-document work, creating saved model variants lets you launch a tuned copy without reconfiguring parameters each time.
- /save inside the REPL — set num_ctx interactively, then /save modelname-8k. This saves the runtime configuration into a model entry you can launch directly. (Community feedback shows /save exists but has had edge-case bugs; keep Ollama updated if you hit issues.) (reddit.com, github.com)
- Create a Modelfile — copy the model’s modelfile, add a PARAMETER num_ctx 8192 line (or similar), and run ollama create newmodelname --file edit.modelfile. This approach is explicit and reproducible, and it’s the recommended way for production workflows that need version control.
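As a concrete sketch of the Modelfile route (the file and variant names are examples, and the base model must already be pulled):

# Contents of edit.modelfile
FROM gpt-oss:20b
PARAMETER num_ctx 8192

# Register and launch the variant
ollama create gpt-oss-20b-8k -f .\edit.modelfile
ollama run gpt-oss-20b-8k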
Troubleshooting common issues
- Model refuses to use GPU / shows 100% CPU:
  - Reduce num_ctx and retry.
  - Verify GPU driver and CUDA/cuDNN runtime (if using NVIDIA), and ensure Ollama supports your GPU/driver combo. Ollama’s FAQ and troubleshooting docs explain GPU compatibility checks. (github.uint.cloud)
- Slow after increasing context:
  - This is expected. Try halving num_ctx and check tokens/sec with --verbose.
- /save fails or creates odd model names:
  - Some early versions had quirks with /save syntax or invalid names; update Ollama and consider creating a modelfile if /save errors persist. Inspect the modelfile created by ollama show --modelfile modelname and modify manually if needed (see the inspection example after this list). (github.com, reddit.com)
- Model claims a larger max context than Ollama can use:
  - A model’s published max context is an upper bound. Ollama may default to 2048 or another safe running value until you set num_ctx. You can increase num_ctx up to the model’s max, subject to memory constraints. Community discussion confirms that defaults and published capacity can differ. (reddit.com)
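When /save misbehaves, confirm whether the saved variant actually carries the parameter by dumping its modelfile (the variant name is an example):

ollama show --modelfile gpt-oss-20b-8k
# Confirm the output contains a line such as:
#   PARAMETER num_ctx 8192
# If it is missing, rebuild the variant from an edited Modelfile instead of relying on /save.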
Best practices and recommended workflows
- Use the GUI slider for quick experiments and the CLI for reproducible setups.
- For day-to-day Q&A or code-generation, prefer 4k–8k contexts on consumer GPUs — this is fast and usually sufficient.
- For long documents or RAG (retrieval-augmented generation) workloads, use selective retrieval + chunking rather than one huge prompt whenever possible. Chunk the doc and pass the most relevant chunks to the model.
- Create two saved variants per heavy model: a performance preset (smaller num_ctx) and a long-context preset — then switch as needed (a concrete example follows after this list).
- Monitor temperature, fan, and power: long inference runs at high GPU usage generate heat and draw significant power. Don’t assume short interactive sessions will have the same thermal profile as long, high-context runs.
- Keep your Ollama install and GPU drivers up to date; new versions frequently contain performance fixes and parameter options (for example, Ollama added a --parameter CLI flag to improve usability). (github.com, github.uint.cloud)
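The two-preset recommendation above can be as simple as two Modelfiles built from the same base; the names and values below are examples:

# fast.modelfile contains: FROM gpt-oss:20b and PARAMETER num_ctx 4096
# long.modelfile contains: FROM gpt-oss:20b and PARAMETER num_ctx 32768
ollama create gpt-oss-20b-fast -f .\fast.modelfile
ollama create gpt-oss-20b-long -f .\long.modelfile
ollama run gpt-oss-20b-fast        # everyday Q&A and code snippets
ollama run gpt-oss-20b-long        # long documents and extended sessions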
Risks, limitations, and what to watch out for
- Storage explosion: Saving many model variants that each duplicate large weight files can consume huge disk space.
- Thermals and hardware longevity: Sustained high-VRAM usage and GPU saturation increase heat and electrical draw; ensure good cooling and power provisioning.
- Accuracy trade-offs: Drastically lowering context may speed up generation but can degrade the model’s ability to recall earlier conversation turns or long-document context. Shorter contexts are for short tasks — don’t expect them to retain long histories.
- Model licensing and legal constraints: Running open-weight models locally gives freedom, but always check the model license and vendor terms before production use.
- Unofficial advice variance: Community posts and threads are valuable but sometimes inconsistent; always verify with Ollama’s official docs or the model publisher for critical production decisions. (github.uint.cloud, github.com)
Quick checklist: tune Ollama for speed on Windows 11
- Update Ollama and GPU drivers.
- Choose the right model and quantization (smaller/quantized models run faster).
- Start with num_ctx=2048 and benchmark with:
  - ollama run <model> --verbose
  - /bye, then ollama ps
- If GPU usage is low, reduce num_ctx in steps (e.g., 16k → 8k → 4k) and re-benchmark.
- When you find a sweet spot, save a preset (CLI /save or create a modelfile).
- Use retrieval + chunking for large documents rather than a single enormous prompt.
- Monitor thermals; long sessions can heat the GPU considerably.
Conclusion
Ollama brings local LLMs to Windows 11 users with a convenient GUI and a powerful CLI, but raw performance is governed by hardware-aware choices — above all, the context length. Reducing the context window is a highly effective, low-risk way to accelerate token generation and ensure your model runs on GPU instead of CPU, delivering a much more responsive local AI experience. Use the GUI slider for quick experimentation and the CLI (with --parameter num_ctx= or /set parameter num_ctx) for precise, repeatable tuning and to create saved model variants for different workflows. Benchmark with --verbose and ollama ps, and treat model max-context claims as aspirational unless you’ve validated VRAM, quantization, and engine behavior on your machine. Practical tuning transforms local LLMs from slow lab demos into useful everyday tools — but do it intentionally, and back changes with measurements.