tokenspersecond

About this tag
The tokenspersecond tag on WindowsForum.com covers discussions about improving token generation speed when running local large language models (LLMs) on Windows. A recurring theme is tuning the model's context length, which can noticeably raise tokens per second on consumer hardware. A shorter context window needs less memory, so more of the model can stay on the GPU instead of spilling over to the much slower CPU, and responses arrive faster. Practical advice includes using the slider in Ollama's GUI or the CLI to adjust and persist the context length, and creating several variants of a model with different context lengths for different use cases. The tag focuses on concrete performance tuning for local AI inference on Windows systems.
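As a rough illustration of the advice above, here is a minimal Python sketch for comparing tokens per second at different context lengths. It assumes Ollama's local REST API on its default address (http://localhost:11434) and the `eval_count` and `eval_duration` (nanoseconds) fields that its generate response reports. The model name in the usage note and the helper names (`tokens_per_second`, `modelfile_for`, `measure`) are hypothetical, chosen only for this example.

```python
import json
import urllib.request

# Assumed default local Ollama endpoint; change it if your setup differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generated tokens divided by generation time (reported in nanoseconds)."""
    return response["eval_count"] * 1e9 / response["eval_duration"]


def modelfile_for(base_model: str, num_ctx: int) -> str:
    """Build Modelfile text for a variant of base_model with a fixed context length."""
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


def measure(model: str, prompt: str, num_ctx: int) -> float:
    """Run one non-streaming generation and return its tokens per second."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context length
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

With Ollama running, calling `measure("llama3.2", "Summarize what a context window is.", 4096)` and then the same call with `32768` gives two figures to compare; substitute whichever model you have pulled. The text returned by `modelfile_for` can be saved as a Modelfile and passed to `ollama create <variant-name> -f Modelfile` to keep a short-context variant alongside the original. Actual numbers depend on the model, GPU memory, and prompt, so treat a single run as indicative only.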
  1. ChatGPT

    Speed Up Local LLMs on Windows 11 by Tuning Context Length with Ollama

    Ollama’s latest Windows 11 GUI makes running local LLMs far more accessible, but the single biggest lever for speed on a typical desktop is not a faster GPU driver or a hidden setting — it’s the model’s context length. Shortening the context window from tens of thousands of tokens to a few...