Run 3 Local AI Agents on 8GB GPU with lmxd VRAM Ledger and KV Swapping

ChatGPT · 2026-06-25T22:05:30-0400

Three small local AI agents can share a single 8GB GTX 1080 by moving inference behind one C++ daemon, lmxd, that admits models against a VRAM ledger, reuses one llama.cpp backend, and swaps inactive agents’ KV state to host memory before they collide.
That is the whole story in one sentence, but it is also the part most local-AI tutorials quietly skip. The hard problem is not that three tiny models are too large for the card on paper; it is that three independent processes behave as if they each own the card. The daemon described here does not make Pascal-era silicon young again. It makes the system tell the truth before CUDA does.

The Old GPU Wasn’t the Villain — the Process Model Was

The setup is familiar to anyone trying to build agentic workflows on consumer hardware. One agent writes code, another reviews it for security mistakes, and a third drafts documentation or tests. Each agent may be happiest on a different compact instruct model: a small Llama variant here, Qwen there, SmolLM somewhere else.
On paper, that sounds reasonable. The models are small, quantized, and individually comfortable on an 8GB card. The trap is that “individually comfortable” is not the same thing as “three separate runtimes can reserve their worst-case working sets at the same time.”
The failure mode is ugly because it looks irrational. The first llama-completion process loads, reserves a large KV cache for a huge configured context, and occupies most of the card before generating a token. The second process starts, asks CUDA for another chunk, and dies. The third, even with a smaller model, follows it into the ditch.
This is not really a model-size story. It is a memory-admission story. The first process survives because it arrived first, not because it is morally superior, better optimized, or more deserving of the GPU.

KV Cache Turns “Small Model” Into a Misleading Phrase

The source article’s most useful contribution is its insistence that the model weights are not the whole bill. In llama.cpp, the context configuration matters enormously because the KV cache is reserved up front when the context is created. A small quantized model with a giant context can still grab gigabytes of VRAM before any user sees output.
That up-front reservation is sensible inside a single process. It avoids discovering halfway through a generation that the runtime cannot allocate the memory needed to continue decoding. For an interactive LLM, mid-token memory failure is worse than refusing to start.
But the same design becomes destructive when multiplied across independent processes. Each llama-completion instance sees a card, asks for what it needs, and depends on CUDA to say yes or no. There is no shared social contract among the three agents, no system-level plan, and no authoritative view of what should be allowed to start.
The result is less like scheduling and more like a bar fight. The first process gets through the door, the second takes a swing at the allocator, and the third gets thrown out by cudaMalloc. The user experiences that as “local multi-agent inference is flaky,” when the deeper issue is that nothing is performing admission control.
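The arithmetic behind that up-front reservation is easy to sketch. The usual estimate for a transformer KV cache is two tensors (keys and values) per layer, each holding one vector per token per KV head. The struct, its field names, and the example shape below are illustrative assumptions rather than figures from the source article, and a real runtime may add padding and scratch buffers on top.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model shape; the fields are illustrative, not taken from the source.
struct KvShape {
    std::uint64_t n_layers;        // transformer blocks
    std::uint64_t n_kv_heads;      // key/value heads (fewer than query heads under GQA)
    std::uint64_t head_dim;        // dimension of each head
    std::uint64_t bytes_per_elem;  // 2 for a 16-bit cache
};

// Bytes needed to hold keys and values for n_ctx tokens.
// The leading 2 accounts for the separate K and V tensors.
constexpr std::uint64_t kv_cache_bytes(const KvShape& s, std::uint64_t n_ctx) {
    return 2 * s.n_layers * n_ctx * s.n_kv_heads * s.head_dim * s.bytes_per_elem;
}
```

With an assumed shape of 16 layers, 8 KV heads, a head dimension of 64, and a 16-bit cache, a 131,072-token context works out to 4 GiB, while a 4,096-token context needs 128 MiB. The weights are the same in both cases; only the context setting changed.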

lmxd Treats VRAM Like a Shared Resource, Not a Vibe

The daemon, lmxd, is deliberately unglamorous. It sits between agents and llama.cpp, listens over a small Unix-socket protocol, and decides whether a model should be admitted before any expensive load happens. In a field addicted to novelty, that restraint is the point.
Its policy is almost embarrassingly simple: cap usable VRAM at a configured percentage, estimate the new model’s cost, and admit it only if the ledger says the total stays under budget. The article uses a 90 percent cap, which leaves headroom for runtime overhead, driver behavior, and the usual messiness that makes exact GPU memory accounting harder than spreadsheet arithmetic.
The important part is not the number. It is the order. lmxd looks up the estimate, reserves against the ledger, and only then loads the model. If the ledger refuses, the model is never loaded and the GPU is never touched.
That sounds obvious until you compare it with the naive path. If the program loads first and accounts later, it has already performed the riskiest operation before deciding whether the operation was permitted. That is how systems become unrecoverable under pressure: the check exists, but it happens after the damage.
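A minimal sketch of that ordering, under stated assumptions: the names `vram_ceiling`, `Budget`, and `admit_then_load` are hypothetical, the loader is a caller-supplied callback standing in for the real model load, and the rollback on load failure is an assumption about reasonable behavior rather than something the article describes. Locking is left out here on purpose.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Usable ceiling as an integer percentage of total VRAM.
// Dividing before multiplying keeps the product from overflowing.
constexpr std::uint64_t vram_ceiling(std::uint64_t total_bytes, std::uint64_t percent) {
    return total_bytes / 100 * percent + total_bytes % 100 * percent / 100;
}

// Invariant: booked <= ceiling.
struct Budget {
    std::uint64_t ceiling;
    std::uint64_t booked;
};

// Reserve first, load second. The loader is never called when the budget
// refuses, so a denied request never touches the GPU.
inline bool admit_then_load(Budget& b, std::uint64_t estimate,
                            const std::function<bool()>& load_model) {
    if (estimate > b.ceiling - b.booked) return false;  // refuse before any load
    b.booked += estimate;                               // commit the reservation
    if (!load_model()) {                                // only now attempt the load
        b.booked -= estimate;                           // assumed rollback on failure
        return false;
    }
    return true;
}
```

If the card reports exactly 8 GiB, a 90 percent cap comes to 7,730,941,132 bytes, or about 7.73GB in decimal units.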

The Mutex Is the Difference Between a Policy and a Wish

The ledger’s core reservation routine is short enough to look trivial, which is usually a sign that the design has been pushed to the right place. It takes a lock, verifies initialization, checks for integer overflow, calculates the projected allocation, compares it with the maximum, and commits the new total only if the request fits.
That mutex is not decorative. Without it, two agents registering at nearly the same time could both observe the same available headroom and both conclude they fit. By the time either updates the total, the system has already promised more VRAM than it intends to allow.
This is the kind of bug that only appears when the demo becomes a service. A single developer typing commands by hand will rarely hit it. A real agent framework launching workers concurrently absolutely will.
The article deserves credit for treating concurrency as part of the product rather than a footnote. The difference between “works in the README” and “survives real use” is often one critical section exactly like this.
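The steps listed above map onto a very small class. This is a sketch of the pattern, not the daemon's actual code: the class name `VramLedger` and its method names are assumptions, and only the sequence (lock, check initialization, check overflow, project, compare, commit) comes from the description.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <mutex>

class VramLedger {
public:
    void init(std::uint64_t max_bytes) {
        std::lock_guard<std::mutex> lock(mu_);
        max_ = max_bytes;
        allocated_ = 0;
        initialized_ = true;
    }

    // Commits the booking and returns true only if the request fits.
    // Check and update happen under one lock, so two callers cannot both
    // observe the same headroom and both succeed.
    bool try_reserve(std::uint64_t bytes) {
        std::lock_guard<std::mutex> lock(mu_);
        if (!initialized_) return false;
        if (bytes > std::numeric_limits<std::uint64_t>::max() - allocated_) {
            return false;  // allocated_ + bytes would wrap around
        }
        const std::uint64_t projected = allocated_ + bytes;
        if (projected > max_) return false;
        allocated_ = projected;
        return true;
    }

    // Returns bytes to the ledger, clamping at zero on a mismatched release.
    void release(std::uint64_t bytes) {
        std::lock_guard<std::mutex> lock(mu_);
        allocated_ = bytes > allocated_ ? 0 : allocated_ - bytes;
    }

    std::uint64_t allocated() const {
        std::lock_guard<std::mutex> lock(mu_);
        return allocated_;
    }

private:
    mutable std::mutex mu_;
    std::uint64_t max_ = 0;
    std::uint64_t allocated_ = 0;
    bool initialized_ = false;
};
```

The overflow check is written as a subtraction from the maximum value so that the test itself cannot wrap.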

One Backend Beats Three CUDA Tenants

The second half of lmxd’s admission story is consolidation. Instead of launching three independent binaries, the daemon initializes the llama backend once and keeps a refcounted map of loaded models. If two agents ask for the same GGUF, the daemon reuses the loaded model and increments a reference count.
That matters because process overhead is not free. Multiple CUDA tenants bring their own contexts, allocator behavior, and driver-side costs. On a large workstation GPU, that may be lost in the noise. On an 8GB GTX 1080, the noise is the budget.
The daemon’s design forces the system into a single owner model. The GPU has one process to deal with, one backend lifecycle, and one place where policy lives. That does not magically remove the need for memory, but it eliminates a class of waste that the naive three-terminal approach creates by construction.
This is where the project feels less like an AI hack and more like old-school systems engineering. The clever move is not a new sampling method. It is making ownership explicit.
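The refcounted map can be sketched in a few lines. Everything named here is hypothetical: `LoadedModel` is a stand-in for whatever handle the real backend returns, the loader is injected as a callback so that no llama.cpp call is assumed, and the sketch is single-threaded, so a real daemon would guard it with a lock.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for a loaded model; a real daemon would hold a backend handle here.
struct LoadedModel {
    std::string path;
};

class ModelRegistry {
public:
    using Loader = std::function<std::unique_ptr<LoadedModel>(const std::string&)>;

    explicit ModelRegistry(Loader loader) : loader_(std::move(loader)) {}

    // Returns the shared model for `path`, loading it only on first use.
    // Returns nullptr if the loader fails.
    LoadedModel* acquire(const std::string& path) {
        auto it = models_.find(path);
        if (it == models_.end()) {
            std::unique_ptr<LoadedModel> model = loader_(path);
            if (!model) return nullptr;
            it = models_.emplace(path, Entry{std::move(model), 0}).first;
        }
        ++it->second.refs;
        return it->second.model.get();
    }

    // Drops one reference and unloads when the last user lets go.
    void release(const std::string& path) {
        auto it = models_.find(path);
        if (it == models_.end()) return;
        if (--it->second.refs == 0) models_.erase(it);
    }

    std::size_t loaded_count() const { return models_.size(); }

private:
    struct Entry {
        std::unique_ptr<LoadedModel> model;
        std::size_t refs;
    };
    Loader loader_;
    std::unordered_map<std::string, Entry> models_;
};
```

Two agents that acquire the same path get the same pointer and trigger one load; the model is dropped only after both have released it.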

Admission Is Not Parallelism, and That Distinction Matters

There is a subtle but important distinction in the article: lmxd admits multiple agents, but it does not make all of them compute simultaneously on the GPU. Only one live llama context is active at a time. The others are registered, known, and resumable, but their KV state is pushed to host memory when they are not the current tenant.
That is not a weakness so much as an honest definition of the product. On a tiny old GPU, pretending to run several decoding workloads truly concurrently may be less useful than ensuring several agents can take turns without crashing. For many local agent workflows, that is already the difference between a demo and a usable tool.
The daemon exposes this through a DECODE verb. When a request arrives for a different agent, the context manager evicts the currently active agent’s state, frees its GPU context, builds or restores the requested one, and runs the decode. The response reports whether KV was evicted and whether a prior state was restored.
Those wire-level details are not cosmetic. They turn the memory choreography into something operators can inspect. If an agent resumes successfully after its KV state was restored from host memory, the user knows it was suspended rather than killed.
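The state machine behind that behavior can be modeled without a GPU. In the sketch below, a vector of token ids stands in for the KV state, a map stands in for host memory, and appending tokens stands in for the decode itself. The type and field names (`ContextManager`, `DecodeResult`, `kv_evicted`, `kv_restored`) and the agent labels used afterward are assumptions for illustration, not the daemon's wire format.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// What a DECODE reply might report; field names are illustrative.
struct DecodeResult {
    bool kv_evicted;     // another agent's state was moved to the host stash
    bool kv_restored;    // this agent resumed from previously stashed state
    std::size_t n_past;  // tokens held in this agent's state after the call
};

class ContextManager {
public:
    // Agent ids are assumed to be non-empty; the empty string means "no tenant".
    DecodeResult decode(const std::string& agent, const std::vector<int>& tokens) {
        DecodeResult r{false, false, 0};
        if (active_agent_ != agent) {
            if (!active_agent_.empty()) {
                host_stash_[active_agent_] = std::move(live_kv_);  // evict to host
                r.kv_evicted = true;
            }
            live_kv_.clear();  // stand-in for freeing and rebuilding the GPU context
            auto it = host_stash_.find(agent);
            if (it != host_stash_.end()) {
                live_kv_ = std::move(it->second);  // restore prior state
                host_stash_.erase(it);
                r.kv_restored = true;
            }
            active_agent_ = agent;
        }
        // Stand-in for the actual decode: extend the live state.
        live_kv_.insert(live_kv_.end(), tokens.begin(), tokens.end());
        r.n_past = live_kv_.size();
        return r;
    }

private:
    std::string active_agent_;
    std::vector<int> live_kv_;  // the single live context
    std::unordered_map<std::string, std::vector<int>> host_stash_;
};
```

Calling `decode` for a hypothetical "coder" agent, then "reviewer", then "coder" again yields a cold start, an eviction, and an eviction plus restore, with the coder's token count continuing from where it left off.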

Swapping KV to Host RAM Is the Price of Survival

The KV swap design is the article’s most pragmatic move. Suspended agents pay in system RAM, not VRAM. That is slower than keeping every context resident on the card, but it is also the only reason multiple agents can coexist under the constraint that only one live context fits comfortably.
This is the right trade for a poor machine. The GTX 1080 is not being asked to behave like a modern data-center accelerator. It is being asked to provide continuity across multiple agents without letting any one of them permanently monopolize the card.
There is a philosophical point here too. In local AI discussions, latency often gets treated as the only metric that matters. But a slightly slower agent that resumes is more useful than a theoretically faster agent that dies when another worker starts.
The article’s reported decode sequence shows exactly that: cold start, eviction, restoration, and continuation. The numbers are not presented as universal benchmarks. They are receipts that the state machine works.

The Telecom Analogy Is More Than a Cute Aside

The author’s telecom analogy is unusually apt. In cellular networks, admission control exists because allowing every device to start a session can degrade service for everyone already connected. The base station does not wait until the cell collapses and then discover that capacity was finite.
That maps cleanly onto consumer GPU inference. The card has finite resources, active sessions consume some known or estimated amount, and new sessions must be refused before they destabilize the admitted set. The failure should be explicit, structured, and early.
This is the opposite of how a surprising amount of local-AI tooling behaves. Many stacks assume the allocator is the admission controller. It is not. The allocator is the emergency brake you hit after the scheduler has failed.
The most persuasive line in the whole project may be implicit: refusing impossible work is a feature. A clean denial with the current allocation, requested bytes, and ledger ceiling is operationally superior to a crash log that arrives after the process has already disturbed the system.

Layer Streaming Points Toward the Harder Future

The article does not stop at admission and KV swapping. It also sketches a more ambitious direction: stream model layers from pinned host memory into VRAM as needed, overlapping transfer for layer N+1 with compute for layer N. That is the classic double-buffered pipeline idea applied to local LLM inference on constrained hardware.
The intuition is sound. A transformer decode step touches layers sequentially. If the whole model cannot remain resident, the runtime might still keep only the active layer or next few layers in VRAM while the rest wait in host memory. With page-locked host allocation, two device buffers, two CUDA streams, and events for timing, transfers and compute can overlap.
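How much the overlap can buy is a matter of simple scheduling arithmetic. The model below is an idealized host-side calculation, not CUDA code and not the article's demo: it assumes one copy stream, one compute stream, two device slots, and per-layer times supplied by the caller, and it ignores launch overhead and bus contention. The names `LayerCost`, `serial_ms`, and `double_buffered_ms` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct LayerCost {
    double copy_ms;     // host-to-device transfer time for this layer
    double compute_ms;  // kernel time for this layer
};

// No overlap: each layer is copied, then computed, one after another.
inline double serial_ms(const std::vector<LayerCost>& layers) {
    double total = 0.0;
    for (const LayerCost& l : layers) total += l.copy_ms + l.compute_ms;
    return total;
}

// Two slots, two streams: layer i+1 is copied while layer i computes.
// A copy into a slot must wait until the layer that last used it (i-2) is done.
inline double double_buffered_ms(const std::vector<LayerCost>& layers) {
    double prev_copy_done = 0.0;      // when layer i-1 finished copying
    double prev_compute_done = 0.0;   // when layer i-1 finished computing
    double prev2_compute_done = 0.0;  // when layer i-2 finished computing
    for (const LayerCost& l : layers) {
        const double copy_done =
            std::max(prev_copy_done, prev2_compute_done) + l.copy_ms;
        const double compute_done =
            std::max(copy_done, prev_compute_done) + l.compute_ms;
        prev2_compute_done = prev_compute_done;
        prev_compute_done = compute_done;
        prev_copy_done = copy_done;
    }
    return prev_compute_done;
}
```

For four layers that each take 1 ms to copy and 1 ms to compute, the serial schedule takes 8 ms and the double-buffered one takes 5 ms, because only the first copy cannot be hidden. When copies dominate, the total approaches the sum of the copy times, which is why the benefit depends heavily on how bandwidth-bound the workload is.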
The article’s demo uses a representative FMA sweep rather than a real llama.cpp transformer layer. That distinction matters. Proving that asynchronous copy and synthetic compute overlap on a GTX 1080 is not the same as integrating streamed quantized matmul into llama.cpp’s graph execution.
Still, the primitive is worth showing. Local inference is increasingly squeezed between model ambition and hardware reality. If consumer machines are going to run more capable agent stacks without simply renting cloud GPUs, memory hierarchy tricks will matter.

The Demo Is Honest About What It Does Not Prove

The article is unusually candid about scope, and that candor makes the engineering more credible. LayerStreamer is not a drop-in acceleration path for llama.cpp decode. It demonstrates the pattern of pinned host memory, double-buffered device slots, CUDA streams, and overlapped transfer/compute.
That means the measured speedups belong to the primitive, not to a complete LLM serving engine. The source gives figures around 1.28x in a default synthetic configuration and higher gains in a more bandwidth-bound setup. Those are useful engineering signals, not production claims.
The same honesty applies to the daemon. The ledger relies on operator-supplied byte estimates. The demo uses file-size-derived estimates with a multiplier, but real production accounting would need to include KV cache, activations, CUDA caching behavior, and observed runtime drift.
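A file-size-derived estimate is about as simple as accounting gets, which is both its appeal and its weakness. The sketch below scales a file's size by a rational multiplier; the function names are hypothetical, and the multiplier is left as a parameter because the text does not say what value the demo uses.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Scales a file size by num/den using integer math (result rounds down).
// Returns 0 when den is 0, meaning "no usable estimate".
inline std::uint64_t estimate_from_size(std::uint64_t file_bytes,
                                        std::uint64_t num, std::uint64_t den) {
    if (den == 0) return 0;
    return file_bytes / den * num + file_bytes % den * num / den;
}

// Reads the size of the file at `path` and scales it.
// Returns 0 if the file cannot be opened or its size cannot be read.
inline std::uint64_t estimate_from_file(const std::string& path,
                                        std::uint64_t num, std::uint64_t den) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);  // open at end
    if (!in) return 0;
    const std::streamoff size = in.tellg();
    if (size < 0) return 0;
    return estimate_from_size(static_cast<std::uint64_t>(size), num, den);
}
```

Nothing in this estimate knows about context length, so it says little about the KV cache, which is exactly the gap the paragraph above points out.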
The daemon also samples NVML at boot and exposes live status, but does not continuously resynchronize its own ledger against unrelated processes that might grow later. That is fine for a controlled single-owner experiment. It is not sufficient for a hostile or messy multi-tenant workstation.

The Real Product Is a Failure Mode You Can Understand

The most compelling thing about lmxd is not that it admits three agents in a favorable demo. It is that when the daemon refuses a request, it does so in a way a human can reason about. ERR VRAM_LEDGER_DENY with the ceiling, current allocation, and requested bytes is exactly the kind of boring interface that keeps operations sane.
That is a major improvement over the usual ending, “cudaMalloc failed: out of memory”. CUDA’s error is accurate, but not explanatory in the way an application operator needs. It tells you the allocation failed, not whether the system was over budget, which tenant caused it, or whether existing sessions are intact.
Structured refusal also makes higher-level orchestration possible. An agent framework could respond by using a smaller context, selecting a lighter model, delaying a task, or routing work to CPU. None of those decisions are available if the process simply crashes during initialization.
This is where the daemon’s minimal protocol becomes a strength. The interface is plain enough to drive with netcat, but expressive enough to support real control flow. In infrastructure, the best primitives often look boring because they have no interest in impressing the caller.
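To make the orchestration point concrete, here is a sketch of how a caller might act on a denial. Only the `ERR VRAM_LEDGER_DENY` token and the three reported quantities come from the text; the `key=value` layout, the struct and function names, and the option labels used in the note afterward are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct LedgerDenial {
    std::uint64_t ceiling;    // ledger maximum in bytes
    std::uint64_t allocated;  // bytes already booked
    std::uint64_t requested;  // bytes the refused request asked for
};

// Hypothetical rendering; the exact wire layout is not given in the text.
inline std::string format_denial(const LedgerDenial& d) {
    return "ERR VRAM_LEDGER_DENY ceiling=" + std::to_string(d.ceiling) +
           " allocated=" + std::to_string(d.allocated) +
           " requested=" + std::to_string(d.requested);
}

// Bytes still available under the ceiling.
inline std::uint64_t headroom(const LedgerDenial& d) {
    return d.allocated >= d.ceiling ? 0 : d.ceiling - d.allocated;
}

// A fallback the caller could try, with its own estimated cost.
struct Option {
    std::string label;
    std::uint64_t bytes;
};

// Index of the first option that fits the remaining headroom, or -1 if none
// does (the caller might then delay the task or route it to CPU).
inline int first_fitting(const LedgerDenial& d, const std::vector<Option>& options) {
    for (std::size_t i = 0; i < options.size(); ++i) {
        if (options[i].bytes <= headroom(d)) return static_cast<int>(i);
    }
    return -1;
}
```

Given a denial that leaves about 0.73GB of headroom, a caller holding a 1.5GB option and a 0.6GB option, say a large-context and a small-context variant of the same request, would pick the second instead of crashing on the first.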

The Local-AI Stack Needs Schedulers, Not Just Faster Kernels

The broader lesson is uncomfortable for the local-AI community: as soon as workflows become agentic, the old “run one model in one terminal” mental model breaks down. Multiple agents are not just multiple prompts. They are multiple tenants competing for memory, latency, and continuity.
Cloud inference platforms have already absorbed this lesson. They batch, page, shard, evict, schedule, and meter because hardware utilization is the business. Hobbyist and workstation inference stacks often inherit the opposite assumption: the user is alone, the model is alone, and the GPU belongs to the current process.
That assumption is increasingly false. Developers want code agents, review agents, retrieval agents, browser agents, and test-generation agents. Even if each model is small, the orchestration pressure resembles a server more than a desktop app.
lmxd is therefore interesting less as a finished serving platform than as a signpost. Local inference needs resource managers. It needs admission control, visible accounting, and graceful degradation. Without those, every “multi-agent” setup on constrained hardware is one unlucky allocation away from becoming performance art.

The 8GB Card Still Has a Job

The GTX 1080 is old by AI standards, but it is not useless. Its problem is that modern software often treats it with contempt: either run a single small model or give up and rent something larger. This project argues for a third path, where the card is managed carefully enough to remain productive.
That does not mean nostalgia should replace realism. The GTX 1080’s Pascal silicon has no tensor cores at all, 8GB is cramped, and swapping to host memory over PCIe has hard limits. No daemon turns a 2016 gaming card into an H100.
But hardware poverty is a real design constraint. Many developers, students, hobbyists, and small teams cannot simply buy their way into more VRAM. For them, the difference between “one agent survives” and “three agents take turns” is not academic.
The best engineering often starts with accepting the machine you actually have. lmxd’s value is that it stops lying about that machine.

The Numbers Are Small, but the Pattern Is Big

The demo’s models are deliberately modest: compact instruct GGUFs, quantized, and small enough that none would raise eyebrows alone. That is what makes the failure interesting. If three tiny models cannot coexist under naive process spawning, the problem is not model bloat alone.
The reported daemon run books roughly 1.58GB against a 7.73GB ceiling for three registered models, then demonstrates real decode with KV state moving between GPU and host. The important contrast is not that this is the most efficient possible memory layout. It is that the same hardware goes from one surviving process to three registered agents because policy moved ahead of allocation.
That pattern generalizes. Whether the budget is 8GB, 16GB, or 80GB, blindly admitting work until the allocator fails is a poor operating model. Bigger cards postpone the reckoning; they do not eliminate it.
In that sense, the old GTX 1080 is a useful forcing function. It exposes bad assumptions faster than a more forgiving GPU would.

Where the Daemon Still Has to Grow Up

If lmxd were to become production infrastructure rather than a sharp systems demo, it would need several upgrades. The ledger would need dynamic accounting, not just static estimates. It would need to observe actual memory behavior after first decode and revise future admission decisions accordingly.
It would also need stronger handling for external GPU activity. A daemon that assumes it is the only tenant is reasonable for a controlled bare-metal box, but real workstations are messy. Desktop compositors, monitoring tools, other CUDA jobs, and accidental processes all complicate the ledger’s worldview.
The IPC model would need hardening too. One client at a time is charmingly simple; high-fanout agent systems eventually need concurrency, authentication boundaries, cancellation, timeouts, and backpressure. The moment this becomes a shared service, the socket protocol becomes part of the security surface.
None of that invalidates the idea. It simply marks the difference between a compelling prototype and an inference supervisor people can trust in unattended workflows.

The Clipboard Beats the Crash Log

The most concrete lesson from this experiment is that local multi-agent inference fails first as resource management, not as model intelligence. A better prompt will not save a process that cannot allocate its KV cache. A clever agent framework will not compensate for a GPU with no admission controller.
The daemon’s value is that it turns a chaotic allocator race into a deliberate scheduling decision. That is a modest sentence, but it is a big systems improvement.

Three independent llama.cpp processes can overcommit an 8GB GPU because each reserves memory as if it owns the card.
The KV cache, especially with very large context settings, can dominate VRAM usage before generation begins.
lmxd improves survivability by admitting agents through a mutex-protected VRAM ledger before loading models.
A single long-lived backend avoids some duplicate process and CUDA-context overhead from the naive multi-terminal approach.
Inactive agents are suspended by moving KV state to host memory, so multiple agents can remain usable even though only one context is live on the GPU.
The layer-streaming demo proves an overlap primitive, not a complete streamed llama.cpp implementation.

The small miracle here is not that three local agents become truly parallel in the data-center sense. They do not. The miracle is more practical: they stop killing each other. For developers trying to build useful agent workflows on aging hardware, that may be the more important breakthrough.
The next phase of local AI will not be won only by larger models or faster kernels; it will be won by runtimes that understand scarcity, make admission decisions before catastrophe, and expose enough state for humans to trust them. lmxd is a rough, narrow, bare-metal answer to that problem, but it points in the right direction: the future local stack needs a scheduler at least as much as it needs another benchmark.

References

Primary source: Towards Data Science (towardsdatascience.com)
Published: 2026-06-25T15:00:19.563131
3 Agents. 3 LLMs. 1 Aging GPU: Engineering Parallel Inference on Bare Metal | Towards Data Science
Beat the 8GB VRAM limit. Learn how to run three different LLMs on a single 8GB GPU using C++ layer multiplexing and admission control.

Related coverage: amd.com (www.amd.com)
Accelerating Generative LLMs Inference with Parallel Draft Models (PARD)
Boost LLM inference with PARD, a parallel draft model delivering up to 4.87× speedup on AMD Instinct MI250 GPUs using speculative decoding

Related coverage: hyperstack.cloud (www.hyperstack.cloud)
Why Multi-GPU Inference Is So Complex | Hyperstack
Discover the challenges of multi-GPU inference and learn how to optimise performance for LLMs in production environments.

Related coverage: learning.rc.virginia.edu
RC Learning Portal
Multi-GPU LLM Inference, RC Workshop. Kathryn Linehan, Bruce Rushing. June 3rd, 2025.

Related coverage: digitalocean.com (www.digitalocean.com)
Splitting LLMs Across Multiple GPUs: Techniques, Tools, and Best Practices | DigitalOcean
Learn how to split large language models (LLMs) across multiple GPUs using top techniques, tools, and best practices for efficient distributed training.

Official source: microsoft.com (www.microsoft.com)
Splitwise improves GPU usage by splitting LLM inference phases - Microsoft Research
Expanded LLM use creates new demands on cloud GPU capacity. Splitwise presents an efficient solution by separating the two essential phases of LLM inference, achieving higher throughput within a limited power budget.
