Software engineer Jeff Schomay’s weekend experiment—feeding an old-school, ASCII-based roguelike called Thunder Lizard into modern generative models and streaming back full-motion, photoreal-style frames—reads like a proof‑of‑concept for what many game developers quietly hope AI can do: turn tiny, symbolic input into rich visuals on the fly. The demo is charming and disorienting in equal measure: you can still play the original ASCII engine, but the screen is now a parade of AI-rendered scenes that reinterpret those characters as dinosaurs, foliage, and lava-lit vistas. The catch is the brutal practicality of the compromise: to make the visuals “playable” Schomay says he ended up with an AI path that yields roughly 10 frames per second, requires tough model and pipeline tradeoffs, and exposes fundamental problems—latency, frame-to-frame consistency, and cost—that keep this approach far from production-ready for anything but novelty projects. (blog.jeffschomay.com)
Background: why this matters now
The idea of using AI not just to create static images but to generate visuals at runtime has leapt from research labs into public demos over the last 18 months. Microsoft’s WHAMM (World and Human Action MaskGIT Model) and its Quake II demo showed that a world model can be trained to accept player inputs and produce plausible gameplay frames in real time, but that demo also exposed the gap between “cool demo” and “usable game”: WHAMM can run at roughly 10 frames per second for a constrained level, but it suffers from input lag, short context windows, and fuzzy enemy interactions. The model is a leap technically—MaskGIT-style parallel token generation lets it produce frames much faster than earlier autoregressive approaches—but it still behaves like an approximation of the game world rather than a faithful engine. (microsoft.com, tomshardware.com)
At the same time, companies such as Fal.ai are building inference stacks specifically engineered to drive real‑time generative workloads—optimizing batching, warm-up, and efficient GPU routing to cut response times and make image-to-image or text-to-image generation usable inside apps. Their public documentation and performance notes show this is a practical direction: with careful engineering you can bring model inference end-to-end into the hundreds of milliseconds for some image tasks, and Fal’s realtime stack explicitly sells itself on enabling interactive creative apps that require low-latency media generation. That said, claims of “real-time” are conditional—real-time in a developer demo does not mean real-time for competitive gaming or fast-paced action titles. (fal.ai, blog.fal.ai)
These two trends—research prototypes that generate frames from gameplay data, and infrastructure platforms optimized for low-latency inference—meet in Schomay’s experiment. His work is the practical, indie-developer counterpart to WHAMM and Fal.ai: a small game, an off-the-shelf (and custom-tuned) model, and a focus on what’s needed to make the experience feel interactive rather than a slideshow.
The demo: Thunder Lizard, ASCII, and an AI rendering pipeline
What Thunder Lizard is (and why it’s a smart test case)
Thunder Lizard is a compact ASCII roguelike: simple input (cursor keys), a tiny visual vocabulary (characters and colors), and straightforward mechanics (eat smaller dinosaurs, avoid larger ones, grow, and outrun a volcano). That makes it an ideal stress test for runtime image generation, because:
- The game state is small and discrete: positions of entities and tile types compress easily for model conditioning (see the payload sketch after this list).
- Visual fidelity expectations are low: even modest generative outputs drastically change the player’s perception.
- The core loop is short and repeatable, so pipeline artifacts and failure modes appear quickly.
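To make the first point concrete, here is a minimal sketch of what a compact conditioning payload for such a game could look like. The field names and layout are illustrative assumptions, not Schomay’s actual format:

```ts
// Hypothetical compact game-state payload for conditioning an image model.
// Field names and layout are illustrative, not Schomay's actual format.
type Entity = { kind: "dino" | "player" | "volcano"; x: number; y: number; size: number };

interface FrameState {
  tick: number;       // monotonically increasing frame counter
  tiles: string[];    // one string per map row, e.g. "..~~^^.."
  entities: Entity[]; // discrete positions, easy to diff frame-to-frame
}

// A 40x20 ASCII map plus a handful of entities serializes to well under
// 1 KB, far cheaper to ship than a rasterized frame.
function encodeState(state: FrameState): string {
  return JSON.stringify(state);
}
```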
The pipeline Schomay experimented with
While Schomay’s blog post walks through many model trials, the practical playability choice landed on a fast Fal.ai endpoint. The rough pipeline, sketched in code after this list, is:
- Capture an ASCII frame or minimal scene rasterization from the game engine.
- Send the source image (or a compact representation) to a real-time image-to-image or image-enhancement model hosted on a low-latency endpoint.
- Receive the generated frame and composite or stream it to the player, synchronizing inputs so gameplay remains as responsive as possible.
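A minimal browser-side sketch of that loop follows. The endpoint URL, prompt, and request shape are placeholders for illustration, not the actual Fal.ai API:

```ts
// Minimal sketch of the capture -> generate -> composite loop in a browser.
// ENDPOINT stands in for a low-latency hosted image-to-image model; it does
// not reproduce the actual Fal.ai API shape.
const ENDPOINT = "https://example.com/realtime/img2img"; // hypothetical

async function renderLoop(game: HTMLCanvasElement, output: CanvasRenderingContext2D) {
  for (;;) {
    // 1. Capture the engine's rasterized ASCII frame as a PNG blob.
    const source = await new Promise<Blob>((resolve) =>
      game.toBlob((b) => resolve(b!), "image/png")
    );

    // 2. Send it for generation; the prompt and form fields are illustrative.
    const form = new FormData();
    form.append("image", source);
    form.append("prompt", "lush prehistoric jungle, dinosaurs, lava, photoreal");
    const response = await fetch(ENDPOINT, { method: "POST", body: form });

    // 3. Composite the generated frame. Input handling stays in the ASCII
    // engine, so gameplay remains authoritative even if a frame arrives late.
    const frame = await createImageBitmap(await response.blob());
    output.drawImage(frame, 0, 0);
  }
}
```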
Key results and the uncomfortable tradeoffs
Performance: 10 fps and a suspicious 1 ms claim
The most eye-catching numbers from the writeups and subsequent press coverage are ~10 fps for the generated visuals and, in some phrasing, a claimed 1 ms latency. These two claims don’t sit well together when you parse them technically.
- 10 frames per second is playable only in a relaxed sense. It’s fine for proof-of-concept demos and some turn‑based or slow-action games, but it’s far below the 30–60 fps most players expect for real-time action. Microsoft openly reported similar plateaus for WHAMM (10+ fps for Quake II’s level in their demo). (microsoft.com)
- A 1 millisecond end-to-end latency for an image generation pipeline is implausible at present. Even aggressive, optimized image inference often measures GPU compute times in the tens to hundreds of milliseconds, and network plus queuing easily add tens to hundreds more. Fal.ai’s own performance reporting shows realistic image-to-image inference in the low hundreds of milliseconds for optimized endpoints, with GPU inference times sometimes around 120 ms for quick models and end-to-end times of several hundred milliseconds depending on steps and model choice. Claims of single-digit-millisecond total latency are effectively impossible unless the “1 ms” refers to a narrowly scoped internal metric (for example, a tiny model’s GPU kernel time under specific micro-benchmarks) rather than the player’s real-world experience. That distinction matters: single-millisecond GPU kernel times do not translate to single-millisecond playable latency once you include encoding, transport, and rendering (a back-of-envelope budget follows). (blog.fal.ai, okeiai.com)
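To see why, here is a rough latency budget. Every value is an assumption in the ranges discussed above, not a measurement from Schomay’s pipeline:

```ts
// Back-of-envelope end-to-end latency budget, in milliseconds. Every value
// is an assumption in the ranges discussed above, not a measurement.
const budget = {
  captureAndEncode: 5,    // rasterize + PNG encode
  networkUplink: 30,      // client -> inference region
  queueAndScheduling: 20, // even on a warm endpoint
  gpuInference: 120,      // fast img2img model, few steps
  networkDownlink: 30,    // generated frame back to the client
  decodeAndComposite: 10, // decode + draw
};

const total = Object.values(budget).reduce((a, b) => a + b, 0);
console.log(`~${total} ms end-to-end`); // ~215 ms: two orders of magnitude above 1 ms
```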
Per-frame visual quality vs. temporal consistency
Generative models are excellent at inventing details, but they struggle to preserve exact continuity across frames. Schomay’s samples make this obvious: you’ll see gorgeous frames that reinterpret the map beautifully, but the same object can shift appearance or position between frames, objects can flicker in and out of existence, and animated movement sometimes looks like temporal collage rather than true motion.
Why this happens:
- Most image-to-image models optimize per-frame realism, not temporal coherence.
- Conditioning signals (e.g., the ASCII frame rasterization) are lightweight and often underspecified—models “hallucinate” plausible content that fits the prompt but doesn’t strictly obey world state.
- Even models designed for video or frame-to-frame consistency (latent consistency models, temporal diffusion, MaskGIT variants) rely on longer context and higher compute budgets to preserve identity across frames (a sketch of common request-level mitigations follows this list).
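None of these failure modes has a complete fix today, but a common partial mitigation is to condition each request on the previous generated frame, fix the random seed, and keep denoising strength low. The parameter names below are illustrative, not a specific model’s API:

```ts
// Common partial mitigations for frame-to-frame flicker, expressed as
// request parameters. The names are illustrative, not a specific model's API.
interface Img2ImgRequest {
  initImage: Blob;    // previous *generated* frame, not the raw ASCII raster
  controlImage: Blob; // current ASCII raster as the structural condition
  seed: number;       // fixed seed so the noise doesn't change each frame
  strength: number;   // low denoising strength preserves prior appearance
}

function buildRequest(previousFrame: Blob, asciiRaster: Blob): Img2ImgRequest {
  return { initImage: previousFrame, controlImage: asciiRaster, seed: 42, strength: 0.35 };
}
```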
Cost and infrastructure realities
Real-time generation changes the economics of games drastically (a back-of-envelope calculation follows this list):
- Instead of paying a one-time asset cost (artist + storage), you pay per-frame inference or maintain heavy GPU capacity for sub-second responses.
- Running thousands of concurrent players means provisioning low-latency GPU capacity near users (edge or regional deployments), which remains expensive.
- For indie devs, consumer-grade GPUs on a single machine can’t approach the throughput needed for per-player generative rendering—cloud-backed inference or model distillation are the practical routes.
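For a sense of scale, here is a rough per-player cost calculation. The per-frame price is an illustrative assumption, not a quoted rate from any provider:

```ts
// Rough per-player cost of generative rendering. The per-frame price is an
// illustrative assumption, not a quoted rate from any provider.
const fps = 10;
const costPerFrameUSD = 0.002; // assumption
const sessionMinutes = 30;

const framesPerSession = fps * 60 * sessionMinutes;        // 18,000 frames
const costPerSession = framesPerSession * costPerFrameUSD; // $36 per session
console.log(`~$${costPerSession.toFixed(2)} per ${sessionMinutes}-minute session`);
```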
Why Schomay’s approach still matters (the strengths)
Don’t let the caveats obscure the positives. Schomay’s experiment surfaces several strengths that make the technique worth watching:
- Rapid creative iteration: Indie developers can prototype a new look or aesthetic instantly without commissioning whole sprite sheets or 3D models. Schomay’s pipeline turned ASCII cells into dramatically different visual themes with a few model tweaks. (blog.jeffschomay.com)
- Accessibility and democratization: Small teams and solo devs can experiment with visual styles previously off-limits due to artist costs.
- New gameplay opportunities: Procedural, on-the-fly reimagining of scenes could be a feature, not a bug—games might intentionally shift visual style, making the AI’s generation part of the experience (narrative hallucination, dream sequences, or player-driven aesthetic modes).
- A real testing ground for research: Real games with live inputs provide a difficult stress test for models and inference systems—exactly the kind of testbed researchers need to make these systems better faster.
The risks and technical obstacles that still demand work
Frame-to-frame consistency and deterministic identity
Players expect objects to remain recognizable. Current diffusion-based or transformer-based image generators usually lack the long-term statefulness required to preserve identity across many frames. Research into latent diffusion models with temporal conditioning, explicit memory tokens, or model hybrids (neural renderer + asset store) is promising, but it’s still a work in progress. Microsoft’s WHAMM team and other labs are addressing this, but the limitations—fuzziness of enemies and short context—are unresolved for demanding gameplay. (microsoft.com)
Input latency and perceived responsiveness
Even if your pipeline can produce a frame in 100–300 ms, network jitter, queuing, encoding/decoding, and browser or compositor latency add to the total. For fast action games, players require frame-to-action latency in the tens of milliseconds to feel responsive. Until real-time generated frames regularly live in that window, hybrid approaches (neural upscaling, assistive generative overlays, or procedural enhancement of traditional renderers) will be far more practical for mainstream titles. Fal.ai’s optimizations move toward these windows, but they do not close the gap alone. (blog.fal.ai, fal.ai)
Hallucinations, correctness, and gameplay integrity
Generative models are optimized for plausible content, not correct content. That mismatch matters in a game where collision boxes, hit locations, and state must be precise. Schomay’s project cleverly separates logic (ASCII engine) from visuals, but if an AI visual is used to drive gameplay decisions or occlusions, hallucinations introduce bugs. The safer path is to keep logic deterministic and have AI render on top of authoritative state—a design Schomay follows, but one that reduces how “integrated” generative visuals feel.
Legal, ethical, and copyright issues
Real-time generative models are usually trained on broad collections of images. That raises questions around copyrighted art, dataset provenance, and the risk of producing outputs that echo protected styles or assets. Platforms like Fal.ai emphasize developer control and private model hosting, but legal clarity—especially for commercial games distributed at scale—is still murky. Developers must mitigate risk via provenance, licensing, and careful model curation.
Practical advice for developers who want to try this
If you’re an indie dev tempted to recreate Schomay’s experiment, here’s a pragmatic checklist distilled from what worked and what didn’t:
- Start small and separate concerns.
- Keep game logic in a deterministic engine and use the AI strictly for visuals or atmosphere.
- Target slow-motion or stylistic games first.
- Turn-based, card, or slow‑paced roguelikes tolerate 10–15 fps visuals more readily than twitch shooters.
- Choose the right inference platform for your needs.
- Use low-latency endpoints designed for interactive usage; Fal.ai’s realtime offering is specifically positioned for this use-case. (fal.ai)
- Cache aggressively.
- Pre-generate likely frames, use local caching, and warm endpoints to minimize cold-start delays (see the caching sketch after this list).
- Design gameplay around the model’s limits.
- Expect occasional frame inconsistency; make that part of the aesthetic or provide fallback UI so players aren’t disoriented.
- Optimize data paths.
- Send compact conditioning data rather than full raw frames when possible, and use binary protocols or WebSockets to reduce transport overhead (see the transport sketch after this list).
- Watch compute costs and instrument telemetry.
- Measure per-frame inference time, network latency, and GPU queue times, and make sure you can turn the feature off for budget-friendly modes (a telemetry sketch follows this list).
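For the caching item, here is a minimal sketch of a frame cache keyed by a hash of the conditioning payload, assuming standard browser APIs:

```ts
// Frame cache keyed by a hash of the conditioning payload, so identical
// game states reuse a previously generated frame. Browser APIs assumed.
const frameCache = new Map<string, ImageBitmap>();

async function stateKey(payload: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(payload));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}

async function getFrame(payload: string, generate: () => Promise<ImageBitmap>) {
  const key = await stateKey(payload);
  const hit = frameCache.get(key);
  if (hit) return hit;            // cache hit: zero inference cost
  const frame = await generate(); // cache miss: pay for one generation
  frameCache.set(key, frame);
  return frame;
}
```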
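For the data-path item, a sketch of shipping the compact payload over a persistent WebSocket instead of POSTing full frames. The endpoint URL and message shape are hypothetical:

```ts
// Shipping the compact conditioning payload over a persistent WebSocket
// instead of POSTing full frames. Endpoint and message shape are hypothetical.
const ws = new WebSocket("wss://example.com/generate"); // hypothetical endpoint
ws.binaryType = "arraybuffer";

function sendState(payload: string) {
  // A few hundred bytes of JSON vs. tens of kilobytes for a PNG of the frame.
  ws.send(new TextEncoder().encode(payload));
}

ws.onmessage = async (event) => {
  // The server streams back encoded frames; decode and hand to the compositor.
  const frame = await createImageBitmap(
    new Blob([event.data as ArrayBuffer], { type: "image/jpeg" })
  );
  // ...draw `frame` onto the output canvas here.
};
```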
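And for the telemetry item, a minimal timing wrapper. The server-reported header below is a hypothetical name, with client-side timing as the fallback:

```ts
// Minimal per-frame telemetry: wrap the generation call and record timings
// so latency regressions and cost spikes show up in aggregate.
interface FrameTiming { inferenceMs: number; totalMs: number }
const timings: FrameTiming[] = [];

async function timedGenerate(run: () => Promise<Response>): Promise<Response> {
  const t0 = performance.now();
  const response = await run();
  const totalMs = performance.now() - t0;
  // "x-inference-ms" is a hypothetical server-reported header; fall back to
  // the client-side measurement when it is absent.
  const inferenceMs = Number(response.headers.get("x-inference-ms") ?? totalMs);
  timings.push({ inferenceMs, totalMs });
  return response;
}
```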
Where the research and industry are heading
There are clear paths forward that make this future less speculative:
- Masked, parallel token generation and temporal consistency research (the WHAM → WHAMM progression) shows model architectures can evolve to target interactive rates, but training data design, curated testers, and purposeful context collection all matter—as Microsoft illustrated by re-designing training to focus on a single Quake II level. (microsoft.com)
- Inference stacks that combine batching, smart routing, and warm pools (Fal.ai’s inference engine) materially reduce end-to-end latency and make interactive apps more feasible. Expect more serverless and edge offerings aimed at game studios. (blog.fal.ai, fal.ai)
- Hybrid rendering pipelines will dominate near-term production: neural upscalers, neural texture compression, and AI-assisted asset generation will augment rather than replace rasterization or engine-driven rendering. NVIDIA and others are already pushing this direction with neural texture compression and the DLSS lineage—a trend covered across multiple industry discussions. (spectrum.ieee.org, pcgamer.com)
Conclusion: a useful demo, not a finished revolution
Jeff Schomay’s Thunder Lizard experiment is a tightly focused demonstration of what generative AI can add to a live game loop: instant creative variation, a rapid path from minimal art to lush visuals, and an approachable testbed for developer pipelines. It also exposes the practical hard edges of the problem—latency (the real-world total is far higher than any single internal metric), the brittleness of temporal coherence, and the compute/cost challenge of scaling.
The industry is making rapid progress: inference platforms like Fal.ai are closing the latency gap, and research projects such as Microsoft’s WHAMM show that trained world models can be interactive. Yet today’s demos—Schomay’s included—are best read as research-forward prototypes and creative experiments, not blueprints for replacing traditional rendering in mainstream, twitch-based games.
For devs, the sensible path is hybrid: retain deterministic game-state logic, experiment with AI rendering as an optional aesthetic layer, and design gameplay to tolerate the model’s quirks. For the research community and infrastructure providers, the priorities are clear: longer context windows, stricter temporal consistency mechanisms, and cheaper, geographically distributed inference to bring costs and latency down to gamer-friendly thresholds.
The practical reality is this: real-time AI rendering is here, but it’s evolving. The next 12–24 months will determine whether these techniques become a production tool or a persistent novelty. In the meantime, experiments like Thunder Lizard are invaluable—both as inspiration and as a reminder of the hard engineering still left to solve. (blog.jeffschomay.com, microsoft.com, fal.ai, blog.fal.ai)
Source: Tom's Hardware, “AI converts ASCII game to real-time AI-rendered graphics – Thunder Lizard ASCII visuals transformed, but latency and consistency need improvement”