
OpenAI’s new open-weight model suite landed squarely in the spotlight — and when I ran the smaller gpt-oss:20b through a real-world school test designed for 10‑ and 11‑year‑olds, the model proved interestingly capable on paper, but ultimately fell short of beating an actual 10‑year‑old at their own exam.

Background / Overview

OpenAI recently released two open-weight reasoning models — gpt-oss‑120b and gpt-oss‑20b — with a clear aim: make capable reasoning models available for local and on-device use under the permissive Apache 2.0 license. The smaller gpt-oss‑20b is explicitly optimized to run in much lighter environments (reported to require ~16 GB of memory), while the larger 120B variant targets heavier workloads on a single 80 GB-class GPU. OpenAI’s official model announcement and model card describe native support for long contexts (up to 128k tokens), chain‑of‑thought (CoT) reasoning, and agentic tool use such as web browsing and function calling. (openai.com, huggingface.co)
Windows Central’s hands-on test used the gpt-oss:20b model pulled into Ollama and asked it to solve a sample UK 11+ practice paper — a real exam-style problem set aimed at 10- to 11-year-olds. The reporter fed the PDF into the model’s context and asked it simply to “read the test and answer all questions.” The results were revealing: the model displayed clear reasoning traces in its internal chain-of-thought, but its final answers were often wrong, irrelevant, or transformed into something entirely different from what the exam asked. That mismatch between internal reasoning and final output is the central practical story here.

How the experiment was run (short summary)

  • Hardware used: a consumer gaming PC with an Nvidia GeForce RTX 5080 (16 GB VRAM), with the system offloading significant work to the CPU and system RAM. The author noted the GPU handled only about 65% of the workload at default settings.
  • Software: the model was run via Ollama with the test PDF attached in the context window; context lengths were adjusted between runs (8k → 32k → 128k) to observe behavior changes, as in the scripted sketch after this list. Hugging Face and Ollama are listed among recommended deployment routes for gpt-oss models.
  • Prompt: a single instruction to read the sample paper and answer all questions — no explicit instruction to show workings or use a specific answer channel. The author found chain-of-thought reasoning traces in the model’s output, but these were not consistently reflected in the “final” answers the model delivered.
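For anyone who wants to script the same sweep rather than toggling settings by hand, here is a minimal sketch of re-running one prompt at the three context lengths via the Ollama Python client (pip install ollama). The extracted-text filename is a hypothetical stand-in, since Ollama does not ingest PDFs directly, and exact response handling can differ slightly across client versions.

```python
# Minimal sketch: rerun the same exam prompt at three context lengths via the
# Ollama Python client. Assumes gpt-oss:20b has already been pulled and the
# exam paper has been extracted to plain text (filename is hypothetical).
import ollama

EXAM_TEXT = open("eleven_plus_sample.txt", encoding="utf-8").read()

for num_ctx in (8192, 32768, 131072):  # 8k -> 32k -> 128k, mirroring the article's runs
    response = ollama.chat(
        model="gpt-oss:20b",
        messages=[{
            "role": "user",
            "content": "Read the test and answer all questions.\n\n" + EXAM_TEXT,
        }],
        options={"num_ctx": num_ctx},  # context window for this run
    )
    print(f"--- num_ctx={num_ctx} ---")
    print(response["message"]["content"][:500])  # first chunk of the delivered answer
```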

What happened in the test: results and the curious “reasoning vs output” gap

The Windows Central reporter ran the test multiple times and reported the following practical findings:
  • First run: after ~15 minutes of internal thinking, the model returned 80 answers for 80 questions but only ~9 were correct. Many outputs were irrelevant to the corresponding questions. The reasoning buffer (internal CoT) often contained correct step-by-step logic, yet the final answer lines were wrong or outright non sequiturs.
  • Second run (longer context / larger memory allocation): increasing the model’s context length improved internal reasoning and performance on certain question types (e.g., number sequences). But the model still often produced nonsense as the final visible output — even creating its own quiz instead of answering the exam questions. The author suspected the model was reasoning in a separate channel but failing to commit that reasoning to the expected final answer channel.
If you step back, two separate but related phenomena emerge:
  1. Internal competence: the model’s chain-of-thought often looked like a child’s correct working — it could parse constraints, list possibilities, and reason to a solution.
  2. External output failure: that internal reasoning did not consistently map to correct “final” outputs the user sees, producing an outcome far poorer than the internal reasoning suggests.
This disconnect is surprising at first, but it maps onto how the new gpt-oss models were designed: they expose internal CoT channels and use a structured “harmony” response format in which different channels (analysis vs final) are meant for separate internal and external uses. If the caller or deployment wrapper mishandles or misformats these channels, internally correct analyses can fail to appear in the public-facing output. OpenAI’s documentation notes that the full CoT is exposed for debugging and development and is not intended as the standard end-user response; only the final channel is meant to reach users. This design detail likely explains the experiment’s paradox of largely correct reasoning traces paired with poor final answers. (openai.com, huggingface.co)
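To make the channel separation concrete, here is an illustrative sketch of splitting harmony-style output into its channels. The channel and message token names follow OpenAI’s published harmony format, but the sample string and the split_channels helper are invented for illustration; real deployments should rely on the reference harmony renderers rather than ad-hoc parsing.

```python
# Illustrative only: separate harmony-style channels (analysis / commentary / final)
# from raw model output. The sample string below is invented for demonstration.
import re

raw = (
    "<|channel|>analysis<|message|>The sequence rises by 3 each time, "
    "so the next term is 22.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>22<|return|>"
)

def split_channels(text: str) -> dict:
    """Collect message bodies keyed by channel name."""
    channels = {}
    pattern = r"<\|channel\|>(\w+)<\|message\|>(.*?)(?:<\|end\|>|<\|return\|>|$)"
    for name, body in re.findall(pattern, text, re.S):
        channels.setdefault(name, []).append(body.strip())
    return channels

parsed = split_channels(raw)
print(parsed["final"])     # what the user should see: ['22']
print(parsed["analysis"])  # developer-facing reasoning trace
```

If the “final” bucket comes back empty or off-topic while “analysis” contains a clean solution, the problem sits in the rendering pipeline rather than in the model’s reasoning.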

Technical verification: key specs and claims

Below are the major technical claims from the hands-on test, verified against public documentation:
  • gpt-oss:20b is an official OpenAI open-weight release suitable for local inference and can run within ~16 GB of memory. This is corroborated by OpenAI’s launch post and the public Hugging Face model card. (openai.com, huggingface.co)
  • The models support very long context windows (OpenAI advertises up to 128k tokens natively). That long‑context capability is part of the reason the author tried raising the context window to 32k and 128k during experiments.
  • The models expose chain-of-thought (“full CoT”) channels and use the “harmony” response format (analysis/commentary/final). Mishandling those channels will show internal reasoning that is not presented as the final answer. OpenAI explicitly documents this behavior. (openai.com, huggingface.co)
  • Quantization and MoE: OpenAI documents that the gpt-oss family uses MoE (mixture-of-experts) with MXFP4 quantization for the MoE layers to reduce memory footprint; the model’s active parameters per token are far lower than its total parameters (roughly 3.6B active out of ~21B total for the 20B model) because MoE routes only portions of the network per token. These architectural choices are what make 20B workable on 16 GB hardware; a rough back-of-envelope memory estimate follows the verification note below. (openai.com, huggingface.co)
  • Hardware reality: mainstream reviews of the RTX 5080 show that many consumer RTX 50‑series cards ship with 16 GB of VRAM, and that 16 GB is typical for the 5080 — sufficient for running the 20B model but tight for larger context sizes or full precision. The RTX 5090, meanwhile, is marketed with 32 GB of GDDR7 VRAM which better suits larger models and larger context windows. If you want consistent local performance with large contexts, a 32 GB GPU (like many 5090 configs) remains the more robust choice. (pcworld.com, nvidia.com)
Because these claims are central to whether hobbyists can run gpt-oss:20b on a gaming PC, I verified them against OpenAI’s model card, the Hugging Face model page, and independent hardware reviews. The verification shows that the broad claims in the Windows Central test are consistent with public tech documentation, with the caveat that real‑world performance depends heavily on quantization, inference backend, and how you configure the model’s reasoning channels. (openai.com, huggingface.co, pcworld.com)
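As a sanity check on the “~16 GB” claim, here is a rough back-of-envelope estimate. The ~21B parameter count comes from OpenAI’s announcement; the MoE fraction and precision split are illustrative assumptions, not published figures.

```python
# Rough back-of-envelope memory estimate for gpt-oss-20b. The total parameter
# count (~21B) is from OpenAI's announcement; moe_fraction and the precision
# split are assumptions for illustration, not documented values.
total_params = 21e9
moe_fraction = 0.9     # ASSUMPTION: most weights live in the MXFP4-quantized MoE layers
bits_mxfp4   = 4.25    # ~4-bit values plus shared scaling factors
bits_bf16    = 16      # attention/embedding weights kept at higher precision

weights_gb = total_params * (moe_fraction * bits_mxfp4 +
                             (1 - moe_fraction) * bits_bf16) / 8 / 1e9
print(f"quantized weights: ~{weights_gb:.1f} GB")  # lands in the low-to-mid teens of GB

# The KV cache grows roughly linearly with context length, so a 128k-token window
# adds several more GB on top of the weights -- which is why a 16 GB card ends up
# offloading to the CPU and system RAM at the article's larger context settings.
```

The exact figure depends on the backend and quantization details, but the arithmetic shows why 16 GB is workable at short contexts and tight at 32k to 128k.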

Why the model reasoned correctly but answered incorrectly: a measured analysis

There are several plausible, evidence-backed technical explanations for the “reasoned correctly but answered wrongly” effect Windows Central saw:
  • Harmony channels and CoT leakage: gpt-oss exposes internal chain-of-thought in an “analysis” channel that is not the same as the “final” user-facing channel. If your wrapper (Ollama in this case) doesn’t map those channels correctly to the final output or if the prompt format was too blunt (just “answer everything”), the model can produce internal analysis without committing it to the user output. OpenAI’s docs make this channel separation explicit. (openai.com, huggingface.co)
  • Prompt / instruction design: The experiment’s prompt asked the model to “read the test and answer all questions” but did not explicitly ask for “show your workings and then give final answers” or to “return final answers in the final channel.” Small differences in instruction can change how the harmony renderer composes the final output.
  • Context window and memory pressure: When the context is too short, the model may truncate parts of the exam or its own reasoning. When the author increased the context to 32k and 128k, internal reasoning improved, but inference throughput dropped dramatically because the RTX 5080 (16 GB) could not keep the full model and cache resident, forcing offload to the CPU and system RAM. That offloading increases the chance of inference artifacts and timeouts and makes the routing of final outputs more fragile. The Windows Central tokens-per-second figures (~9 tokens/s at 128k, ~42 at 8k, ~82 at 4k on that hardware) underscore how quickly larger contexts raise the cost.
  • Inference backend behavior (Ollama and quantization): Different backends (vLLM, Ollama, llama.cpp) handle the harmony format and streaming differently. If the backend streams partial outputs or misinterprets the harmony channels when converting internal messages to console output, you’ll see mismatches. OpenAI and Hugging Face docs advise using the harmony format and provide reference renderers for Python/Rust to reduce these errors; a minimal final-versus-trace check is sketched after the summary below. (openai.com, huggingface.co)
In short: the model’s reasoning capability is real, but producing a correct, clean, end-user answer requires careful prompt engineering, correct handling of the harmony channels, appropriate context sizing, and a backend that doesn’t drop or misroute channels.
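One cheap diagnostic before blaming the model: ask the backend to surface the reasoning trace and compare it with the delivered answer. The sketch below assumes a recent Ollama build and Python client that return a separate thinking field when think=True is requested; field names and support vary by version, so treat it as a starting point rather than a guaranteed API.

```python
# Consistency check: does the delivered answer match what the reasoning trace
# concluded? Assumes a recent Ollama build/client that returns a separate
# "thinking" field when think=True is passed; verify against your versions.
import ollama

QUESTION = "What is the next number in the sequence: 2, 5, 8, 11, ...?"

resp = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": QUESTION}],
    think=True,                 # ask the backend to expose the reasoning trace
    options={"num_ctx": 8192},
)

final_answer = resp.message.content                        # user-facing output
reasoning = getattr(resp.message, "thinking", None) or ""  # developer-facing trace, if exposed

print("FINAL:", final_answer)
print("TRACE:", reasoning[:300])

# If the trace clearly settles on one answer but the final text says something
# else (or wanders into an unrelated quiz), suspect the channel-to-output
# mapping and prompt format before doubting the model's reasoning.
```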

Strengths observed — what gpt-oss:20b can do well

  • Readable, stepwise reasoning: The model demonstrates coherent chain-of-thought that mirrors human-style problem solving — useful for debugging and for developers who need to inspect the model’s internal logic.
  • Local, offline inference possibility: The 20B variant was explicitly designed to run on modest hardware (e.g., 16GB setups), enabling higher privacy and offline usage scenarios for edge applications. That democratizes experimentation outside of cloud APIs.
  • Tooling and flexibility: Native support for agentic tool use (web search, function calling, Python execution) makes these models practical building blocks for more complex agent workflows — assuming the developer wires them correctly.
  • Configurable reasoning levels: The model supports “low / medium / high” reasoning effort settings, letting users favor speed over depth or vice versa — valuable for apps that must balance latency and correctness (a short example follows this list).
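For gpt-oss, the effort level is conveyed through the system prompt per OpenAI’s model card; the helper below is a minimal sketch of that dial, and the exact phrasing or handling may vary by backend.

```python
# Sketch of the low/medium/high reasoning-effort dial. For gpt-oss the effort
# level is set via the system prompt (e.g. "Reasoning: high"); confirm the
# exact convention against your backend's documentation.
import ollama

def ask(question: str, effort: str = "medium") -> str:
    resp = ollama.chat(
        model="gpt-oss:20b",
        messages=[
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.message.content

# Low effort favors latency; high effort favors depth (and burns more time/tokens).
print(ask("What is 37 x 43?", effort="low"))
print(ask("What is 37 x 43?", effort="high"))
```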

Risks, limitations, and practical cautions

  • Open-weight distribution risk: OpenAI warns that open-weight models introduce new risks: bad actors can fine-tune or modify weights to remove safety constraints. Once the weights are public, OpenAI cannot force centralized policy controls. This is a material shift from API-only models. Deployers must plan for guardrails, monitoring, and content filters.
  • CoT exposure: The chain-of-thought channel is powerful for debugging but can expose internal reasoning that may be harmful or misused if presented verbatim to end users. OpenAI’s guidance is that CoT is for developer-level debugging and not necessarily user-facing. Treat it as a sensitive output channel.
  • Hardware mismatch and performance cliffs: Running 20B on a 16 GB gaming card is possible but fragile. Context scaling (to 32k/128k) and high‑precision defaults can force CPU offloading and long inference times — impractical for interactive use on modest hardware. If latency or reliability matter, prefer a GPU with 32 GB of VRAM or a cloud-based H100 or equivalent. Hardware reviewers confirm the practical VRAM needs for large inference workloads. (pcworld.com, techspot.com)
  • Deployment fragility across backends: Not every inference backend preserves the harmony format or channels identically. If you need production reliability, validate the entire rendering pipeline (prompt → model → renderer → user output), and incorporate tests to ensure that “analysis” content is not leaking into end-user channels incorrectly.

Practical advice: how to reproduce a cleaner test run and avoid the mistakes observed

  1. Install a supported inference backend (Ollama, vLLM, or native PyTorch example from OpenAI). (huggingface.co, openai.com)
  2. Pull the model via the documented commands (e.g., ollama pull gpt-oss:20b) and ensure Ollama is up-to-date.
  3. Use the harmony response format explicitly in your system prompt. For example:
    • System: “Render outputs with channels: analysis (internal, not shown), final (user-facing). Only return clear final answers in the final channel.”
    • User: “Attached is an 11+ PDF. For each question, provide the final answer only in the final channel. If you need to show working, include it in analysis but do not expose analysis to the user.”
  4. Start with medium reasoning effort, then increase to high only if you can tolerate the latency.
  5. If using limited VRAM (16 GB):
    • Use native quantized weights where available (MXFP4) and run with a backend optimized for quantized inference (vLLM / ONNX / TensorRT-LLM).
    • Keep context windows as small as possible for interactive tasks (4k–8k) and only escalate context length for batch/offline runs. The Windows Central tests showed steep throughput drops as context increased.
  6. Validate outputs by comparing the model’s final channel to its internal analysis: if the analysis reaches a different answer than the final output, assume your renderer is misconfigured. Fix the mapping and rerun. A compact sketch combining steps 2 through 6 follows below.
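Putting the list together, here is a compact sketch of a cleaner rerun: an explicit instruction about what belongs in the final answer, a modest context window, and a check that the trace and the delivered answers agree. The system-prompt wording and the extracted-text filename are illustrative assumptions, and the think/thinking fields depend on your Ollama version.

```python
# Compact rerun sketch tying steps 2-6 together. System-prompt wording and the
# extracted-text filename are illustrative; think/thinking support depends on
# the Ollama version in use.
import ollama

SYSTEM = ("Reasoning: medium\n"
          "Work through each question internally, then return only a numbered "
          "list of final answers to the user.")

exam_text = open("eleven_plus_sample.txt", encoding="utf-8").read()

resp = ollama.chat(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Answer every question in this paper:\n\n" + exam_text},
    ],
    think=True,                 # surface the reasoning trace if the backend supports it
    options={"num_ctx": 8192},  # keep the window small for interactive runs (step 5)
)

final = resp.message.content
trace = getattr(resp.message, "thinking", None) or ""

# Step 6: if the trace settles on answers that the numbered list doesn't contain,
# fix the channel/renderer mapping before blaming the model.
print(final)
```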

Where gpt-oss:20b fits in the local-model ecosystem (and how it compares)

  • Compared to small open models (Llama 2, smaller Gemma 3 variants), gpt-oss:20b offers a middle ground: stronger reasoning than tiny models, but far smaller and more locally friendly than the largest 100B+ models. OpenAI benchmarks present gpt-oss-20b as comparable to “o3‑mini” on common tasks while enabling local runs on modest hardware. (openai.com, huggingface.co)
  • Competing portable model families like Gemma 3 (Google) provide competitive alternatives; Gemma 3 touts strong single-GPU performance and 128k context support for some variants — but specific model sizes and memory footprints vary, and some smaller Gemma variants (e.g., Gemma 3:12B) can still be memory heavy in practice. Each model family has tradeoffs in latency, memory use, and reasoning accuracy. (blog.google, ollama.com)
  • If you’re prioritizing raw local responsiveness on constrained hardware, smaller models (e.g., Gemma 3 4B or Gemma 3 1B) or highly quantized LLaMA variants will generally deliver faster, more predictable latency than a 20B model at large context sizes. If you prioritize on‑device reasoning depth and can accept longer runtimes, gpt-oss:20b is a compelling option. (blog.google, huggingface.co)

Final assessment: is gpt-oss:20b “smarter than a 10‑year‑old”?

The answer depends on how you define “smarter.”
  • If “smarter” = able to produce a correct final answer to a constrained set of 11+ questions reliably and quickly when used out-of-the-box, then no — in the Windows Central experiment the reporter’s 10‑year‑old outperformed the model’s delivered outputs. The model’s raw reasoning traces sometimes matched or exceeded a child’s problem‑solving, but the final output stream too often failed to reflect that reasoning.
  • If “smarter” = capable of human-like chain-of-thought reasoning and the ability to manipulate constraints and solve problems when properly prompted and deployed, then yes — the model has reasoning competence approaching or exceeding a typical 10‑year‑old in isolated internal chains of thought, provided you map analysis to final outputs correctly and accept slower runtimes or better hardware. The underlying reasoning capability is real and visible. (openai.com, huggingface.co)
So: the model’s raw cognitive machinery is strong, but the practical, user-facing behavior depends on deployment, prompt engineering, and hardware. The Windows Central experiment was a useful, real-world stress test that exposes how fragile the user experience can be if harmony channels, context sizes, and hardware constraints are not handled with care.

What this means for Windows enthusiasts, families, and developers

  • For parents and teachers: don’t worry — your child’s exam preparation is not yet under threat from a local hobbyist LLM. A human learner with domain knowledge and exam practice still outperforms the model in practical reliability for now. The model can be a useful study assistant (explaining steps, generating practice questions) but is not a safe substitute for reliable final answers unless carefully controlled.
  • For hobbyists: gpt-oss:20b is exciting — it’s the first time in years that OpenAI has shipped open-weight models, giving you direct, modifiable access to the weights, the CoT channel, and fully local inference. But expect to spend time on prompt engineering, renderer configuration, and hardware tuning before you see production-grade behavior. (openai.com, huggingface.co)
  • For enterprises: the open-weight release reduces barrier-to-entry for local, on-prem deployments — but it also raises governance questions. An open model means you must own your safety stack. OpenAI’s own model card explicitly warns about this increased risk profile and the responsibilities that come with open weights.

Closing thoughts

OpenAI’s gpt-oss models represent a major shift: the company has deliberately made powerful reasoning models available to the community, and that democratization is both exciting and complicated. The Windows Central hands-on showed precisely where the rubber meets the road in the current era of on-device LLMs: the model’s internal reasoning can be excellent, but that alone is not enough — robust, predictable, user‑facing behavior depends on correct handling of harmony channels, sane context window choices, compatible inference backends, and adequate hardware.
If you plan to run gpt-oss:20b at home, treat the model like a powerful, temperamental lab tool: read the model card, use a backend that supports the harmony format, validate rendered outputs against any internal analysis, and pick GPU memory that fits your worst-case (not average-case) context need. Do that, and you’ll be using one of the most interesting open LLMs released in years. (openai.com, huggingface.co)

Source: Windows Central I Asked OpenAI’s New Open-Source AI Model to Complete a Children’s School Test — Is It Smarter Than a 10-Year-Old?