79 tokens per second. That is the measured throughput of Qwen 3.6 on an Nvidia RTX 5070 Ti, a card you can buy at Best Buy for under $800. The model behind that number has 35 billion parameters in total but activates only 3 billion of them for each token it generates. Across multiple independent benchmarks posted this week, it outperforms Google's Gemma 4 at 26 billion parameters on agentic coding tasks. Users on r/LocalLLaMA are calling it the first local model worth daily-driving for real coding work. Fifteen independent posts appeared in 24 hours, the fastest community adoption since Meta released Llama 3.
The interesting part is not the benchmark. It is the architecture. Mixture-of-experts, or MoE, is a design in which a model contains many specialized sub-networks ("experts") and a routing layer activates only a small subset of them for each token. The result is a model that carries the knowledge breadth of 35 billion parameters while paying the per-token compute cost of 3 billion. This is not quantization, which trades accuracy for size. The savings are structural: the full model runs at full precision, and nothing is lost.
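That routing step can be sketched in a few lines. This is an illustrative toy, not Qwen's actual implementation; the dimensions, expert count, and top-k value are made-up numbers chosen to keep the example small:

```python
import math
import random

# Toy mixture-of-experts routing (illustrative only, not Qwen 3.6's
# real code). A gating layer scores every expert per token, but only
# the top-k experts actually run; the rest cost no compute.

random.seed(1)
DIM, NUM_EXPERTS, TOP_K = 8, 16, 2  # hypothetical sizes

# Each "expert" is a tiny feed-forward layer, here just a random matrix.
experts = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]
gate = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def moe_forward(x):
    # Score every expert with the gating layer...
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate]
    # ...but select only the top-k to run.
    top = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]
    # Softmax over the selected scores gives the mixing weights.
    exps = [math.exp(scores[e]) for e in top]
    total = sum(exps)
    probs = [v / total for v in exps]
    # Only the chosen experts do any matrix math for this token.
    out = [0.0] * DIM
    for e, p in zip(top, probs):
        y = [sum(w * xi for w, xi in zip(row, x)) for row in experts[e]]
        out = [o + p * yi for o, yi in zip(out, y)]
    return out, top

x = [random.gauss(0, 1) for _ in range(DIM)]
y, active = moe_forward(x)
print(f"ran {len(active)} of {NUM_EXPERTS} experts")
```

The 35B/3B split in Qwen's case is the same idea at scale: all experts exist in memory, but each token only pays for the few the router picks.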
A quiet flag in the llama.cpp inference engine makes this work on hardware it should not work on. The --n-cpu-moe parameter keeps expert weights in system RAM, leaving the GPU's VRAM for the attention layers, shared weights, and KV cache. This means a single consumer GPU with 16 gigabytes of VRAM can run the full 35-billion-parameter model with 128,000 tokens of context. A year ago, that workload meant renting an A100 from a cloud provider at several dollars per hour.
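A typical invocation might look like the following. The --n-cpu-moe flag is the one described above; the model filename, layer count, and other values are illustrative assumptions, not tested settings:

```shell
# Hypothetical example; the GGUF filename and the value 24 are illustrative.
# --n-cpu-moe N keeps the expert weights of the first N layers in system RAM,
# while -ngl 99 keeps everything else (attention, shared weights) on the GPU.
llama-server \
  -m qwen3.6-35b-a3b.gguf \
  -ngl 99 \
  --n-cpu-moe 24 \
  -c 128000
```

The tuning question is how many expert layers to push to RAM: more offloading frees VRAM for context at the cost of PCIe traffic and CPU compute.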
On the same day those benchmarks circulated, Cloudflare open-sourced a tool called Unweight. It compresses LLM weights by 15 to 22 percent with zero accuracy loss. On a Llama 3.1 8B model, that translates to roughly 3 gigabytes of saved VRAM on an H100. The word "lossless" is doing important work in that sentence. Previous compression techniques, quantization chief among them, always involved a tradeoff. Unweight does not. It is a free reduction in the memory footprint of any model you deploy.
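The article does not describe how Unweight achieves losslessness, but the general mechanism behind lossless weight compression is worth seeing: trained float16 weights cluster near zero, so their sign-and-exponent bytes are highly repetitive and entropy-code well. The sketch below demonstrates that idea with byte regrouping plus zlib. It is not Unweight's algorithm, and the savings it prints apply only to the synthetic weights it generates:

```python
import random
import struct
import zlib

# Sketch of one lossless weight-compression idea (byte regrouping +
# entropy coding). NOT Cloudflare's actual Unweight algorithm; it only
# shows why "lossless" is achievable on weight tensors at all.

random.seed(0)
# Fake "trained weights": small values clustered near zero, as in real nets.
weights = [random.gauss(0, 0.02) for _ in range(10_000)]
raw = b"".join(struct.pack("<e", w) for w in weights)  # float16 bytes

# Regroup the byte stream: all low bytes (noisy mantissa bits), then all
# high bytes (sign + exponent, which take few distinct values here).
lo = raw[0::2]
hi = raw[1::2]
packed = zlib.compress(lo + hi, level=9)

ratio = 1 - len(packed) / len(raw)
print(f"saved {ratio:.1%} of the original size")

# Losslessness check: reverse the transform and compare byte-for-byte.
restored = zlib.decompress(packed)
n = len(raw) // 2
rebuilt = bytes(b for pair in zip(restored[:n], restored[n:]) for b in pair)
assert rebuilt == raw  # bit-identical round trip: zero accuracy loss
```

The round-trip assertion is the whole point: unlike quantization, every bit of every weight comes back unchanged.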
These two developments, arriving on the same weekend, share a structural shape. They both reduce the hardware floor for running capable models without reducing capability. MoE does it architecturally. Unweight does it at the storage layer. Stack them together and the cost of running a frontier-adjacent coding agent drops from "cloud GPU rental" to "hardware you already own."
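A back-of-envelope budget shows why the two stack. The numbers below are illustrative assumptions (fp16 active weights, the midpoint of Unweight's reported savings range), not measurements of any real deployment:

```python
# Rough VRAM budget for the active working set (illustrative assumptions).
ACTIVE_PARAMS = 3e9          # MoE: only ~3B of 35B parameters per token
BYTES_PER_PARAM = 2          # fp16
UNWEIGHT_SAVINGS = 0.18      # midpoint of the reported 15-22% range

active_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9
after_compression = active_gb * (1 - UNWEIGHT_SAVINGS)
print(f"{active_gb:.1f} GB active -> {after_compression:.2f} GB compressed")
# What remains of a 16 GB card goes to the 128k-token KV cache and overhead.
```

MoE shrinks the active working set architecturally; compression shrinks whatever is left; the remainder of a consumer card absorbs the context window.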
This pattern has a historical precedent worth naming. In the early 2010s, the shift from on-premises servers to cloud computing was driven by a simple economic argument: why buy hardware when you can rent it by the hour? Amazon Web Services grew into an $80 billion business on that logic. But the logic only holds when the workload is bursty or when the capital cost of ownership is prohibitive. When utilization is high and the hardware is cheap enough, the math reverses. Dropbox famously saved $75 million over two years by pulling workloads off AWS and onto owned hardware. Basecamp did the same and wrote about it loudly.
AI inference is approaching its own repatriation moment. For the past three years, the assumption has been that serious AI workloads require cloud GPUs: H100s from Nvidia, rented through AWS, Azure, or Google Cloud, or accessed indirectly through API providers like OpenAI and Anthropic. That assumption was correct when capable models required 70 billion dense parameters and 80 gigabytes of VRAM. It is no longer correct when MoE models deliver equivalent capability at one tenth the active compute.
The implications compound in a specific direction. Coding agents are the highest-value, highest-frequency use case for LLMs right now. The Claude Code ecosystem has been trending for six consecutive days across Hacker News and GitHub. New tools are wrapping it in web GUIs, connecting it to browser DevTools via MCP, and building orchestration layers on top. All of that ecosystem activity assumes a cloud-hosted model behind the API. If the local alternative is genuinely competitive for coding tasks, the pricing pressure on API providers intensifies considerably.
This does not mean cloud inference disappears. Long-context research, multimodal reasoning, and tasks requiring the absolute frontier will stay on rented hardware for the foreseeable future. But coding, the use case that drives most of today's AI revenue, may be the first major workload to split. Some developers will keep paying for Claude or GPT subscriptions because the convenience is worth it. Others, particularly those running agents continuously or processing proprietary code they prefer not to send to a third party, will shift to local inference as the capability gap closes.
The Qwen 3.6 result matters not because one model had a good benchmark week. It matters because MoE architecture, combined with inference engine optimizations like CPU expert offloading, has crossed a threshold. The model running on your desk is no longer a toy version of the model running in the cloud. It is, for an increasing number of practical tasks, the same thing. The community already knows this. A 27 billion parameter variant of Qwen 3.6 won a community vote and is expected soon, optimized for even more modest hardware.
Watch the next six months for the pricing response. When local models are good enough for daily coding work, the value proposition of API subscriptions shifts from "capability you cannot get elsewhere" to "convenience and integration." That is a much harder product to charge a premium for.