BUILDER SIGNAL BRIEF

Friday, May 15, 2026


7.8x tokens per forward pass with zero output change: Orthrus raises the local LLM speed ceiling.

Top Signal
Orthrus-Qwen3-8B: 7.8x tokens/forward, provably identical output [research to practice]
r/LocalLLaMA
A researcher applied Orthrus heads to Qwen3-8B and reports 7.8x tokens per forward pass with a frozen backbone and an output distribution that is provably identical to the baseline's. This is not speculative decoding with a lossy draft model: nothing about what the model generates changes. That distinction is critical for production, because most speed techniques involve quality tradeoffs you have to measure per task. If the technique generalizes to other Qwen3 variants and lands in llama.cpp or MLX, it could cut local agentic coding latency by over 80% at zero quality cost (7.8 tokens per pass means roughly 87% fewer forward passes, provided each pass costs about the same as a baseline pass). The frozen backbone means no retraining, no alignment regression, and no pipeline changes. **What to do now:** Pull the repo, verify the identical-distribution claim on your specific task profile, and watch for the authors to publish weights for larger model sizes.
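The latency estimate is simple arithmetic once you fix how much one accelerated forward pass costs relative to a baseline pass. The post does not report that ratio, so it is an assumption here. A minimal sketch follows; the function name and the `pass_cost_ratio` parameter are illustrative, not from the Orthrus repo.

```python
def decode_latency_reduction(tokens_per_forward: float,
                             pass_cost_ratio: float = 1.0) -> float:
    """Estimate the fraction of decode latency removed versus a
    one-token-per-pass baseline.

    tokens_per_forward: average tokens produced per forward pass
        (7.8 in the post).
    pass_cost_ratio: ASSUMED cost of one accelerated pass relative to one
        baseline pass (1.0 means the extra heads add no overhead).
    """
    if tokens_per_forward <= 0 or pass_cost_ratio <= 0:
        raise ValueError("both arguments must be positive")
    # Baseline time per token is 1 pass; accelerated time per token is
    # pass_cost_ratio / tokens_per_forward passes.
    return 1.0 - pass_cost_ratio / tokens_per_forward


if __name__ == "__main__":
    for ratio in (1.0, 1.25, 1.5):
        saved = decode_latency_reduction(7.8, ratio)
        print(f"pass cost x{ratio}: {saved:.1%} less decode latency")
```

Under these assumptions the saving stays above 80% as long as an accelerated pass costs less than about 1.56x a baseline pass (7.8 × 0.2). Prompt prefill time is not modeled, so end-to-end gains on short generations will be smaller.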
Fast Signals
Self-hosted MCP server gives local LLMs live financial data [new tool]
r/LocalLLaMA
Open-source MCP server connects any local LLM to SEC filings, 13F reports, insider and congressional trades, short interest data, and FRED macro data — fully self-hosted with no third-party API keys. If you're building fintech agents or want your LLM reasoning over real market structure, this is the missing data layer. Clone, point at your Ollama or llama.cpp endpoint, done.
Torrix: LLM observability with no Postgres, no Redis [new tool]
HN Show
Self-hosted LLM observability tool that requires zero infrastructure dependencies — no Postgres, no Redis, just install and run. Built by a developer frustrated that Langfuse and similar tools require non-trivial infra before you can see what your agents are doing. Worth evaluating if observability setup friction has been the reason you're flying blind in production.
whichllm: hardware-matched local LLM benchmark rankings [new tool]
HN Show
Open-source tool that ranks local LLMs by benchmark performance against your specific hardware configuration. Its 277 HN upvotes signal genuine community need. It eliminates the hours of trial and error that define local model selection today. Bookmark for the next time you're spec'ing hardware or onboarding a new team member to local inference.
garrytan/gstack: 23-role Claude Code agent config, public [workflow]
GitHub Trending
Garry Tan's Claude Code setup — 23 tools configured as CEO, Designer, Eng Manager, Release Manager, Doc Engineer, and QA — released as a public repo. Skip the celebrity angle; the signal is the reference architecture for multi-role agent orchestration within a single coding session. Study the role decomposition, not the name attached to it.
Gemma4 26B MoE at 128k context on a MacBook Air M5 [workflow]
r/LocalLLaMA
A developer unlocked Gemma4 26B on an M5 MacBook Air with 128k context and 4 concurrent batches by adding custom turboquant + rotating KV cache support to MLX. The key engineering gap was MLX's lack of turboquant kernels — this fills it. If you're building on Apple Silicon and hitting context walls, this config is worth replicating.
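For readers unfamiliar with the idea, a rotating KV cache bounds memory by overwriting the oldest entries once a fixed size is reached. The toy below is a plain-Python sketch of that bookkeeping only. It is not MLX's implementation or the developer's patch; the class name, the `keep` parameter (pinning the first few entries), and the use of generic list entries instead of key/value tensors are all assumptions made for illustration.

```python
from typing import Any, List


class RotatingCache:
    """Toy fixed-size cache: pins the first `keep` entries and overwrites
    the oldest of the remaining slots once `max_size` is reached."""

    def __init__(self, max_size: int, keep: int = 0) -> None:
        if not 0 <= keep < max_size:
            raise ValueError("need 0 <= keep < max_size")
        self.max_size = max_size
        self.keep = keep
        self._slots: List[Any] = []
        self._next = keep  # slot to overwrite next once the cache is full

    def append(self, entry: Any) -> None:
        if len(self._slots) < self.max_size:
            self._slots.append(entry)  # still filling up
            return
        self._slots[self._next] = entry  # overwrite the oldest unpinned slot
        self._next += 1
        if self._next == self.max_size:
            self._next = self.keep  # wrap around, skipping pinned slots

    def ordered(self) -> List[Any]:
        """Return entries in logical (oldest to newest) order."""
        if len(self._slots) < self.max_size:
            return list(self._slots)
        pinned = self._slots[: self.keep]
        return (pinned
                + self._slots[self._next:]
                + self._slots[self.keep: self._next])


if __name__ == "__main__":
    cache = RotatingCache(max_size=4, keep=1)
    for token_position in range(8):
        cache.append(token_position)
    print(cache.ordered())
```

Storage never grows past `max_size` entries, which is what lets a long context fit in fixed memory; the cost is that overwritten positions are no longer available to attention.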
Self-training on mistakes: small model hits 80% HumanEval, no labels [research to practice]
r/LocalLLaMA
A small model trained iteratively on its own wrong answers reached 80% HumanEval and beat GPT-3.5 on math — using only self-generated error correction loops, no human labels, no expensive RLHF pipeline. This is a reproducible fine-tuning recipe within reach of any builder with a GPU. Most relevant if you're fine-tuning coding or reasoning models on a budget.
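The post's exact recipe is not reproduced here, but one round of a self-correction data loop might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `generate`, `revise`, and `passes` callables are hypothetical stand-ins for your model and an automatic checker (for code, running tests), and the record fields are just one plausible layout for fine-tuning data.

```python
from typing import Callable, Dict, Iterable, List


def collect_corrections(
    problems: Iterable[str],
    generate: Callable[[str], str],
    revise: Callable[[str, str], str],
    passes: Callable[[str, str], bool],
) -> List[Dict[str, str]]:
    """Gather (wrong attempt, corrected attempt) pairs without human labels.

    generate(problem) -> first attempt from the model (hypothetical).
    revise(problem, attempt) -> model's retry after seeing its failure.
    passes(problem, answer) -> automatic check, e.g. running unit tests.
    """
    examples: List[Dict[str, str]] = []
    for problem in problems:
        attempt = generate(problem)
        if passes(problem, attempt):
            continue  # already correct; this sketch only keeps mistakes
        fixed = revise(problem, attempt)
        if passes(problem, fixed):
            # Keep only corrections the checker confirms.
            examples.append(
                {"prompt": problem, "rejected": attempt, "chosen": fixed}
            )
    return examples
```

Each round's output would feed a fine-tuning step, after which the loop runs again with the updated model. The quality of `passes` bounds the quality of the data, so a weak checker is the main risk in a setup like this.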
RAG eval: the most expensive model finished last [workflow]
r/LocalLLaMA
A developer evaluated multiple models on a production RAG chatbot and found the priciest option performed worst. The post details what actually moved the needle: chunking strategy, retrieval scoring, and query rewriting, rather than raw model capability. Read before throwing more API budget at a retrieval quality problem.
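One way to act on this is to score retrieval on its own, before any generator model is involved. The sketch below is not the post's evaluation harness. It uses word-window chunking and a crude lexical-overlap retriever as stand-ins (both assumptions; a real pipeline would use embeddings), and measures how often the expected answer text lands in the top-k chunks.

```python
from typing import List, Tuple


def chunk_words(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of `size` words, stepping by size - overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i: i + size]) for i in range(0, len(words), step)]


def overlap_score(query: str, chunk: str) -> int:
    """Count distinct lowercase query words that also appear in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def hit_rate_at_k(chunks: List[str],
                  labelled_queries: List[Tuple[str, str]],
                  k: int) -> float:
    """Fraction of queries whose expected answer appears in the top-k chunks."""
    if not labelled_queries:
        raise ValueError("need at least one labelled query")
    hits = 0
    for query, answer in labelled_queries:
        ranked = sorted(chunks, key=lambda c: overlap_score(query, c),
                        reverse=True)
        if any(answer.lower() in c.lower() for c in ranked[:k]):
            hits += 1
    return hits / len(labelled_queries)


if __name__ == "__main__":
    doc = ("The refund window is thirty days. Shipping takes five business "
           "days. Support is open on weekdays.")
    queries = [("how long is the refund window", "thirty days"),
               ("how long does shipping take", "five business days")]
    for size in (3, 6):
        rate = hit_rate_at_k(chunk_words(doc, size), queries, k=1)
        print(f"chunk size {size}: hit rate@1 = {rate}")
```

On this toy document, three-word chunks split each answer across chunk boundaries while six-word chunks keep it intact, so the hit rate changes with no model change at all.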
Radar
Dynamic compute budget + Qwen-35B-A3B nears GPT-5.4-xHigh on HLE
Allocating compute dynamically to hard subproblems and iteratively evolving answer sections brings Qwen-35B-A3B close to GPT-5.4-xHigh on the Humanity's Last Exam benchmark. This is a workflow technique, not a model release, and is potentially replicable on any strong local reasoning model today.
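As a rough illustration of the budgeting half of this idea, the sketch below splits a fixed number of attempts across subproblems in proportion to a difficulty score. It is not the method from the linked work. The difficulty scores are assumed to come from somewhere else (for example, disagreement between samples), and the function name, the `floor` parameter, and the proportional rule are all hypothetical choices.

```python
from typing import Dict


def allocate_budget(difficulty: Dict[str, float], total: int,
                    floor: int = 1) -> Dict[str, int]:
    """Split `total` attempts across subproblems in proportion to their
    difficulty scores, giving every subproblem at least `floor` attempts."""
    if not difficulty:
        return {}
    if any(score < 0 for score in difficulty.values()):
        raise ValueError("difficulty scores must be non-negative")
    spare = total - floor * len(difficulty)
    if spare < 0:
        raise ValueError("total is too small to give every item the floor")
    weight = sum(difficulty.values())
    if weight == 0:
        # No signal: fall back to an even split of the spare budget.
        shares = {name: spare / len(difficulty) for name in difficulty}
    else:
        shares = {name: spare * score / weight
                  for name, score in difficulty.items()}
    alloc = {name: floor + int(share) for name, share in shares.items()}
    # Hand out attempts lost to rounding, largest fractional part first.
    leftover = total - sum(alloc.values())
    by_fraction = sorted(shares, key=lambda n: shares[n] - int(shares[n]),
                         reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc


if __name__ == "__main__":
    print(allocate_budget({"easy": 1, "medium": 3, "hard": 6}, total=13))
```

The floor keeps easy subproblems from being starved entirely. An iterative version would re-estimate difficulty after each round and reallocate the remaining budget.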
AllenAI MolmoAct2: 5B VLA model for robotics, actively iterated
AllenAI is continuously releasing new fine-tunes of MolmoAct2, their 5B vision-language-action model, across diverse robotics datasets. If you're building physical-world agents, this is becoming the open-source baseline for robot control, and the release cadence is worth following.
Fully offline suitcase robot: Gemma 4 E4B, 30 sensors, 200ms TTFT
A developer built a fully offline robot on a Jetson Orin NX 16GB running Gemma 4 E4B with 200ms cached TTFT, 30+ sensors, and zero wireless connectivity. It is the clearest proof yet that capable LLM robotics is achievable on accessible edge hardware without cloud dependency.
Convergence Watch
multi-token prediction [TRENDING]
3 mentions across r/LocalLLaMA
Seven consecutive days of MTP coverage, and today's ceiling just jumped: Orthrus pushes 7.8x tokens/forward on Qwen3-8B with identical output guarantees, while a 1M-token real-world Qwen 3.6 35B MTP test confirms 1.5x sustained speed gains. MTP is no longer experimental — it is becoming the default optimization layer for local inference.