BUILDER SIGNAL BRIEF

Monday, May 04, 2026

Llama.cpp ships multi-token prediction beta; KV-cache gets 6x smaller.

Top Signal

Llama.cpp MTP support hits beta: speculative decoding for everyone [platform change]

r/LocalLLaMA

Multi-token prediction (MTP) draft models are now usable in llama.cpp's beta branch. MTP is a form of speculative decoding: a lightweight draft head proposes several tokens ahead, and the full model verifies those candidates in a single forward pass, keeping only the ones it agrees with. Reported throughput gains are 1.5-2x on consumer hardware without quality loss. This matters because MTP was previously locked behind vLLM or custom forks; now anyone running llama-server or llama-cli gets it. Qwen 3.6 models already ship MTP heads, so the combo is plug-and-play. Action: pull the latest llama.cpp, grab a Qwen 3.6 model with MTP weights, and add the --mtp-draft flag to your server config. Expect the biggest gains on long-generation tasks like coding, where speculative acceptance rates are high.
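
The draft-then-verify loop behind speculative decoding can be sketched in a few lines of Python. This is a toy illustration, not llama.cpp's implementation: `target_next` and `draft_next` are made-up stand-ins for the full model and the draft head, and decoding is greedy. It shows why output quality is unchanged (every kept token is one the full model would have produced) and why fewer full-model passes are needed when the draft is usually right.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    # Hypothetical stand-in for the full model: a deterministic toy rule.
    return (ctx[-1] * 3 + 1) % 7


def draft_next(ctx: List[Token]) -> Token:
    # Hypothetical stand-in for the draft head: agrees with the target
    # everywhere except right after token 5, where it guesses wrong.
    return 0 if ctx[-1] == 5 else target_next(ctx)


def greedy_decode(next_token: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    # Baseline: one full-model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real engine scores
        #    all k positions in one batched forward pass; we count it as one.
        target_passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            accepted.append(expected)  # always keep the target's own token
            if expected != proposal[i]:
                break  # first mismatch: discard the rest of the proposal
        else:
            # Every proposal matched, so the same pass yields one bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[:goal], target_passes


baseline = greedy_decode(target_next, [1], 12)
tokens, passes = speculative_decode(target_next, draft_next, [1], 12)
print(tokens == baseline, passes)
```

In this toy run the speculative output is identical to plain greedy decoding while using fewer full-model passes than generated tokens. Real speedups depend on how often the draft is accepted and on the cost of the draft itself.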

Fast Signals

FastDMS: 6.4x KV-cache compression that's faster than the vLLM baseline [research to practice]

r/LocalLLaMA

Researchers from NVIDIA/Warsaw/Edinburgh released FastDMS, a learned per-head token eviction strategy for KV-cache that achieves 6.4x compression while running faster than vLLM's BF16/FP8 baseline. This means you can serve longer contexts on the same hardware or fit larger batches. Worth benchmarking if you're running any vLLM deployment.
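
To get a feel for what 6.4x means in practice, here is a back-of-envelope KV-cache sizing sketch. The model dimensions (32 layers, 8 KV heads, head size 128, BF16) and the 8 GiB budget are illustrative assumptions, not figures from FastDMS, and the compression ratio is applied as a flat multiplier even though learned per-head eviction will vary by layer, head, and input.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 32,      # assumed model depth
    n_kv_heads: int = 8,     # assumed number of KV heads (GQA)
    head_dim: int = 128,     # assumed per-head dimension
    bytes_per_elem: int = 2, # BF16
) -> int:
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def max_context(budget_bytes: int, compression: float = 1.0) -> int:
    # Tokens of context that fit in the budget if the cache shrinks
    # by `compression` (treated as a uniform ratio for simplicity).
    per_token = kv_cache_bytes(1)
    return int(budget_bytes * compression // per_token)


budget = 8 * 2**30  # hypothetical 8 GiB reserved for KV cache
print(kv_cache_bytes(1))          # bytes per cached token
print(max_context(budget))        # uncompressed
print(max_context(budget, 6.4))   # with a flat 6.4x reduction
```

Under these assumptions the same budget holds 6.4 times as many tokens, which can be spent on longer contexts or on more concurrent sequences.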

LLMSearchIndex: 200M-page local web search for RAG, no API keys [new tool]

r/LocalLLaMA

Open-source library providing a pre-built search index of 200M+ web pages that runs locally. Eliminates the need for Bing/Google search APIs in RAG pipelines. If you're building agents that need web grounding without per-query costs or rate limits, this is a drop-in alternative.

APEX MoE quants ship for 25+ new models, with a new I-Nano tier [workflow]

r/LocalLLaMA

APEX, the MoE-aware mixed-precision quantization strategy, now covers 25+ models beyond the original Qwen 3.5 35B-A3B. The new I-Nano tier applies aggressive quantization specifically to MoE inactive experts. If you're running MoE models locally, APEX quants are reported to run 33% faster than uniform quantization with minimal quality loss.

n8n-MCP: let Claude/Cursor build n8n workflows via MCP [new tool]

GitHub Trending

MCP server that exposes n8n's workflow builder to Claude Desktop, Claude Code, Cursor, and Windsurf. Describe an automation in natural language and the agent creates the n8n workflow. Useful if you're already in the n8n ecosystem and want AI-assisted workflow creation without switching contexts.

Gemma 4 GGUFs fixed: chat template bug that stripped tool schemas resolved [platform change]

r/LocalLLaMA

The Gemma 4 chat template bug that silently dropped tool parameter schemas from function-calling prompts has been patched. Updated GGUFs from bartowski and others are live. If you tested Gemma 4 for tool use and got bad results, re-download and retry: the model is significantly more capable than the broken quants suggested.
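
A quick way to catch this class of bug is to render a prompt with tools attached and confirm the parameter names survive. The sketch below is a crude substring check under stated assumptions: the tool definition uses the common OpenAI-style function schema, and the two "rendered" strings are hand-written stand-ins rather than real Gemma 4 template output. In practice you would substitute the prompt your runtime actually produces.

```python
import json
from typing import Dict, List, Tuple


def missing_tool_params(rendered_prompt: str, tools: List[Dict]) -> List[Tuple[str, str]]:
    # Return (tool_name, param_name) pairs whose parameter name never
    # appears in the rendered prompt. Substring matching is crude: a
    # parameter name that also occurs in a description would be missed.
    missing = []
    for tool in tools:
        fn = tool["function"]
        for param in fn.get("parameters", {}).get("properties", {}):
            if param not in rendered_prompt:
                missing.append((fn["name"], param))
    return missing


# Hypothetical tool definition used only for this illustration.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather.",
            "parameters": {
                "type": "object",
                "properties": {
                    "latitude": {"type": "number"},
                    "longitude": {"type": "number"},
                },
                "required": ["latitude", "longitude"],
            },
        },
    }
]

# Stand-ins for template output: one keeps the schema, one drops it.
healthy_prompt = "Tools: " + json.dumps(tools)
broken_prompt = "Tools: get_weather - Look up the current weather."

print(missing_tool_params(healthy_prompt, tools))
print(missing_tool_params(broken_prompt, tools))
```

An empty result means every parameter name made it into the prompt; any pairs returned point at schemas the template dropped.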

OpenAI publishes architecture for low-latency voice AI at scale [workflow]

HN Front Page

Technical deep-dive on how OpenAI serves real-time voice with consistent low latency. Covers their streaming architecture, speculative execution for TTS, and how they handle interruption detection. Useful reference architecture if you're building any real-time audio pipeline, even with local models.

Radar

LocalVQE: 1M-param model for real-time echo/noise cancellation

Tiny audio model that handles acoustic echo cancellation and noise suppression in real time on CPU. If you're building local voice agents, this solves the preprocessing step without cloud APIs or heavy DSP libraries.

DeepSeek-TUI: single-binary terminal coding agent

Terminal-native coding agent built for DeepSeek V4's 1M-token context with prefix caching. Ships an MCP client, sandbox, and durable sessions, with no Node/Python runtime needed. Worth watching as the 'single binary agent' pattern matures.

Exo disaggregated prefill: DGX Spark + M3 Ultra

Exo now supports splitting inference across machines: prefill runs on DGX Spark (4x the matmul performance) and decode runs on M3 Ultra (better memory bandwidth). Points toward a future where heterogeneous hardware clusters become practical for local inference.
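
The reasoning behind the split can be illustrated with a very rough cost model: prefill is treated as compute-bound and decode as memory-bandwidth-bound. Everything numeric here is a placeholder. The 4.0 compute figure echoes the "4x matmul" claim, but the bandwidth ratio and the work units are assumptions chosen for the example, and the cost of shipping the KV cache between machines is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    compute: float    # relative matmul throughput (arbitrary units)
    bandwidth: float  # relative memory bandwidth (arbitrary units)


def request_time(prefill_dev: Device, decode_dev: Device,
                 prefill_work: float, decode_work: float) -> float:
    # Prefill time scales with compute; decode time scales with bandwidth.
    # KV-cache transfer between devices is deliberately left out.
    return prefill_work / prefill_dev.compute + decode_work / decode_dev.bandwidth


# Hypothetical relative specs, not measured numbers.
compute_box = Device("compute-heavy box", compute=4.0, bandwidth=1.0)
bandwidth_box = Device("bandwidth-heavy box", compute=1.0, bandwidth=2.0)

prefill_work, decode_work = 8.0, 8.0  # arbitrary work units
times = {
    "compute-heavy only": request_time(compute_box, compute_box, prefill_work, decode_work),
    "bandwidth-heavy only": request_time(bandwidth_box, bandwidth_box, prefill_work, decode_work),
    "split": request_time(compute_box, bandwidth_box, prefill_work, decode_work),
}
for setup, t in times.items():
    print(f"{setup}: {t}")
```

With these placeholder numbers the split configuration finishes soonest because each phase runs on the device that suits it. Whether that holds on real hardware depends on the interconnect and on the prompt-to-output length ratio.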

Convergence Watch

qwen 3.6

5 mentions across r/LocalLLaMA, GitHub Trending

Day 7 of sustained Qwen 3.6 discussion. Community has settled on 35B-A3B as the sweet spot for local agentic coding, with 27B for constrained hardware. MTP support in llama.cpp makes this even more relevant.

amd local inference

3 mentions across r/LocalLLaMA

Strix Halo refresh to 192GB unified memory keeps AMD in the conversation. Combined with Mistral Medium 3.5 benchmarks on Halo (slow but functional for 128B models), AMD is becoming a viable local inference platform for models that don't fit in GPU VRAM.

local agentic coding

4 mentions across r/LocalLLaMA, GitHub Trending, HN Show

Third consecutive day. The narrative shifted from 'can local models code?' to 'which harness works best?' — with users reporting Cursor costs of $80/week driving migration. DeepSeek-TUI, OpenCode, and ruflo all trending as alternatives.
