Llama.cpp ships multi-token prediction beta; KV-cache gets 6.4x smaller.
Top Signal
Llama.cpp MTP support hits beta — speculative decoding for everyone
platform change
r/LocalLLaMA
Multi-token prediction (MTP) draft models are now usable in llama.cpp's beta branch. MTP lets the model propose several tokens per forward pass: lightweight MTP heads draft candidate tokens, and the main model verifies them, keeping only the ones that match its own predictions. The reported result is 1.5-2x throughput on consumer hardware without quality loss. This matters because MTP was previously locked behind vLLM or custom forks; now anyone running llama-server or llama-cli gets it. Qwen3.6 models already ship MTP heads, so the combo is plug-and-play. Action: pull the latest llama.cpp, grab a Qwen3.6 model with MTP weights, and add the --mtp-draft flag to your server config. Expect the biggest gains on long-generation tasks like coding, where speculative acceptance rates are high.
Read more →
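For readers new to speculative decoding, the draft-and-verify loop can be sketched in a few lines. This is a toy illustration with greedy decoding over integer tokens, not llama.cpp's implementation: the names speculative_step, generate, target, and draft are hypothetical, and the two "models" are stand-in functions.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-and-verify round and return the tokens it accepts."""
    # 1. Draft k candidate tokens with the cheap model.
    proposed: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Verify with the target model. A real engine scores all k positions
    #    in one batched forward pass; this toy version loops for clarity.
    accepted: List[Token] = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft token matched, so the target contributes one bonus token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             n_tokens: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_tokens; also return how many verify rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


def target(ctx: List[Token]) -> Token:
    """Stand-in for the full model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[Token]) -> Token:
    """Stand-in for the draft head: agrees with target except after a 4."""
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate(target, draft, [0], n_tokens=12, k=4)
    print(tokens, "verify rounds:", rounds)
```

The output always matches what the target alone would produce under greedy decoding; the gain is that each verify round can emit several tokens, which is why high acceptance rates translate into throughput.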
Fast Signals
FastDMS: 6.4x KV-cache compression that's faster than vLLM baseline
research to practice
r/LocalLLaMA
Researchers from NVIDIA/Warsaw/Edinburgh released FastDMS, a learned per-head token eviction strategy for KV-cache that achieves 6.4x compression while running faster than vLLM's BF16/FP8 baseline. This means you can serve longer contexts on the same hardware or fit larger batches. Worth benchmarking if you're running any vLLM deployment.
Link →
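To put the 6.4x figure in context, a back-of-envelope KV-cache estimate helps. The sketch below uses the standard size formula for an uncompressed cache; the model dimensions are hypothetical placeholders, and treating 6.4x as a uniform ratio is a simplifying assumption (a learned per-head eviction scheme will not compress every head equally).

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2,
                   batch: int = 1) -> int:
    """Uncompressed KV-cache size: keys plus values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical model shape (not taken from the FastDMS release):
# 32 layers, 8 KV heads of dimension 128, BF16 (2 bytes per element).
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)

COMPRESSION = 6.4  # ratio reported for FastDMS, assumed uniform here
compressed = baseline / COMPRESSION

GIB = 2 ** 30
print(f"baseline cache:   {baseline / GIB:.3f} GiB")
print(f"compressed cache: {compressed / GIB:.3f} GiB")
print(f"context in the same memory budget: {int(32_768 * COMPRESSION)} tokens")
```

The same budget can be spent either way: longer contexts per request, or more concurrent requests at the original context length.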
LLMSearchIndex: 200M-page local web search for RAG, no API keys
new tool
r/LocalLLaMA
Open-source library providing a pre-built search index of 200M+ web pages that runs locally. Eliminates the need for Bing/Google search APIs in RAG pipelines. If you're building agents that need web grounding without per-query costs or rate limits, this is a drop-in alternative.
Link →
APEX MoE quants ship 25+ new models with new I-Nano tier
workflow
r/LocalLLaMA
APEX, the MoE-aware mixed-precision quantization strategy, now covers 25+ models beyond the original Qwen 3.5 35B-A3B. The new I-Nano tier pushes aggressive quantization specifically for MoE inactive experts. If you're running MoE models locally, APEX quants give 33% faster inference with minimal quality loss compared to uniform quantization.
Link →
n8n-MCP: let Claude/Cursor build n8n workflows via MCP
new tool
GitHub Trending
MCP server that exposes n8n's workflow builder to Claude Desktop, Claude Code, Cursor, and Windsurf. Describe an automation in natural language and the agent creates the n8n workflow. Useful if you're already in the n8n ecosystem and want AI-assisted workflow creation without switching contexts.
Link →
Gemma 4 GGUFs fixed — chat template bug that stripped tool schemas resolved
platform change
r/LocalLLaMA
The Gemma 4 chat template bug that silently dropped tool parameter schemas from function calling has been patched. Updated GGUFs from bartowski and others are live. If you tested Gemma 4 for tool use and got bad results, re-download and retry: the model is significantly more capable than the broken quants suggested.
Link →
OpenAI publishes architecture for low-latency voice AI at scale
workflow
HN Front Page
Technical deep-dive on how OpenAI serves real-time voice with consistent low latency. Covers their streaming architecture, speculative execution for TTS, and how they handle interruption detection. Useful reference architecture if you're building any real-time audio pipeline, even with local models.
Link →
Radar
LocalVQE: 1M-param model for real-time echo/noise cancellation
Tiny audio model that handles acoustic echo cancellation and noise suppression in real time on CPU. If you're building local voice agents, this solves the preprocessing step without cloud APIs or heavy DSP libraries.
Link →
DeepSeek-TUI: single-binary terminal coding agent
Terminal-native coding agent built for DeepSeek V4's 1M-token context with prefix caching. Ships MCP client, sandbox, and durable sessions — no Node/Python runtime needed. Worth watching as the 'single binary agent' pattern matures.
Link →
Exo disaggregated prefill: DGX Spark + M3 Ultra
Exo now supports splitting inference across machines: prefill runs on a DGX Spark (cited at 4x matmul perf) and decode on an M3 Ultra (better memory bandwidth). The split works because prefill is compute-bound while decode is memory-bandwidth-bound. Points toward a future where heterogeneous hardware clusters become practical for local inference.
Link →
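The reasoning behind splitting prefill and decode can be sketched with a rough roofline-style estimate. Everything numeric below is a hypothetical placeholder rather than a spec for the DGX Spark or M3 Ultra, the Machine class and helper names are illustrative, and the model ignores the cost of shipping the KV cache between machines.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    flops_per_s: float      # sustained matmul throughput
    mem_bytes_per_s: float  # sustained memory bandwidth


def prefill_seconds(m: Machine, active_params: float, prompt_tokens: int) -> float:
    """Prefill is compute-bound: about 2 FLOPs per active parameter per token."""
    return 2.0 * active_params * prompt_tokens / m.flops_per_s


def decode_seconds(m: Machine, weight_bytes: float, new_tokens: int) -> float:
    """Decode is bandwidth-bound: each new token streams the active weights once."""
    return weight_bytes * new_tokens / m.mem_bytes_per_s


# Placeholder hardware: one box with 4x the compute, one with more bandwidth.
compute_box = Machine("compute-heavy", flops_per_s=4e14, mem_bytes_per_s=2.5e11)
bandwidth_box = Machine("bandwidth-heavy", flops_per_s=1e14, mem_bytes_per_s=8e11)

# Placeholder workload: 3B active parameters at 1 byte each,
# an 8000-token prompt, and 500 generated tokens.
ACTIVE_PARAMS, WEIGHT_BYTES = 3e9, 3e9
PROMPT_TOKENS, NEW_TOKENS = 8000, 500


def total_seconds(prefill_on: Machine, decode_on: Machine) -> float:
    return (prefill_seconds(prefill_on, ACTIVE_PARAMS, PROMPT_TOKENS)
            + decode_seconds(decode_on, WEIGHT_BYTES, NEW_TOKENS))


all_compute = total_seconds(compute_box, compute_box)
all_bandwidth = total_seconds(bandwidth_box, bandwidth_box)
split = total_seconds(compute_box, bandwidth_box)

print(f"all on compute box:   {all_compute:.3f} s")
print(f"all on bandwidth box: {all_bandwidth:.3f} s")
print(f"disaggregated split:  {split:.3f} s")
```

Under these assumptions the split wins because each phase runs on the machine whose bottleneck resource is stronger; in practice the KV-cache transfer time decides whether the advantage survives.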
Convergence Watch
qwen 3.6
TRENDING
5 mentions across r/LocalLLaMA, GitHub Trending
Day 7 of sustained Qwen 3.6 discussion. Community has settled on 35B-A3B as the sweet spot for local agentic coding, with 27B for constrained hardware. MTP support in llama.cpp makes this even more relevant.
amd local inference
TRENDING
3 mentions across r/LocalLLaMA
Strix Halo refresh to 192GB unified memory keeps AMD in the conversation. Combined with Mistral Medium 3.5 benchmarks on Halo (slow but functional for 128B models), AMD is becoming a viable local inference platform for models that don't fit in GPU VRAM.
local agentic coding
TRENDING
4 mentions across r/LocalLLaMA, GitHub Trending, HN Show
Third consecutive day. The narrative shifted from 'can local models code?' to 'which harness works best?' — with users reporting Cursor costs of $80/week driving migration. DeepSeek-TUI, OpenCode, and ruflo all trending as alternatives.
STALE: Latent Space newest item is >48h old