Google ships MTP draft models for Gemma 4 — speculative decoding goes mainstream.
Top Signal
Google releases Gemma 4 multi-token prediction draft models
research to practice
HN Front Page, r/LocalLLaMA
Google officially released MTP (multi-token prediction) draft models for Gemma 4, enabling 2-3x faster inference through speculative decoding without quality loss. The draft models are small companions that predict several tokens ahead; the main model then verifies those tokens in parallel and keeps only the ones it agrees with, which is why output quality is unchanged. The release lands the same day community members demonstrated MTP working on AMD Strix Halo via llama.cpp PR #22673, suggesting cross-platform support is imminent. For builders: if you're serving Gemma 4 locally or via vLLM, download the MTP drafters now. The technique works with existing quantized GGUFs. Combined with yesterday's llama.cpp MTP beta, speculative decoding is shifting from research curiosity to default serving configuration for any latency-sensitive local deployment.
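To make the draft-then-verify loop concrete, here is a minimal sketch of greedy speculative decoding. It is a toy illustration, not Gemma 4's or llama.cpp's implementation: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, and the "models" are plain functions over integer tokens. A real server would check all drafted positions in one batched forward pass; the loop below only mimics that check.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the large model (deterministic, greedy)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical cheap drafter: agrees with the target except at some positions."""
    guess = toy_target(ctx)
    return (guess + 1) % 50 if len(ctx) % 5 == 0 else guess


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-and-verify round and return the tokens accepted in it."""
    # 1. The drafter proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposed position in order.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one more token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Generate max_new tokens using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new]


def plain_generate(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print(generate(toy_target, toy_draft, prompt, 20) ==
          plain_generate(toy_target, prompt, 20))
```

Every accepted token is one the target would have produced itself, so in this greedy setting the speculative output matches plain decoding token for token. The speedup comes from how often the drafter's guesses are accepted.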
Fast Signals
Computer Use is 45x more expensive than structured APIs
workflow
HN Front Page
Reflex published a detailed cost analysis showing computer-use agents burn 45x more tokens than equivalent structured API calls for the same UI automation tasks. If you're building agents that interact with web UIs, this quantifies the ROI of wrapping target apps in proper APIs or MCP servers rather than screenshotting them.
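A back-of-the-envelope sketch of what a 45x token multiplier means for a monthly bill. Only the 45x ratio comes from the analysis; the per-task token count, task volume, and token price below are hypothetical placeholders to replace with your own numbers.

```python
def monthly_cost_usd(tokens_per_task: int, tasks_per_month: int,
                     usd_per_million_tokens: float) -> float:
    """Token spend per month for a given per-task token budget."""
    return tokens_per_task * tasks_per_month * usd_per_million_tokens / 1_000_000


COMPUTER_USE_MULTIPLIER = 45    # ratio reported in the cost analysis

# Hypothetical workload, for illustration only.
API_TOKENS_PER_TASK = 2_000     # assumed tokens for a structured API call
TASKS_PER_MONTH = 10_000        # assumed task volume
USD_PER_MILLION_TOKENS = 3.0    # assumed blended token price

api_cost = monthly_cost_usd(API_TOKENS_PER_TASK, TASKS_PER_MONTH,
                            USD_PER_MILLION_TOKENS)
computer_use_cost = monthly_cost_usd(API_TOKENS_PER_TASK * COMPUTER_USE_MULTIPLIER,
                                     TASKS_PER_MONTH, USD_PER_MILLION_TOKENS)

if __name__ == "__main__":
    print(f"structured API: ${api_cost:,.2f}/month")
    print(f"computer use:   ${computer_use_cost:,.2f}/month")
    print(f"budget freed by wrapping the app in an API: "
          f"${computer_use_cost - api_cost:,.2f}/month")
```

The difference between the two figures is a rough ceiling on what it is worth spending per month to build and maintain an API or MCP wrapper for the target app.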
vibevoice.cpp: TTS + ASR with diarization in pure C++, no Python
new tool
r/LocalLLaMA
Microsoft's VibeVoice model (speech-to-speech with one-shot voice cloning) has been ported to ggml/C++, running on CPU/CUDA/Metal/Vulkan with zero Python dependencies at inference. If you need local voice pipelines without the Python overhead, this is now the most complete single-binary option.
Qwen3.6 27B FP8 handles 200k context at 80 TPS on single 48GB card
workflow
r/LocalLLaMA
A user demonstrated Qwen3.6 27B in FP8 with a full BF16 KV cache running a 200k-token context at 80 tokens/sec on a single RTX PRO 5000 48GB. The key insight: with enough VRAM, skip KV quantization entirely. The quality difference at long context is substantial versus quantized-KV approaches on 24GB cards.
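To see why KV precision matters for VRAM at 200k tokens, here is a rough sizing sketch using the standard formula for a conventional attention KV cache. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Qwen3.6's actual configuration. Read the real values from the model's config file, and note that architectures that do not cache K/V in every layer will differ.

```python
def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """Size of a standard KV cache: one K and one V vector per layer, head, and token."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 2 ** 30

# Hypothetical architecture numbers, for illustration only.
ATTN_LAYERS = 16     # assumed number of layers that keep a KV cache
KV_HEADS = 4         # assumed grouped-query KV heads
HEAD_DIM = 256       # assumed per-head dimension
SEQ_LEN = 200_000    # context length from the post

bf16_bytes = kv_cache_bytes(ATTN_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, bytes_per_elem=2)
int8_bytes = kv_cache_bytes(ATTN_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, bytes_per_elem=1)

if __name__ == "__main__":
    print(f"BF16 KV cache:  {bf16_bytes / GIB:.1f} GiB")
    print(f"8-bit KV cache: {int8_bytes / GIB:.1f} GiB")
```

Cache size scales linearly with context length and with bytes per element. On a smaller card, KV quantization is often the only way to fit a long context, while a 48GB card can leave room for full precision.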
Google achieves 3x TPU inference speedup via diffusion-style speculative decoding
research to practice
r/LocalLLaMA
The Google Developers Blog details a diffusion-style speculative decoding approach on TPUs that achieves 3x speedups for LLM inference. Unlike standard draft-model speculation, this uses a diffusion process to generate multiple candidate tokens simultaneously. Relevant if you're deploying on TPU infrastructure or designing custom serving stacks.
Qwen3.6 community ships merged chat template fix for tool calling
workflow
r/LocalLLaMA
Two independent contributors (froggeric and allanchan339) released fixed and merged chat templates for Qwen3.6 that resolve tool-calling issues in vLLM and llama.cpp. If you've been hitting silent failures with Qwen3.6 function calling, re-download the templates now.
Radar
cocoindex: incremental engine for long-horizon agents
Trending on GitHub — an incremental indexing engine designed for agents that need to maintain state across long workflows over enterprise corpora. Worth watching if you're building agents that process evolving document sets.
OmniVoice: one-shot voice cloning that actually works
Getting an enthusiastic community response for dead-simple one-shot voice cloning. If you need voice synthesis in a local pipeline without the complexity of multi-step training, the community is pointing to this as the easiest on-ramp.
Heretic 1.3: reproducible uncensoring with benchmarks
Adds reproducible model outputs and integrated benchmarking to the censorship-removal tool, plus reduced peak VRAM. Useful if you need uncensored local models with verifiable quality.
Convergence Watch
multi-token prediction
TRENDING
3 mentions across HN Front Page, r/LocalLLaMA, Google Developers Blog
MTP is arriving from several directions at once: Google's official Gemma 4 MTP drafters, yesterday's llama.cpp MTP beta, community demos on AMD hardware, and Google's diffusion-style speculative decoding write-up on the Developers Blog. Speculative decoding is becoming a default serving optimization, not an experiment.
qwen 3.6
TRENDING
5 mentions across r/LocalLLaMA, GitHub Trending
Qwen 3.6 continues to dominate local LLM discussion. Today's signal is consolidation: fixed chat templates, long-context benchmarks on single GPUs, and head-to-head comparisons with Gemma 4. The ecosystem is maturing around it as the default open dense model.
local agentic coding
TRENDING
3 mentions on r/LocalLLaMA
Third consecutive day with multiple posts on running coding agents locally. Today's comparison of Claude Code vs OpenCode with Qwen3.6:27b found equivalent output, a sign that local models are crossing the coding-agent viability threshold.