MTP's dirty secret: it speeds up coding tasks but slows creative generation.
Top Signal
MTP Benchmarks Reveal Task-Dependent Speedups: Coding Wins, Creative Loses
research to practice
r/LocalLLaMA
New benchmarks on Qwen 3.6 27B MTP quants show that multi-token prediction's throughput gains are task-dependent. Coding and structured output tasks see the expected 2-2.5x speedup, but creative writing and open-ended generation actually get slower because fewer drafted tokens are accepted. The finding matters because MTP has been the hottest local inference optimization all week, and builders deploying speculative decoding need to route requests appropriately. If you're building a coding agent, enable MTP aggressively. If you're building a creative writing tool or chatbot, stick to standard autoregressive decoding or implement task-aware routing that switches decoding strategy based on the prompt type.
Read more →
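A minimal sketch of the task-aware routing idea, assuming a simple keyword heuristic stands in for a real prompt classifier. The names `DecodeConfig`, `route`, the hint lists, and the `draft_tokens=3` value are all hypothetical and not taken from the benchmark post or from any inference server's API; the resulting config would still need to be mapped onto whatever flags your runtime exposes.

```python
from dataclasses import dataclass

# Hypothetical keyword lists; a production router would more likely use a
# small classifier or explicit metadata from the calling application.
CODE_HINTS = ("def ", "class ", "function", "json", "sql", "regex",
              "refactor", "traceback", "unit test", "compile")
CREATIVE_HINTS = ("story", "poem", "essay", "lyrics", "brainstorm",
                  "roleplay", "imagine")


@dataclass(frozen=True)
class DecodeConfig:
    use_mtp: bool
    draft_tokens: int  # tokens drafted per step; 0 means plain autoregressive


def count_hits(text: str, hints: tuple) -> int:
    """Count how many times any hint substring occurs in the text."""
    return sum(text.count(hint) for hint in hints)


def route(prompt: str) -> DecodeConfig:
    """Enable MTP only when the prompt looks like code or structured output."""
    text = prompt.lower()
    code_hits = count_hits(text, CODE_HINTS)
    creative_hits = count_hits(text, CREATIVE_HINTS)
    if code_hits > creative_hits:
        return DecodeConfig(use_mtp=True, draft_tokens=3)  # illustrative value
    # Ties and unknown prompts fall back to standard decoding.
    return DecodeConfig(use_mtp=False, draft_tokens=0)


if __name__ == "__main__":
    print(route("Refactor this function and return JSON"))
    print(route("Write a short story about a lighthouse keeper"))
```

Defaulting ties to standard decoding follows the item's advice for chatbots: when the prompt type is unclear, skip the draft step.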
Fast Signals
llama.cpp b9095: NCCL-Free Tensor Parallelism on Dual Blackwell GPUs
platform change
r/LocalLLaMA
llama.cpp finally enables `-sm tensor` on dual consumer Blackwell PCIe GPUs without an NCCL dependency. This removes the biggest blocker for multi-GPU local inference on the new 5060 Ti and 5080 cards. If you've been waiting to split models across two Blackwell GPUs, update now.
Link →
DeepSeek-V4-Flash Hits 85 tok/s at 524K Context on 2x RTX PRO 6000
workflow
r/LocalLLaMA
A W4A16+FP8 quant of DeepSeek-V4-Flash with fixed MTP heads achieves 85 tok/s at 524K context and 111 tok/s at 128K on dual RTX PRO 6000 Max-Q. The key trick: the original MTP head quantization was broken, requiring a custom fix. This is near-API-speed performance for a 524K context window running locally.
Link →
NVIDIA Star Elastic: One Checkpoint Slices Into 30B, 23B, and 12B Models
research to practice
r/LocalLLaMA
NVIDIA released Star Elastic, a single model checkpoint you can slice into 30B, 23B, or 12B parameter reasoning models at inference time with zero additional training. This enables serving different capability tiers from one set of weights, simplifying deployment pipelines that need to trade quality against cost dynamically.
Link →
agentmemory: Persistent Memory Layer for AI Coding Agents
new tool
GitHub Trending
Trending on GitHub: a persistent memory system for coding agents, benchmarked against real-world tasks. It is designed to give agents cross-session context without re-reading entire codebases. Worth evaluating if you're building or extending coding agent tooling.
Link →
DS4: antirez Ships DeepSeek V4 Flash with 1M Context on Metal
new tool
r/LocalLLaMA
The Redis creator's DS4 project now runs DeepSeek V4 Flash with a 1M token context window on Apple Silicon via Metal. Significant for Mac-based builders who need large-context local inference without CUDA.
Link →
Radar
cull: Open-Source Image Dataset Curation Pipeline
Scraping, classification, and captioning in one tool for training LoRAs and building reference libraries. Solves the tedious manual curation step that blocks most fine-tuning projects.
Link →
gemma-4-26b-a4b Excels at Three.js Generation
Users report that Google's small MoE model reliably one-shots complex three.js scenes. This suggests a capability pocket worth testing if you're building visual/3D generation pipelines with local models.
Link →
Convergence Watch
multi-token prediction
TRENDING
4 mentions across r/LocalLLaMA, GitHub Trending, HN Front Page
Day 6 of sustained MTP discussion. Today's new signal: benchmarks showing MTP is task-dependent (coding yes, creative no). The ecosystem is maturing from 'does it work?' to 'when exactly should I use it?', a sign the technique is crossing into production readiness.
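One way to reason about the 'when exactly should I use it?' question is the standard back-of-envelope model for speculative decoding, sketched here under strong assumptions: each drafted token is accepted independently with probability `alpha`, each draft token costs a fixed fraction `draft_cost` of a normal decode step, and verification adds a fixed `verify_overhead`. The function names and every number below are illustrative assumptions, not measurements from the Qwen benchmarks.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    alpha; the step always yields at least one token from the main model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, draft_cost: float,
            verify_overhead: float = 0.0) -> float:
    """Estimated throughput ratio versus plain autoregressive decoding.

    draft_cost and verify_overhead are expressed as fractions of one
    normal decode step (both are assumptions of this sketch).
    """
    step_cost = 1.0 + k * draft_cost + verify_overhead
    return expected_tokens(alpha, k) / step_cost


if __name__ == "__main__":
    # Illustrative acceptance rates only: high for code-like text,
    # low for open-ended prose.
    for label, alpha in (("code-like", 0.8), ("creative-like", 0.3)):
        ratio = speedup(alpha, k=3, draft_cost=0.05, verify_overhead=0.4)
        print(f"{label}: alpha={alpha} -> {ratio:.2f}x")
```

In this model the ratio drops below 1 whenever the expected tokens per step fall short of the step cost, which is the regime the benchmarks describe for creative prompts. Measuring your own acceptance rate per task type is the only way to fill in `alpha` honestly.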
local agentic coding
TRENDING
3 mentions across r/LocalLLaMA, GitHub Trending, HN Front Page
7th consecutive day across 3+ sources. Today's additions: Pi as an OpenCode alternative, llama.cpp RPC for distributed inference, and the 'Local AI needs to be the norm' post hitting 401 points. The toolchain for local coding agents is consolidating rapidly.