MTP lands in llama.cpp master today — local inference just got meaningfully faster without new hardware.
Top Signal
Multi-Token Prediction merges into llama.cpp master (PR #22673)
platform change
r/LocalLLaMA
PR #22673 hit llama.cpp master today, making Multi-Token Prediction (MTP) official after weeks of community testing on pre-release branches. MTP lets llama.cpp predict multiple tokens per forward pass, much like speculative decoding but without a separate draft model. Benchmarks on Strix Halo show Qwen3.6 27B going from 87.4s to 77.4s wall-clock (about 11.4% less time), and community reports put token/sec gains at 1.2–1.5x on coding tasks. Creative tasks and some 35B dual-GPU configs show mixed or negative results. To use it: pull llama.cpp master, grab an MTP-tagged quant (unsloth has Qwen3.6 MTP variants), and add `--spec-type draft-mtp --spec-draft-n-max 3` to your llama-server command. If you run local inference for coding agents, this is the fastest free speed upgrade available right now, with no model quality tradeoff.
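For anyone sanity-checking their own before/after runs, here is a minimal Python sketch of the arithmetic behind the Strix Halo numbers, plus a helper that assembles the llama-server arguments quoted above. The helper names and the model path are hypothetical placeholders; only `-m` and the two `--spec-*` flags are taken from llama.cpp and the item above.

```python
import shlex


def speedup(baseline_s, mtp_s):
    """Return (fraction of wall-clock time saved, throughput ratio)."""
    saved = (baseline_s - mtp_s) / baseline_s
    ratio = baseline_s / mtp_s
    return saved, ratio


def mtp_server_args(model_path, draft_n_max=3):
    """Build an argument list for llama-server with MTP drafting enabled.

    model_path is a placeholder for whichever MTP-tagged quant you downloaded.
    """
    return [
        "llama-server",
        "-m", model_path,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]


# Wall-clock figures from the Strix Halo report above.
saved, ratio = speedup(87.4, 77.4)
print(f"time saved: {saved:.1%}, throughput ratio: {ratio:.2f}x")

# Hypothetical model path, shown only to illustrate the command shape.
print(shlex.join(mtp_server_args("models/qwen3.6-27b-mtp.gguf")))
```

With the reported times this works out to roughly 11.4% less wall-clock time, or about a 1.13x throughput ratio, which sits just below the 1.2–1.5x range the community reports for coding tasks.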
Fast Signals
DeepSeek-V4-Flash makes steering vectors practical for builders
research to practice
HN Front Page
Sean Goedecke shows that DeepSeek-V4-Flash's open weights finally make activation steering usable in production: you add vectors to the residual stream to control tone, persona, or behavior without prompting or fine-tuning. Unlike with closed frontier models, you can inspect and modify internal representations. Actionable if you need deterministic behavioral control beyond system prompts.
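The core operation is small enough to sketch in plain Python. This is a toy illustration, not code from the article or from DeepSeek: it assumes the common recipe of deriving a steering vector as the difference between mean activations on two contrasting prompt sets, then adding a scaled copy of it to a hidden state. The three-dimensional activations and all function names are made up for the example.

```python
def mean_vector(rows):
    """Element-wise mean over a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive, negative):
    """Difference of means: the direction from 'negative' toward 'positive'."""
    pos = mean_vector(positive)
    neg = mean_vector(negative)
    return [p - q for p, q in zip(pos, neg)]


def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to one residual-stream activation."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations (hypothetical values) captured at one layer for prompts
# that do and do not show the target behavior.
positive = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]

vector = steering_vector(positive, negative)
steered = apply_steering([0.5, 0.5, 0.5], vector, alpha=0.5)
print(vector, steered)
```

In a real model the same addition would happen inside a forward hook on a chosen layer, and `alpha` is the knob that trades steering strength against output coherence.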
SANA-WM: 2.6B open model generates 1-minute 720p video
new tool
HN Front Page
NVIDIA Labs releases SANA-WM, a 2.6B world model capable of generating 60-second 720p video — dramatically smaller than competing architectures. Open-source and runnable locally. If video generation is part of your pipeline, this is the most accessible high-resolution open model available.
Delta-Mem: efficient online memory updates for long-running agents
research to practice
HN Front Page
A new arXiv paper proposes Delta-Mem, a method for updating an LLM's working memory without reprocessing the full context, so agent memory stays fresh through incremental updates. Its 189 HN points suggest the approach resonates. Worth reading if you're building agents that operate over extended sessions where context growth is a cost or latency problem.
anthropics/skills goes public alongside agentskills.io standard
platform change
GitHub Trending
Anthropic published the full skills repo for Claude Code agents, tied to the emerging agentskills.io open standard for composable agent skills. Trending on GitHub today. If you build Claude-based tooling or agentic workflows, this repo is the canonical reference for the skill packaging format that's becoming the default.
Qwen3.6-35B beats Gemini 2.5 Pro on Terminal-Bench 2.0
emerging signal
r/LocalLLaMA
Running little-coder × Qwen3.6-35B-A3B locally scores 24.6% on Terminal-Bench 2.0, ahead of both Gemini 2.5 Pro on Gemini CLI (19.6%) and Qwen3-Coder-480B (23.9%). A 35B local model outperforming a frontier model on real terminal and agentic tasks is a meaningful data point for teams weighing cloud vs. local for coding agents.
datasette-llm-limits: rate-limit LLM calls per user in Datasette apps
new tool
Simon Willison
Simon Willison ships a plugin that enforces per-user or per-query LLM call limits in datasette-llm deployments. Trivial to add but solves a real production problem: runaway LLM costs from shared data apps. Bookmark if you expose Datasette to end users with LLM features.
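To make the idea concrete, here is a generic sliding-window limiter in Python. It is not the plugin's API or implementation, just a minimal sketch of per-user call limiting under assumed names (`PerUserLimiter`, `allow`) and an assumed policy of at most N calls per rolling window.

```python
import time
from collections import defaultdict, deque


class PerUserLimiter:
    """Allow at most max_calls per user within a rolling window of window_s seconds."""

    def __init__(self, max_calls, window_s, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._calls = defaultdict(deque)  # user_id -> timestamps of recent calls

    def allow(self, user_id):
        """Return True and record the call if the user is under the limit."""
        now = self._clock()
        calls = self._calls[user_id]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_s:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


limiter = PerUserLimiter(max_calls=2, window_s=60.0)
# Third call inside the same window is rejected.
print([limiter.allow("alice") for _ in range(3)])
```

A check like `limiter.allow(user_id)` would sit in front of each LLM call; rejected calls can return an error or a cached answer instead of spending tokens.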
Radar
Nexidion: local AI background worker for private knowledge
Open-sourced after 2 years and 5 architectural rewrites, Nexidion is a self-hosted knowledge vault where a local AI agent continuously processes and indexes your data in the background instead of only retrieving on demand. It makes a different architectural bet from standard RAG: always-on background workers vs. query-time retrieval.
OpenReader: open-source read-along TTS with synchronized highlighting
Self-hostable document reader server with high-quality TTS, word-level synchronized highlighting, and audiobook export for EPUB/PDF/DOCX/MD. Useful pattern for accessibility features in document-heavy agents — the synchronized highlighting approach translates directly to agent annotation UIs.
Convergence Watch
multi-token prediction
TRENDING
8 mentions across r/LocalLLaMA
Seven consecutive days of coverage culminated in today's official llama.cpp master merge. The community immediately flooded the subreddit with shared configs, benchmarks, and hardware-specific results. This is no longer a speculative technique; it's shipped infrastructure. The conversation has shifted from 'does MTP work?' to 'what's the optimal config per GPU setup?'