A 26M-parameter model does tool calling at 6,000 tok/s prefill: agents just got pocketable.
Top Signal
Needle: 26M-Parameter Model Does Full Tool Calling on Budget Phones
new tool
HN Show, r/LocalLLaMA
Cactus Compute open-sourced Needle, a 26M-parameter function-calling model distilled from Gemini. It runs at 6,000 tok/s prefill and 1,200 tok/s decode on consumer devices, including budget phones. The model handles tool-use routing (deciding which function to call and extracting its arguments) without needing a full LLM. This matters because agentic workflows currently require cloud calls or heavy local models just for the tool-dispatch step. Needle lets you run that dispatch layer on-device, with no network round-trip, and call larger models only for the actual reasoning. Immediate use cases: edge agents, IoT automation, and mobile apps that need function calling without API round-trips. The model is released under Apache 2.0 on GitHub. If you're building anything agentic that needs to run on constrained hardware, this is the missing piece between 'toy demo' and 'ships on a phone.'
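To make the tool-dispatch step concrete, here is a minimal sketch of the routing pattern described above. It does not use Needle's actual API: `small_router` is a hypothetical stand-in that fakes the model with keyword rules, and the tool names and the JSON call shape are assumptions chosen for illustration.

```python
import json
import re
from typing import Callable, Dict

# Hypothetical tools; names and signatures are illustrative only.
def set_thermostat(celsius: float) -> str:
    return f"thermostat set to {celsius:.1f} C"

def send_message(to: str, body: str) -> str:
    return f"sent to {to}: {body}"

TOOLS: Dict[str, Callable[..., str]] = {
    "set_thermostat": set_thermostat,
    "send_message": send_message,
}

def small_router(prompt: str) -> str:
    """Stand-in for an on-device function-calling model (assumption).

    A real router would run inference here. This stub uses keyword rules
    so the example runs without any model weights.
    """
    number = re.search(r"-?\d+(?:\.\d+)?", prompt)
    if "thermostat" in prompt.lower() and number:
        call = {"name": "set_thermostat",
                "arguments": {"celsius": float(number.group())}}
    else:
        call = {"name": "none", "arguments": {}}
    return json.dumps(call)

def dispatch(prompt: str) -> str:
    """Route a prompt: run a local tool if one matches, else escalate."""
    call = json.loads(small_router(prompt))
    tool = TOOLS.get(call["name"])
    if tool is None:
        # No tool matched: this is where a larger model would be called.
        return "escalate to larger model"
    return tool(**call["arguments"])
```

The design point is the split: the cheap router decides and extracts arguments locally, and only prompts that match no tool pay for a larger model.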
Fast Signals
Statewright: Visual State Machines That Make AI Agents Reliable
new tool
HN Show
Show HN from a 20-year NVIDIA/AMD veteran. Statewright lets you define agent control flow as visual state machines rather than prompt chains — the agent can only transition between explicitly defined states. Directly addresses the brittle-agents problem. Worth evaluating if you're building multi-step agents and tired of unpredictable branching.
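The core idea, an agent that can only move between explicitly defined states, can be sketched in a few lines. This is not Statewright's API; the state names, the `ALLOWED` table, and the `AgentStateMachine` class are hypothetical and only illustrate the constraint.

```python
from enum import Enum
from typing import Dict, List, Set

class State(Enum):
    PLAN = "plan"
    ACT = "act"
    VERIFY = "verify"
    DONE = "done"

# Explicit transition table (illustrative). Anything not listed is illegal.
ALLOWED: Dict[State, Set[State]] = {
    State.PLAN: {State.ACT},
    State.ACT: {State.VERIFY},
    State.VERIFY: {State.ACT, State.DONE},
    State.DONE: set(),
}

class AgentStateMachine:
    """Tracks the agent's state and rejects undefined transitions."""

    def __init__(self, start: State = State.PLAN) -> None:
        self.state = start
        self.history: List[State] = [start]

    def transition(self, target: State) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {target.name}"
            )
        self.state = target
        self.history.append(target)
```

Whatever the model proposes, a jump such as VERIFY to PLAN fails loudly instead of silently branching, which is the reliability property the blurb describes.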
React Doctor: Lint Layer That Catches Agent-Generated Bad React
new tool
GitHub Trending
From Million.co — a static analysis tool specifically designed to catch the patterns that AI coding agents produce in React code. Trending on GitHub. If you're using Cursor/Claude Code to write React, run this as a post-generation check.
MagicQuant v2.0: Hybrid Mixed GGUF Quants With Learned Configurations
workflow
r/LocalLLaMA
Five months of work on a pipeline that creates per-tensor hybrid GGUF quantization mixes, learning optimal assignments from Unsloth models. Handles architectures like Qwen3.6 27B that have irregular weight distributions. If you're serving quantized models locally and quality at low bit widths matters, this is a meaningful step beyond uniform quantization.
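For intuition on what a per-tensor mix means, here is a toy sketch. It is not MagicQuant's method, which learns its assignments: this swaps in a simple greedy heuristic, and the tensor names, sensitivity scores, and bit budget are made-up inputs for illustration.

```python
from typing import Dict

def assign_bits(
    sizes: Dict[str, int],
    sensitivity: Dict[str, float],
    budget_bits: int,
    low: int = 4,
    high: int = 8,
) -> Dict[str, int]:
    """Greedy per-tensor bit assignment under a total bit budget.

    Every tensor starts at `low` bits. Tensors are then upgraded to `high`
    bits in order of decreasing sensitivity while the budget allows.
    """
    assignment = {name: low for name in sizes}
    used = sum(count * low for count in sizes.values())
    if used > budget_bits:
        raise ValueError("budget too small even at the lowest bit width")
    for name in sorted(sizes, key=lambda n: sensitivity[n], reverse=True):
        extra = sizes[name] * (high - low)
        if used + extra <= budget_bits:
            assignment[name] = high
            used += extra
    return assignment

# Hypothetical inputs: parameter counts and sensitivity scores per tensor.
sizes = {"attn": 100, "ffn": 300, "embed": 200}
sensitivity = {"attn": 0.9, "ffn": 0.2, "embed": 0.5}
mix = assign_bits(sizes, sensitivity, budget_bits=3600)
```

With these inputs the budget covers upgrading `attn` and `embed` but not `ffn`, so the result is a hybrid mix, not a uniform bit width.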
Luce DFlash+PFlash: 2.2x Decode, 3x Prefill on AMD Strix Halo
emerging signal
r/LocalLLaMA
Luce inference engine now delivers 2.23x decode and 3.05x prefill speedup over llama.cpp HIP for Qwen3.6-27B on AMD Strix Halo integrated GPUs. AMD local inference is becoming genuinely competitive. If you've been GPU-shopping for a local dev rig, the AMD calculus just changed.
Attention Drift: Why Speculative Decoding Breaks on Long Context
research to practice
r/LocalLLaMA
New research identifies 'attention drift': drafter models in speculative decoding degrade under template perturbation and long-context inputs as their attention patterns shift away from the training distribution. Practical implication: if you're using speculative decoding in production, benchmark at your actual context lengths and prompt formats, not just on short-context evals.
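Why a drop in drafter quality matters can be seen from the standard back-of-the-envelope model of speculative decoding. The sketch below assumes each drafted token is accepted independently with probability `alpha` and that `gamma` tokens are drafted per step; that independence assumption is a simplification and is not taken from the attention-drift paper.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model pass.

    Assumes each of the `gamma` drafted tokens is accepted independently
    with probability `alpha`; the target model always contributes one
    token itself. Closed form: (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft accepted: gamma drafted tokens plus one from the target.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
```

Under this model, with four drafted tokens, an acceptance rate falling from 0.8 to 0.6 cuts the expected yield from about 3.36 to about 2.31 tokens per pass, which is why measuring acceptance at your real context lengths matters.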
ggerganov Adds llama-eval: Official Eval Harness for llama.cpp
new tool
r/LocalLLaMA
Georgi Gerganov submitted a PR adding a built-in evaluation tool directly to llama.cpp. Previously you needed external harnesses (lm-eval, etc.) to benchmark GGUF models. Having eval as a first-class llama.cpp example lowers the bar for comparing quants and models locally.
Radar
OpenHuman: Private Personal AI Super-Intelligence
Trending on GitHub — a self-hosted personal AI system positioning itself as a private alternative to cloud assistants. Worth watching if the architecture enables meaningful local-first agent workflows beyond chat.
llama.cpp CPU-MoE Prompt Processing Speedup
A PR drastically improves prompt processing for partially GPU-offloaded MoE models using --n-cpu-moe. If you run large MoE models with split GPU/CPU inference, this could eliminate your prefill bottleneck.
Convergence Watch
multi-token prediction
TRENDING
3 mentions across r/LocalLLaMA
Day 7 of sustained MTP activity. Today's signal: Gemma 4 MTP vs DFlash H100 benchmarks, Qwen3.6 27B MTP at 256k context on RTX 5090, and the attention drift paper revealing failure modes in speculative decoding. MTP is moving from the 'neat speedup' phase to the 'understand the tradeoffs' phase.
local agentic coding
TRENDING
3 mentions across r/LocalLLaMA
Seventh consecutive day. Today: a practical single-GPU setup guide (RTX 5080 + Qwen3.6-35B-A3B), a 'build Claude Code from scratch' tutorial, and a high-VRAM model selection discussion. The community has moved past 'can we do it' to 'what's the optimal stack.' Statewright and Needle both feed this trend from the tooling side.
needle
2 mentions across HN Show, r/LocalLLaMA
New entry. A 26M function-calling model appearing simultaneously on HN Show and r/LocalLLaMA suggests genuine community interest in ultra-small agentic models. Watch for adoption signals this week.