Your coding agent scaffold matters more than your model — a 2.4x gap proves it.
Top Signal
Scaffold design doubles local coding agent performance at fixed model size
workflow
r/LocalLLaMA
A developer held Qwen 9B weights constant and swapped only the agent scaffold, jumping from 19.1% on Aider's benchmark to 45.6% with a scaffold designed for small local models. The key insight: most coding agent frameworks (Aider, Claude Code, opencode) are optimized for frontier-class models and waste context on patterns small models can't exploit. The adapted scaffold uses shorter system prompts, tighter tool schemas, and single-step edits instead of multi-turn planning. This is directly actionable: if you're running local models as coding agents, the framework choice may matter more than jumping to the next model size, so test your current scaffold against alternatives before upgrading hardware. The author's approach (minimal system prompt, explicit edit format, no chain-of-thought forcing) is a template worth stealing; a sketch follows below.
Read more →
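A minimal sketch of the scaffold pattern described above, not the author's actual code: the prompt wording and the search/replace edit format are hypothetical stand-ins for "minimal system prompt, explicit edit format, single-step edits."

```python
# Minimal scaffold sketch for a small local model: one short system prompt,
# one rigid edit format, one edit per turn. All names here are hypothetical.
import re

SYSTEM_PROMPT = (
    "You are a code editor. Reply with exactly one edit in this format:\n"
    "FILE: <path>\n"
    "<<<<<<< OLD\n<exact lines to replace>\n=======\n<replacement lines>\n>>>>>>> NEW\n"
    "No explanation, no markdown fences."
)

EDIT_RE = re.compile(
    r"FILE: (?P<path>.+?)\n<<<<<<< OLD\n(?P<old>.*?)\n=======\n"
    r"(?P<new>.*?)\n>>>>>>> NEW",
    re.DOTALL,
)

def apply_edit(reply: str, files: dict[str, str]) -> bool:
    """Apply one search/replace edit; reject anything that doesn't parse."""
    m = EDIT_RE.search(reply)
    if m is None or m["path"] not in files or m["old"] not in files[m["path"]]:
        return False  # one retry with the parse error beats multi-turn planning
    files[m["path"]] = files[m["path"]].replace(m["old"], m["new"], 1)
    return True
```

The point of the rigid format is that a 9B-class model spends its limited context on the edit itself, not on tool-call ceremony it can't reliably follow.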
Fast Signals
llama.cpp merges speculative checkpointing for faster local inference
platform change
r/LocalLLaMA
Speculative checkpointing landed in llama.cpp mainline, enabling the runtime to checkpoint draft model state and resume on rejection instead of recomputing. Speedups are task-dependent — repetitive code generation and structured output see the biggest gains. Combined with ngram-map speculative decoding, users report up to 665% speed increases on code edit tasks.
Link →
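A toy illustration of the ngram-map idea, not llama.cpp's implementation: draft tokens by replaying n-grams already seen in the context, then let the target model verify the whole draft in one forward pass and keep the longest agreeing prefix.

```python
# Toy ngram-map speculative drafting. Repetitive text (code edits, boilerplate)
# keeps hitting the table, which is why those workloads see the biggest gains.

def build_ngram_map(tokens: list[int], n: int = 3) -> dict[tuple, int]:
    """Map each n-token window to the token that most recently followed it."""
    table: dict[tuple, int] = {}
    for i in range(len(tokens) - n):
        table[tuple(tokens[i : i + n])] = tokens[i + n]
    return table

def draft(tokens: list[int], table: dict[tuple, int],
          n: int = 3, k: int = 8) -> list[int]:
    """Propose up to k tokens by walking the table from the current suffix."""
    out, ctx = [], list(tokens)
    for _ in range(k):
        nxt = table.get(tuple(ctx[-n:]))
        if nxt is None:
            break
        out.append(nxt)
        ctx.append(nxt)
    return out

tokens = [1, 2, 3, 4, 1, 2, 3]                 # toy context with a repeated trigram
print(draft(tokens, build_ngram_map(tokens)))  # [4, 1, 2, 3, 4, 1, 2, 3]
```

Checkpointing draft state, per the merged change, means a rejected draft no longer forces recomputing the already-accepted prefix.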
TRELLIS.2 image-to-3D ported to Apple Silicon — no NVIDIA needed
new tool
HN Front Page, r/LocalLLaMA
A developer replaced all five CUDA-only compiled extensions in Microsoft's TRELLIS.2 (a 4B-parameter image-to-3D model) with pure PyTorch implementations that run on the MPS backend. 3D asset generation from a single image now runs on any M-series Mac. If you're building spatial computing or AR/VR features, this removes the NVIDIA dependency entirely for prototyping 3D pipelines.
Link →
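A generic sketch of the pattern such a port relies on, using real torch APIs; the fused-kernel example is hypothetical, not taken from TRELLIS.2.

```python
# Backend-agnostic device selection: prefer Apple's Metal (MPS) backend,
# fall back to CUDA, then CPU.
import torch

def pick_device() -> torch.device:
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device()

# A CUDA-only compiled extension becomes an equivalent tensor-op expression
# that runs on any backend, e.g. a hypothetical fused "scale + clamp" kernel:
def scale_clamp(x: torch.Tensor, s: float, lo: float, hi: float) -> torch.Tensor:
    return torch.clamp(x * s, min=lo, max=hi)

y = scale_clamp(torch.randn(4, 4, device=device), 2.0, -1.0, 1.0)
```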
Gemma 4 E2B runs prompt-to-Excalidraw diagrams entirely in-browser
workflow
HN Front Page
A Show HN demo runs Google's Gemma 4 E2B model (3.1GB) via WebAssembly to generate Excalidraw diagrams from natural-language prompts: no server, no API key. This is the clearest demo yet of useful in-browser LLM inference for structured output. If you're building tools that need diagram generation without backend costs, this pattern is ready to copy.
Link →
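For context on why this is copyable: the output target is plain JSON. A sketch of the kind of Excalidraw scene document such a tool emits; the element schema is abbreviated here (the real format carries more required fields).

```python
# Build a minimal Excalidraw-style scene as JSON. Field set abbreviated;
# validate model output with json.loads before handing it to the canvas.
import json

def rect(x: int, y: int, w: int, h: int, elem_id: str) -> dict:
    return {"id": elem_id, "type": "rectangle",
            "x": x, "y": y, "width": w, "height": h}

scene = {
    "type": "excalidraw",
    "version": 2,
    "elements": [rect(0, 0, 160, 80, "a"), rect(240, 0, 160, 80, "b")],
}
print(json.dumps(scene, indent=2))
```

Structured targets like this are where small in-browser models tend to hold up: the schema constrains the generation, so format compliance matters more than fluency.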
Matt Webb argues headless services are the next AI integration pattern
emerging signal
Simon Willison
Simon Willison highlights Matt Webb's thesis: as personal AI agents become the primary interface, services that strip their UI layer and expose pure API/headless backends will win. The implication for builders is to design services API-first, on the assumption that an AI agent, not a human, is the primary consumer. This reframes how you think about onboarding, auth flows, and documentation.
Link →
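A hypothetical sketch of what agent-first design can look like in practice: every response, including errors, carries machine-readable next actions so an agent can recover without reading human-oriented docs. All field names are invented.

```python
# Invented example of an agent-first response shape: errors tell the caller
# exactly how to retry, instead of pointing at prose documentation.
import json

def create_booking(payload: dict) -> dict:
    if "date" not in payload:
        return {
            "ok": False,
            "error": {"code": "missing_field", "field": "date"},
            "retry": {"method": "POST", "path": "/bookings",
                      "required_fields": ["date", "party_size"]},
        }
    return {"ok": True, "booking_id": "hypothetical-123"}

print(json.dumps(create_booking({"party_size": 4}), indent=2))
```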
Vercel confirms April 2026 security breach — check your deploy secrets
platform change
HN Front Page
Vercel disclosed a security incident after hackers claimed to be selling stolen data. If you deploy on Vercel, rotate your environment variables and API keys now, and review your CI/CD pipeline for any secrets that passed through Vercel's infrastructure.
Link →
RAM shortage may last years — plan local inference hardware accordingly
platform change
HN Front Page, r/LocalLLaMA
The Verge reports the global RAM shortage driven by AI server demand could persist for years. SK hynix is ramping 192GB LPDDR5X SOCAMM2 modules for NVIDIA AI servers, but consumer supply remains constrained. If you're planning local inference hardware purchases, buy sooner rather than later — prices and availability are unlikely to improve near-term.
Link →
Radar
RuView: WiFi signals → human pose estimation, no camera
Uses commodity WiFi signals for real-time DensePose estimation, vital sign monitoring, and presence detection without any video. If you're building ambient computing or privacy-preserving spatial awareness, this is a fundamentally different sensing approach worth bookmarking.
Link →
LLM Neuroanatomy III: models think in geometry, not language
Research post presenting evidence that LLM internal representations are geometric structures rather than linguistic ones. Early but potentially important for anyone building interpretability tools or doing activation steering — suggests spatial probing may be more fruitful than token-level analysis.
Link →
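If you want to try the spatial-probing angle, the standard first tool is a linear probe. A minimal sketch on synthetic data, assuming you can extract layer activations; the shapes and the labeled property are invented.

```python
# Fit a linear probe: if a property is decodable by a single linear map,
# it lives as a direction in activation space, i.e. geometrically.
import torch

hidden, n = 64, 512
acts = torch.randn(n, hidden)                       # stand-in for real activations
labels = (acts @ torch.randn(hidden) > 0).float()   # synthetic binary property

probe = torch.nn.Linear(hidden, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(acts).squeeze(-1), labels)
    loss.backward()
    opt.step()

acc = ((probe(acts).squeeze(-1) > 0).float() == labels).float().mean()
print(f"probe accuracy: {acc:.2f}")  # ~1.0 here, since labels are linear by construction
```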
Convergence Watch
qwen 3.6
TRENDING
10 mentions across r/LocalLLaMA
Qwen 3.6-35B-A3B is dominating local LLM discussion for a fourth straight day. Users are now stress-testing it as a daily coding driver, comparing it to Opus 4.7, and pushing creative generation (isometric rooms, browser OS). The model is crossing from 'impressive benchmark' to 'people actually switching their workflows.' The ik_llama inference backend is emerging as the preferred runner.
claude code ecosystem tooling
TRENDING
2 mentions across r/LocalLLaMA, Simon Willison
Sixth consecutive day of elevated Claude Code tooling activity. Today's signal includes users getting banned and seeking local replacements, plus Simon Willison's token counter comparing model costs. The ecosystem is maturing but lock-in risk is becoming a real concern — have a local fallback plan.
local inference optimization
4 mentions across r/LocalLLaMA
Speculative checkpointing, speculative decoding with ngram maps, and ik_llama speed improvements are converging into a theme: local inference is getting dramatically faster at the runtime level. Combined with Qwen 3.6's efficiency, the gap between local and cloud coding agents is narrowing faster than expected.