Two independent teams ship agent code-search tools on the same day, both aimed at the token-waste problem.
Top Signal
Semble + CodeGraph: two tools solve agent code search in one day
new tool
HN Show, GitHub Trending
Two independent projects launched today targeting the same problem: AI agents wasting tokens on brute-force code navigation. Semble (Show HN, MinishLab) replaces grep in agent loops with semantic vector search. It claims 98% fewer tokens while still finding the right code, and works with any LLM agent on any codebase. CodeGraph (GitHub Trending) pre-indexes your repo into a knowledge graph built specifically for Claude Code, claiming 94% fewer tool calls and 77% faster exploration while running 100% locally. Both are open source and installable today. The convergence is the signal: agents falling back to grep-everything when lost is a real, widely felt pain point. Try Semble first if you need something language-agnostic or use non-Claude agents. Use CodeGraph if you're Claude Code-heavy and want a pre-built index baked into the workflow. Bookmark both.
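To make the pattern concrete, here is a minimal sketch of ranked chunk retrieval, the general idea behind replacing grep-everything in an agent loop. It is not Semble's or CodeGraph's implementation: the term-count vectors are a toy stand-in for a learned embedding model, and `tokenize`, `embed`, `cosine`, `search`, and the sample chunks are hypothetical names invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase alphabetic runs; snake_case names split on the underscore.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def embed(text):
    # Toy stand-in for an embedding model: a sparse term-count vector.
    return Counter(tokenize(text))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, chunks, top_k=1):
    # Rank chunks by similarity and return only the best few,
    # so the agent reads a handful of snippets instead of whole files.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:top_k]

# Hypothetical code chunks; a real index would be built from the repo.
CHUNKS = [
    {"path": "auth/session.py",
     "text": "def refresh_session_token(user): renew the login session token"},
    {"path": "billing/invoice.py",
     "text": "def render_invoice_pdf(order): build the invoice document"},
    {"path": "utils/strings.py",
     "text": "def slugify(title): lowercase and hyphenate a title"},
]

best = search("where is the session token refreshed", CHUNKS)
print(best[0]["path"])
```

The token savings in this pattern come from `top_k`: the agent's context receives only the highest-ranked chunks rather than every grep hit.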
Fast Signals
DeepSeek V4 1M context: production limits mapped at 520k tokens
research to practice
r/LocalLLaMA
Field test against three real codebases (45k, 180k, 520k tokens) shows DeepSeek V4 starting to degrade on dependency tracing and cross-file refactors at the 520k-token full-stack scale, well short of its claimed 1M-token context. Useful ground truth before architecting anything around million-token context windows.
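A rough pre-flight check along these lines can tell you which side of that 520k mark a codebase falls on. This is a sketch under stated assumptions: the 4-characters-per-token ratio is a common rule of thumb rather than DeepSeek's tokenizer, the file suffixes are arbitrary, and `estimate_repo_tokens` and `within_budget` are hypothetical helpers.

```python
import tempfile
from pathlib import Path

# Assumption: rough average; real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4

def estimate_repo_tokens(root, suffixes=(".py", ".ts", ".go")):
    # Sum the characters of matching source files and convert to tokens.
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total_chars += len(path.read_text(encoding="utf-8", errors="ignore"))
    return total_chars // CHARS_PER_TOKEN

def within_budget(tokens, budget=520_000):
    # 520k is where the field test reported degradation; treat it as a
    # rough guide, not a hard limit.
    return tokens < budget

# Demo on a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as repo:
    Path(repo, "app.py").write_text("x = 1\n" * 1000, encoding="utf-8")
    tokens = estimate_repo_tokens(repo)
    print(tokens, within_budget(tokens))
```

For a real decision, swap the heuristic for the target model's own tokenizer, since code tends to tokenize differently from prose.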
vLLM vs SGLang vs llama.cpp benchmarked on mixed Blackwell/Ada cluster
research to practice
r/LocalLLaMA
7-GPU heterogeneous cluster (RTX PRO 6000 96GB + PRO 5000 48GB + consumer cards) stress-tested across inference engines with long-context pipeline parallelism. Rare real-world data for teams running mixed-hardware setups rather than homogeneous cloud instances — read before choosing your inference stack.
Abliterlitics: open forensics toolkit for abliteration methods
new tool
r/LocalLLaMA
85 GPU-hours comparing 5 abliteration techniques on the same Qwen3.6-27B base — with weight forensics, benchmark deltas, and safety regression analysis. If you're deploying on post-trained or modified open weights, this gives you a structured audit framework rather than vibes-based trust.
ROCm 7.13 ships native Strix Halo optimizations
platform change
r/LocalLLaMA
AMD's ROCm 7.13 nightly targets the Ryzen AI Max 300 'Strix Halo' APU, the unified-memory chip getting real traction for local inference. ROCprof Trace Decoder also goes open source. If Strix Halo is on your hardware roadmap, this is the release to evaluate for production LLM workloads, bearing in mind that it is still a nightly build.
MiroThinker-1.7: open-weight deep research agent on Qwen3 MoE
new tool
r/LocalLLaMA
Open-weight deep research agent (30B total, 3B active for mini) with weights on HuggingFace — a local alternative to closed deep research products. Weights are out, community is benchmarking on consumer hardware now. Worth watching if you need autonomous multi-step research without API costs.
Apple Silicon local inference costs more per token than OpenRouter
research to practice
HN Front Page, r/LocalLLaMA
An energy cost analysis surfaced in two independent feeds: on electricity alone, M-series local inference costs more per token than OpenRouter at current rates. The math could invert if provider prices are propped up by VC subsidies that end, or if you amortize hardware across non-AI workloads. Concrete calibration data for your build-vs-buy decision.
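The underlying arithmetic is easy to rerun with your own numbers. The sketch below shows the shape of the calculation only: the wattage, throughput, electricity rate, and API price are placeholder values, not figures from the analysis, and `electricity_cost_per_million_tokens` is a hypothetical helper.

```python
def electricity_cost_per_million_tokens(power_watts, tokens_per_second, price_per_kwh):
    # Time to generate one million tokens at the given throughput.
    seconds = 1_000_000 / tokens_per_second
    # Energy drawn over that time, converted from watt-seconds to kWh.
    kwh = (power_watts / 1000) * (seconds / 3600)
    return kwh * price_per_kwh

# Placeholder inputs, NOT measurements from the analysis.
local = electricity_cost_per_million_tokens(
    power_watts=60,        # assumed sustained draw during inference
    tokens_per_second=20,  # assumed generation throughput
    price_per_kwh=0.30,    # assumed electricity rate
)
api_price = 0.20  # hypothetical API price per million output tokens

print(f"local electricity: ${local:.2f} per 1M tokens")
print("local is cheaper" if local < api_price else "API is cheaper on electricity alone")
```

Throughput dominates the result: doubling tokens per second halves the electricity cost per token, which is why the comparison depends heavily on model size and hardware.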
Radar
Grafting vision onto text-only local models
Community exploration of whether Mistral's separately-released vision encoder can be grafted onto base text models that shipped without multimodal support — llama.cpp's architecture already separates vision weights. Early-stage but could unlock vision in models where it was stripped at release.
KV sharing, mHC, compressed attention: architecture survey
Community writeup mapping three emerging attention architecture techniques for KV cache reduction. Relevant for tracking which next-gen model families will have structural long-context efficiency advantages — not hype, actual architectural choices with measurable VRAM impact.
Qwen3.6-35B-A3B replacing Cursor for 500k-LOC enterprise dev
First credible field report from a developer who does 60 hours/week of professional work on a 500-700k LOC enterprise suite and has switched entirely from Cursor to local Qwen3.6-35B-A3B. The ceiling for local MoE models as daily coding tools just moved up.
Convergence Watch
agent code search efficiency
2 mentions across HN Show, GitHub Trending
Semble and CodeGraph launched independently on the same day, both targeting token waste in agent code navigation. Two teams reaching the same product conclusion simultaneously is a strong signal the problem is widely felt and not yet solved by existing tooling. Watch for a third entrant to confirm the category.
multi-token prediction
TRENDING
8 mentions across r/LocalLLaMA
Community MTP benchmarking continues across hardware (RTX 5090, 7900 XTX, a 6GB VRAM laptop) and Qwen3.6 variants. A clear verdict is emerging: MTP gains are VRAM-dependent. High-end GPUs see meaningful speedups, while memory-constrained hardware sees flat or negative results. Now day 7 of consistent coverage.
apple silicon vs openrouter cost
2 mentions across HN Front Page, r/LocalLLaMA
Same energy-cost analysis surfaced independently on two feeds. The conclusion — local inference costs more at current electricity rates — is actionable for anyone debating local vs. API inference. The counterargument (investor-subsidized cloud pricing) is also in the thread.
STALE: Latent Space newest item is >48h old