OpenAI kills SWE-bench while an agent kills a production database — guardrails matter more than evals.
Top Signal
OpenAI Retires SWE-bench Verified, Says It No Longer Measures Frontier Coding
platform change
HN Front Page, r/LocalLLaMA
OpenAI published a post explaining why they will no longer evaluate against SWE-bench Verified, calling it saturated and no longer discriminative for frontier coding agents. r/LocalLLaMA independently confirmed the sentiment, calling it 'benchmaxxed.' This matters because SWE-bench has been the de facto standard for evaluating coding agents — every startup pitch deck cites it. If you're choosing between coding models based on SWE-bench scores, stop. The benchmark's signal-to-noise ratio has collapsed as providers optimize specifically for it. For your own eval pipeline: build task-specific benchmarks from your actual codebase bugs, use held-out private test cases, and weight real-world coding session success rates over synthetic benchmarks. The era of a single coding benchmark as a reliable proxy is over.
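The eval-pipeline advice above can be sketched in a few lines. This is a minimal, hypothetical harness (all names invented here, not from OpenAI or SWE-bench): each task pairs a real bug from your codebase with held-out tests the model never sees in its prompt.

```python
# Minimal sketch of a private eval harness (all names hypothetical):
# each task pairs a real bug with hidden tests kept out of the prompt.
from dataclasses import dataclass, field

@dataclass
class EvalTask:
    bug_id: str
    prompt: str                                        # bug description shown to the model
    hidden_tests: list = field(default_factory=list)   # callables the model never sees

def run_eval(tasks, generate_fix):
    """Score a model (generate_fix: prompt -> patched function) on held-out tests."""
    passed = 0
    for task in tasks:
        fix = generate_fix(task.prompt)
        if all(test(fix) for test in task.hidden_tests):
            passed += 1
    return passed / len(tasks)

# Toy usage: the "bug" is an off-by-one; hidden tests check the real contract.
task = EvalTask(
    bug_id="BUG-1",
    prompt="last_index returns len(xs), should return len(xs) - 1",
    hidden_tests=[lambda f: f([1, 2, 3]) == 2, lambda f: f([7]) == 0],
)
score = run_eval([task], lambda _prompt: (lambda xs: len(xs) - 1))
print(score)  # 1.0 when the stand-in 'model' produces a correct fix
```

The point is the structure, not the scale: because the tests stay private, providers can't benchmaxx against them.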
Read more →
Fast Signals
AI Agent Deletes Production Database — 646-Point HN Incident Report
emerging signal
HN Front Page
A real-world incident where an AI coding agent dropped a production database went viral, complete with the agent's own 'confession' of what went wrong. This is the most concrete argument yet for sandboxing agents away from production credentials — if you're running agents with DB write access, today is the day to fix that.
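One cheap guardrail worth sketching: route every agent-issued query through a wrapper that only permits read statements. This is a toy illustration using SQLite (the prefix check is not a real SQL parser; production setups should use database-level read-only roles or replicas instead):

```python
# Toy guardrail sketch: agents get a wrapped cursor that rejects anything
# but read statements, so a "DROP TABLE" never reaches the database.
# NOTE: a prefix allowlist is illustrative only, not a substitute for
# DB-level read-only credentials.
import sqlite3

READ_ONLY_PREFIXES = ("SELECT", "EXPLAIN", "WITH", "PRAGMA")

class ReadOnlyCursor:
    def __init__(self, conn):
        self._cur = conn.cursor()

    def execute(self, sql, params=()):
        if not sql.lstrip().upper().startswith(READ_ONLY_PREFIXES):
            raise PermissionError(f"agent blocked from running: {sql!r}")
        return self._cur.execute(sql, params)

# Demo against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada')")

agent_db = ReadOnlyCursor(conn)
rows = agent_db.execute("SELECT name FROM users").fetchall()
print(rows)  # [('ada',)]

try:
    agent_db.execute("DROP TABLE users")
except PermissionError as err:
    print("blocked:", err)
```

The deeper fix is the one the incident argues for: the agent's credentials should make destructive statements impossible, not merely filtered.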
Link →
Tool Calls May Degrade LLM Reasoning: A Kimi K2.5 Data Point
research to practice
r/LocalLLaMA
A LocalLLaMA user tested Kimi K2.5 on the same prompt in three modes: no tools, XML pseudo-tools, and real tool calls. Real tool-calling mode produced measurably worse reasoning on a simple logic question. If you're building agent scaffolds, consider whether every step actually needs tool-calling format or whether some reasoning should happen in pure text mode.
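If you want to reproduce this kind of test on your own stack, the harness is trivial. A hypothetical sketch (the `call_model` stub below stands in for a real LLM client and hard-codes the reported pattern; it is not real model behavior):

```python
# Hypothetical mode-comparison harness: run the same question through
# each prompting mode and compare accuracy against a known answer.
def call_model(prompt, mode):
    # Stub standing in for a real LLM call. It mimics the reported
    # observation (tool-call mode answers worse); swap in your client.
    return "42" if mode in ("plain", "xml_pseudo_tools") else "41"

def compare_modes(question, expected, modes):
    """Return {mode: answered_correctly} for one question."""
    return {mode: call_model(question, mode) == expected for mode in modes}

results = compare_modes(
    "What is 6 * 7?", "42",
    modes=["plain", "xml_pseudo_tools", "real_tool_calls"],
)
print(results)
```

Run the same loop over a few dozen questions before trusting any per-mode difference; a single prompt, as in the original post, is suggestive but noisy.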
Link →
AMD Hipfire: New Inference Engine Purpose-Built for AMD GPUs
new tool
r/LocalLLaMA
Hipfire is a brand-new inference engine targeting all AMD GPUs (not just latest gen) with a custom mq4 quantization method. If you've been frustrated by AMD's second-class citizen status in the llama.cpp ecosystem, this is worth watching — early reports suggest competitive performance with CUDA-native engines.
Link →
GitNexus: Browser-Only Code Knowledge Graph with Graph RAG Agent
new tool
GitHub Trending
Drop a GitHub repo or ZIP file into GitNexus and get an interactive knowledge graph with a built-in Graph RAG agent — entirely client-side, no server needed. Useful for onboarding onto unfamiliar codebases or building code exploration tools without shipping data to third parties.
Link →
YourMemory: RAG with Biological Decay Cuts Context Noise
new tool
HN Show
A Show HN project implements Ebbinghaus-style memory decay for RAG — memories fade unless reinforced, hitting 52% recall by design. Addresses the real problem of agent context windows choking on stale rules and abandoned fixes. Worth studying if you're building long-running agents that accumulate memory.
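The mechanism is simple enough to sketch. A minimal, hypothetical version of Ebbinghaus-style decay (parameter names and constants are mine, not YourMemory's): retention falls as `exp(-age / strength)`, and each reinforcement resets the clock and slows future decay.

```python
# Sketch of Ebbinghaus-style memory decay for retrieval (all parameters
# hypothetical): retention = exp(-age / strength); reinforcement resets
# the age and raises strength (the spacing effect).
import math

class DecayingMemory:
    def __init__(self, text, strength=1.0):
        self.text = text
        self.strength = strength   # larger = slower forgetting
        self.age = 0.0             # time since last reinforcement

    def retention(self):
        return math.exp(-self.age / self.strength)

    def tick(self, dt=1.0):
        self.age += dt

    def reinforce(self):
        self.age = 0.0
        self.strength *= 1.5       # each recall slows future decay

def retrieve(memories, threshold=0.5):
    """Only memories above the retention threshold enter the context."""
    return [m.text for m in memories if m.retention() >= threshold]

stale = DecayingMemory("old workaround for a bug fixed months ago")
fresh = DecayingMemory("current style rule: use type hints")
for _ in range(3):
    stale.tick(); fresh.tick()
    fresh.reinforce()              # the live rule keeps being used
print(retrieve([stale, fresh]))    # only the reinforced memory survives
```

The stale workaround decays below threshold and drops out of context; the rule the agent keeps using stays, which is exactly the noise-cutting behavior the project claims.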
Link →
Qwen3.6-27B Hits 100 tps on Single RTX 5090 via vLLM 0.19
workflow
r/LocalLLaMA
Community optimizations pushed Qwen3.6-27B INT4 to 100+ tokens/sec with 256k context on a single RTX 5090 using vLLM 0.19. That's a 25% speedup over last week's 80 tps recipe. If you're running local coding agents, the specific vLLM config and NVFP4+MTP quantization are now the known-best setup.
Link →
Radar
Beads: Dolt-Powered Graph Issue Tracker for Agents
A distributed graph-based issue tracker built on Dolt (versioned MySQL) designed for AI agents to read and write. Novel approach to giving agents structured project context without dumping everything into the prompt.
Link →
VRAM.cpp: llama-fit-params in Your Browser
Browser-based VRAM calculator that runs llama.cpp's fit-params logic client-side via WASM. Finally answers 'can my GPU run this model?' with actual precision instead of rough estimates.
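For intuition on what such a calculator computes, here is a back-of-envelope version. This is a deliberate simplification of llama.cpp's fit-params logic, with rough hypothetical numbers (real models with GQA have a much smaller effective KV dimension):

```python
# Back-of-envelope VRAM estimate (simplified vs. llama.cpp fit-params;
# all numbers rough): model weights + KV cache + fixed overhead.
def estimate_vram_gb(n_params_b, bytes_per_weight, n_layers, kv_dim,
                     context_len, kv_bytes=2, overhead_gb=1.0):
    weights = n_params_b * 1e9 * bytes_per_weight
    # KV cache: 2 tensors (K and V) per layer, per token in context.
    # kv_dim ~ n_kv_heads * head_dim; GQA models shrink this a lot.
    kv_cache = 2 * n_layers * kv_dim * context_len * kv_bytes
    return (weights + kv_cache) / 1e9 + overhead_gb

# Example: a 7B model at 4-bit (~0.5 bytes/weight), 32 layers,
# a full 4096-dim KV (no GQA), 8k context, fp16 cache.
gb = estimate_vram_gb(7, 0.5, 32, 4096, 8192, kv_bytes=2)
print(round(gb, 1))  # -> 8.8
```

Even this crude formula shows why long contexts blow past "weights fit, so it runs" reasoning: at 8k tokens the KV cache here already outweighs the quantized weights.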
Link →
OpenCode Power Pack: Claude Code Skills for OpenCode
Ports Anthropic's official Claude Code plugins to OpenCode, converting commands/ and agents/ formats. If you're running local models through OpenCode instead of paying for Claude Code, this unlocks the same skill ecosystem.
Link →
Convergence Watch
qwen 3.6
TRENDING
12 mentions across r/LocalLLaMA, HN Front Page, GitHub Trending
Seven consecutive days across 3 sources. Community is now optimizing deployment configs (NVFP4, vLLM 0.19) rather than debating quality — the model has crossed from 'interesting' to 'default local choice.' Dense 27B vs MoE 35B-A3B comparison is settling in favor of the dense model for coding.
deepseek v4
TRENDING
5 mentions across r/LocalLLaMA, HN Front Page, GitHub Trending
Fourth consecutive day. Discussion shifting from benchmarks to practical usage — community asking about real coding performance vs K2.6 and GLM 5.1, and noting no GGUFs yet for V4-Flash. KV cache architecture (10x reduction at 1M context) remains the genuinely novel technical contribution.
claude code ecosystem
TRENDING
4 mentions across GitHub Trending, r/LocalLLaMA
Skills directories, template tools, and now cross-agent skill porting (OpenCode Power Pack) show the Claude Code plugin ecosystem maturing into a portable standard. The fact that skills are being reverse-engineered for competing tools signals real adoption.
coding agent safety
3 mentions across HN Front Page, r/LocalLLaMA, HN Show
Production DB deletion incident, npm supply-chain concerns for agent harnesses, and K8s secret isolation (Kloak) all surfaced independently today. The 'agents need guardrails' thread is intensifying from theoretical to incident-driven.