GLM-5.2 GGUFs land as OSS models officially overtake proprietary traffic on OpenRouter — the open frontier just got real.
Top Signal
GLM-5.2 goes fully practical: GGUFs, deploy configs, and a distillation pipeline forming
platform change
Simon Willison, HN Front Page, r/LocalLLaMA
Z.ai's 753B MIT-licensed GLM-5.2 crossed from 'impressive API model' to 'something builders can actually run' today. Unsloth uploaded GGUFs from 2-bit (238GB) through full precision. Community-shared HGX-H200/SGLang docker configs are circulating. Simon Willison published the first comprehensive technical writeup, confirming it sits #3 overall on Artificial Analysis (behind only o3 and Gemini Ultra) and performs especially well on long-horizon tasks and creative writing, areas where Claude has dominated. The Z.ai founder teased a GLM-fable-class open model before year-end. Most actionable angle today: a thread is already organizing to produce a large distillation dataset (700k–1M examples) from GLM-5.2 outputs, which would let you fine-tune Qwen3.x or a similar small model for a fraction of what it would cost to generate that training data yourself. Try the API via z.ai or HuggingFace: this is the highest-signal open model for production evaluation right now.
Read more →
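A quick sanity check on the 2-bit figure above: 753B weights at exactly 2 bits each would come to well under 238GB, so the file must average more than 2 bits per weight. The sketch below works that out, assuming "238GB" means decimal gigabytes and "753B" is the total parameter count; both function names are illustrative. A plausible reason for the gap is that low-bit quantization schemes generally keep some tensors at higher precision and store scale factors alongside the weights, though the item itself does not say so.

```python
# Back-of-envelope check on the quantized file size quoted in the item.
PARAMS = 753e9      # parameter count from the item (753B)
FILE_BYTES = 238e9  # 2-bit GGUF size from the item, assuming decimal GB


def effective_bits_per_weight(file_bytes: float, params: float) -> float:
    """Average bits stored per parameter, ignoring file metadata."""
    return file_bytes * 8 / params


def naive_size_gb(params: float, bits: float) -> float:
    """Size in decimal GB if every weight used exactly `bits` bits."""
    return params * bits / 8 / 1e9


bpw = effective_bits_per_weight(FILE_BYTES, PARAMS)
print(f"effective bits/weight: {bpw:.2f}")
print(f"naive 2-bit size: {naive_size_gb(PARAMS, 2):.0f} GB")
```

The same two functions give a rough lower bound on memory for any other quant level: plug in the bit width, then add headroom for the KV cache and activations.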
Fast Signals
codebase-memory-mcp: persistent knowledge graph for your whole repo, 99% fewer tokens
new tool
GitHub Trending
Single static binary MCP server that indexes any codebase into a persistent knowledge graph in milliseconds — 158 languages, sub-ms queries, zero dependencies. The 99% token reduction claim is the headline; if it holds under real agent workloads, this changes how you architect code-aware agents that need full-repo context without blowing context windows.
Link →
rtk + headroom + caveman: three tools to cut LLM token costs on real workloads
workflow
r/LocalLLaMA
Post benchmarking three obscure token optimization tools — rtk (request token kitting), headroom (context window management), and caveman (prompt compression) — against actual production workloads, not synthetic tests. Savings are measured and concrete. Bookmark if you're spending >$500/month on tokens.
Link →
OSS models officially overtake proprietary traffic on OpenRouter
emerging signal
r/LocalLLaMA
Three months of OpenRouter request-volume data show that open-source models have decisively overtaken proprietary ones, a first. Builders are routing production traffic to Qwen, Llama, and GLM variants at scale. If you're still defaulting all calls to GPT-4o or Claude, your cost-per-token math needs revisiting.
Link →
RLM: plug-and-play inference library for Recursive Language Models
new tool
GitHub Trending
Library for models that iteratively refine output through recursion rather than a single autoregressive pass, with sandbox support and a drop-in inference API. Early-stage, but the architecture is distinct from chain-of-thought; worth watching if you're building iterative reasoning pipelines.
Link →
Liquid AI drops LFM2.5-Embedding-350M and ColBERT-350M simultaneously
new tool
r/LocalLLaMA
Liquid AI released both a dense embedding model and a late-interaction ColBERT re-ranker at the same 350M scale in a single drop. Having retrieval and re-ranking from the same architecture family eliminates distribution mismatch — slot both into your RAG stack and benchmark retrieval quality today.
Link →
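For readers who have not wired up a two-stage stack before, here is a minimal sketch of the retrieve-then-rerank pattern the Liquid AI item describes. It does not load the LFM2.5 models or use any Liquid AI API: hand-written 2-D vectors stand in for token embeddings, mean pooling stands in for the dense embedder, and all function and document names are hypothetical. The second stage uses MaxSim, the late-interaction score ColBERT-style models are built around.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n else v


def mean_pool(token_vecs):
    """Collapse token vectors into one unit vector (stand-in for a dense embedder)."""
    dim = len(token_vecs[0])
    return normalize([sum(t[i] for t in token_vecs) / len(token_vecs) for i in range(dim)])


def maxsim(query_tokens, doc_tokens):
    """Late interaction: each query token takes its best-matching doc token; sum the maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def retrieve_then_rerank(query_tokens, docs, top_k=2):
    q_dense = mean_pool(query_tokens)
    # Stage 1: cheap dense retrieval over every document.
    candidates = sorted(
        docs, key=lambda name: dot(q_dense, mean_pool(docs[name])), reverse=True
    )[:top_k]
    # Stage 2: token-level MaxSim re-ranking over the shortlist only.
    return sorted(candidates, key=lambda name: maxsim(query_tokens, docs[name]), reverse=True)


s = math.sqrt(0.5)
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_blend": [[s, s]],                               # close to the pooled query
    "doc_exact": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],  # matches each query token exactly
    "doc_opposite": [[-1.0, 0.0], [0.0, -1.0]],
}
ranked = retrieve_then_rerank(query, docs, top_k=2)
print(ranked)
```

In this toy data the dense stage ranks `doc_blend` first, and MaxSim then promotes `doc_exact` because it matches each query token individually. That reordering is the effect to look for when benchmarking a real retriever and re-ranker pair.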
Poolside Laguna-M.1 (225B-A23B MoE) drops on HuggingFace
new tool
r/LocalLLaMA
Poolside, the code-focused AI lab that's been building in stealth, quietly released Laguna-M.1, a 225B-parameter MoE model with 23B active parameters, on HuggingFace. No benchmark sheet yet, but their code-first training focus and MoE architecture make it worth running against GLM-5.2 on your specific coding tasks before the community benchmarks land.
Link →
10,000 GitHub repos actively distributing Trojan malware — ongoing campaign
platform change
HN Front Page
A researcher documented an active campaign: 10k+ GitHub repositories spreading Trojan malware, targeting developers who install dependencies directly from GitHub URLs. If your CI/CD pipeline pulls from GitHub source rather than verified package registries, audit your lockfiles and dependency sources now.
Link →
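As a starting point for the audit suggested above, the sketch below flags lines in a requirements file or lockfile that pull straight from GitHub instead of a registry. The patterns are illustrative and not exhaustive, the `example-org` names are hypothetical, and a hit only means "review this source", not "this is malicious".

```python
import re

# Common ways a dependency can point at GitHub source instead of a registry.
PATTERNS = [
    r"git\+(?:https|ssh)://(?:git@)?github\.com/",    # pip / npm git URLs
    r"\bgithub:[\w.-]+/[\w.-]+",                      # npm "github:owner/repo" shorthand
    r"https://github\.com/[\w.-]+/[\w.-]+/archive/",  # source archive downloads
    r"https://codeload\.github\.com/",                # tarball downloads
]
GITHUB_SOURCE = re.compile("|".join(PATTERNS), re.IGNORECASE)


def find_github_sources(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every line that references a GitHub source."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if GITHUB_SOURCE.search(line):
            hits.append((lineno, line.strip()))
    return hits


sample = """requests==2.32.3
git+https://github.com/example-org/example-lib.git@main
"left-pad": "github:example-org/left-pad#v1.0.0"
"resolved": "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz"
"resolved": "git+ssh://git@github.com/example-org/tool.git#abc123"
"""
hits = find_github_sources(sample)
for lineno, line in hits:
    print(f"line {lineno}: {line}")
```

To run it on a real project, read each lockfile with `pathlib.Path(...).read_text()` and pass the contents to `find_github_sources`. Flagged entries that must stay on GitHub are worth pinning to a full commit hash rather than a branch or tag.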
Radar
DiffusionGemma 26B hits 475 tok/s on a consumer 4090
Diffusion-based language model architecture (non-autoregressive) running Gemma 26B at 475 tok/s on a 4090 via vLLM with AWQ-INT4. If diffusion LLMs can reach this speed while matching autoregressive output quality, the latency assumptions underlying most production inference stacks need revisiting.
Link →
OpenMontage: open-source agentic video production, 52 tools
First open-source system claiming full agentic video production: 12 pipelines, 52 tools, 500+ agent skills. The kind of agent orchestration framework that typically gets productized before it goes open — worth watching if you build creative content pipelines or are evaluating multi-tool agent architectures.
Link →
Physical gas sensor modulates LLM sampler params live
A builder wired a real gas sensor to dynamically adjust temperature/top_p/top_k for a local model in real time — smoke literally shifts sampling distributions live. Points at an underexplored technique: using physical-world sensor data as dynamic sampling constraints in embodied or edge AI systems.
Link →
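The sensor item does not say how readings were mapped to sampler settings, so the following is only a guess at the shape of such a mapping: clamp the reading into a 0-to-1 range, then interpolate each parameter between a conservative and a loose value. The thresholds (`clean`, `smoky`), the parameter ranges, and all names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class SamplerParams:
    temperature: float
    top_p: float
    top_k: int


def lerp(lo: float, hi: float, t: float) -> float:
    """Linear interpolation between lo and hi for t in [0, 1]."""
    return lo + (hi - lo) * t


def params_from_sensor(reading: float, clean: float = 100.0, smoky: float = 800.0) -> SamplerParams:
    """Map a raw sensor reading to sampler settings (hypothetical ranges)."""
    # Normalize and clamp so out-of-range readings cannot produce invalid params.
    t = min(1.0, max(0.0, (reading - clean) / (smoky - clean)))
    return SamplerParams(
        temperature=round(lerp(0.2, 1.5, t), 3),
        top_p=round(lerp(0.8, 1.0, t), 3),
        top_k=int(round(lerp(20, 200, t))),
    )


for reading in (50, 450, 800):
    print(reading, params_from_sensor(reading))
```

The returned values would then be passed as per-request sampling options to whatever local inference server is in use. Clamping matters here because a noisy sensor spike should not push `top_p` above 1.0 or `temperature` below zero.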
TRELLIS.2 image-to-3D now runs natively on Apple Silicon via MLX
One of the best open image-to-3D models now has native MLX support, eliminating the CUDA requirement for Mac users. If you're building product visualization or 3D asset pipelines, the hardware barrier just dropped significantly — test on M-series hardware without spinning up a GPU instance.
Link →
Convergence Watch
glm-5.2
TRENDING
15 mentions across Simon Willison, HN Front Page, r/LocalLLaMA
Day 6 of coverage, but today crossed a practical threshold: GGUFs are live, deploy configs are shared publicly, Simon Willison's comprehensive writeup is published, and the community is self-organizing to produce distillation datasets. The model has moved from 'impressive benchmark result' to 'infrastructure decision'. Builders should evaluate it for production before the distillation wave hits and the small-model landscape shifts.