A quantization breakthrough fixes what everyone assumed was just rounding error.
Top Signal
Wasserstein metric fixes tensor drift in quantized GGUF models
research to practice
r/LocalLLaMA
A LocalLLaMA contributor found that the Wasserstein metric (W1) detects ssm_conv1d tensor drift in quantized GGUF models far more reliably than the standard Kullback-Leibler divergence, which makes the drift correctable during quantization. The technique surfaces numerical instabilities that accumulate as weights are quantized: errors previously dismissed as acceptable rounding noise that actually degrade model output quality. The first proof-of-concept ships as an uncensored Qwen 3.6-35B-A3B GGUF quant. If you're shipping quantized models in production or running local inference, this matters: your Q4/Q5 quants may have systematic errors that are fixable. Watch for this to get integrated into llama.cpp's quantization pipeline. Bookmark the Wasserstein metric Wikipedia page and the released GGUF as a reference implementation. A minimal sketch of the W1-versus-KL comparison follows the link.
Read more →
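To make the comparison concrete, here is a minimal sketch (not the contributor's code; the drift magnitude and the tensor stand-in are assumptions) of why W1 reacts to a small systematic shift that KL divergence barely registers:

    # Sketch: how W1 and KL respond to a small systematic shift of the kind
    # quantization drift can introduce. All values are illustrative only.
    import numpy as np
    from scipy.stats import wasserstein_distance, entropy

    rng = np.random.default_rng(0)
    original = rng.normal(0.0, 1.0, 100_000)  # stand-in for an fp16 ssm_conv1d tensor
    drifted = original + 0.02                 # simulated systematic quantization drift

    # W1 works directly on samples and tracks the shift almost exactly (~0.02).
    w1 = wasserstein_distance(original, drifted)

    # KL needs binned densities; a small uniform shift barely moves it (~2e-4).
    bins = np.linspace(-6.0, 6.0, 513)
    p, _ = np.histogram(original, bins=bins, density=True)
    q, _ = np.histogram(drifted, bins=bins, density=True)
    kl = entropy(p + 1e-12, q + 1e-12)

    print(f"W1 = {w1:.4f}   KL = {kl:.6f}")

The point is the shape of the response, not the exact numbers: W1 grows linearly with the drift, while KL stays near zero until the distributions diverge substantially, which is how small-but-systematic quantization errors get dismissed as noise.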
Fast Signals
Qwen 3.6-35B-A3B passes real coding tests that Qwen 3.5-27B failed
platform change
r/LocalLLaMA, GitHub Trending, HN Front Page
Independent testing confirms Qwen 3.6's MoE architecture delivers genuine coding capability gains over its predecessor, not just benchmark inflation. Users report running it with an 8-bit quant and 64k context on M5 Max MacBooks at usable speeds, and multiple vLLM + Docker deployment guides are now circulating (a minimal serving smoke test follows the link). The model has dominated LocalLLaMA for 48 hours straight; this is the local coding agent to evaluate right now.
Link →
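vLLM exposes an OpenAI-compatible endpoint, so a local deployment can be smoke-tested with the standard OpenAI client. A minimal example, assuming a vLLM server on localhost:8000 and a placeholder model ID (check the actual Hugging Face repo name before copying):

    # Query a locally served Qwen 3.6 through vLLM's OpenAI-compatible API.
    # The model ID below is a placeholder, not a confirmed repository name.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    resp = client.chat.completions.create(
        model="Qwen/Qwen3.6-35B-A3B",  # placeholder; use the ID you actually serve
        messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
        max_tokens=512,
    )
    print(resp.choices[0].message.content)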
DeepGEMM unifies FP8, FP4, and BF16 kernels with fused MoE support
new tool
GitHub Trending
DeepSeek's DeepGEMM library is trending on GitHub — it consolidates the key compute primitives for modern LLMs (FP8/FP4/BF16 GEMMs plus fused MoE) into a single clean kernel library. If you're optimizing inference infrastructure or building custom serving stacks, this replaces cobbling together separate kernel implementations.
Link →
Cloudflare's Unweight hits LocalLLaMA — 22% lossless LLM compression
new tool
r/LocalLLaMA, GitHub Trending
Cloudflare's open-source Unweight tool, which compresses LLM weights 15-22% without quality loss, is now getting traction on r/LocalLLaMA after its initial release. Day 2 of cross-source spread suggests this will become a standard step in local deployment pipelines. Worth testing on your most-used models.
Link →
Thunderbolt: Thunderbird ships vendor-neutral local AI chat
new tool
GitHub Trending
Mozilla's Thunderbird team released Thunderbolt, an open-source local AI chat app with the tagline 'choose your models, own your data, eliminate vendor lock-in.' Trending on GitHub. Interesting as a signal that established open-source projects are building AI features as standalone, model-agnostic tools rather than bolting on API calls.
Link →
Claude Opus 4.7 token usage diverges sharply from 4.6 — 510 HN points
platform change
HN Front Page
An anonymous token comparison leaderboard shows Opus 4.7 consuming significantly different token counts from 4.6 on equivalent tasks. At 510 HN points and 498 comments, this is the most-discussed AI cost topic this week. If you're budgeting API costs around Claude, audit your token usage after upgrading; a minimal audit sketch follows the link.
Link →
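A minimal sketch of such an audit using the Anthropic SDK's usage fields; the model IDs below are placeholders, not confirmed identifiers:

    # Compare token usage for the same prompt across two Claude versions.
    # Model IDs are placeholders; substitute the identifiers you actually use.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = [{"role": "user", "content": "Refactor this function for readability: ..."}]

    for model in ("claude-opus-4-6", "claude-opus-4-7"):  # placeholder model IDs
        resp = client.messages.create(model=model, max_tokens=1024, messages=prompt)
        usage = resp.usage
        print(f"{model}: input={usage.input_tokens} output={usage.output_tokens}")

Run the same loop over a sample of your real prompts before committing to a budget for the new version.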
Radar
Prefill-as-a-Service: cross-datacenter KV cache sharing
Research proposal for sharing KV caches across datacenters to amortize prefill costs for next-gen models. Early-stage but architecturally significant — if this works, it changes the economics of serving long-context models at scale.
Link →
MDV: Markdown superset with live data and dashboards
Show HN project (111 points) extending Markdown with live data binding, dashboard layouts, and slide generation. Worth watching as a potential agent output format — structured enough for data, readable enough for humans.
Link →
Convergence Watch
qwen 3.6
TRENDING
44 mentions across r/LocalLLaMA, GitHub Trending, HN Front Page
Seven consecutive days of coverage, exploding from 15 to 44 mentions. The community has moved past benchmarks into real deployment — vLLM configs, quantization fixes, and head-to-head coding tests. This is the new default local coding model.
claude code ecosystem tooling
TRENDING
8 mentions across HN Show, GitHub Trending, HN Front Page
Seven consecutive days across 3 sources; mentions doubled today to 8. The Claude Code extension ecosystem is maturing rapidly; expect the tooling layer to stabilize around winners within weeks.
cloudflare unweight
2 mentions across r/LocalLLaMA, GitHub Trending
Day 2 of cross-source spread. Lossless compression at 22% is compelling enough that adoption seems inevitable for local deployment. Watch for llama.cpp integration.
STALE: Latent Space's newest item is >48h old