Eagle3 lands for Qwen, GLM-5.2 goes local and tops GPT-5.5 on agentic evals—local inference is compounding fast.
Top Signal
Eagle3 speculative decoding ships for Qwen 3.5/3.6 in llama.cpp b9723
platform change
r/LocalLLaMA
llama.cpp release b9723 adds Eagle3 speculative decoding for Qwen 3.5 and 3.6 via `--spec-type draft-eagle3`. Eagle3 uses a small draft model to speculatively predict tokens that the full model batch-verifies—typical speedups are 2–3x throughput with no output quality change. This matters now because Qwen 3.6 27B is the most capable locally-runnable model for coding agents, and a separate post today showed 55 tok/s at 262K context on 4x RTX 5060 Ti. Eagle3 should push that meaningfully higher. **Action**: upgrade to llama.cpp b9723, grab a Qwen3.5 or 3.6 27B GGUF plus a matching Eagle3 draft model, and add `--spec-type draft-eagle3` to your llama-server command. This is the fastest free throughput gain available for local inference today.
Read more →
Fast Signals
GLM-5.2 beats GPT-5.5 on agentic eval, now runs locally in llama.cpp
platform change
r/LocalLLaMA, GitHub Trending
GLM-5.2 claimed #1 on Artificial Analysis' AA-Briefcase agentic knowledge-work benchmark—above GPT-5.5. On the same day, GGUF support landed in llama.cpp and Unsloth Studio, making it locally runnable for the first time. REAP50-GGUF quantized variants (182GB range) are already circulating. This is the first open-weights model to surpass a frontier proprietary model on an agentic task benchmark worth trusting.
Link →
Full Deep Research agent open-sourced: training code, data, weights
research to practice
r/LocalLLaMA
A research team trained a Deep Research-style agentic web-research system on 32 H100s and open-sourced the entire pipeline—training code, dataset, and model weights. This is the first time a complete training recipe for frontier-level deep research agents is publicly available, not just inference weights. Starting point for anyone fine-tuning research agent capability on domain-specific corpora.
Link →
MCP's real value: auth isolation outside the agent context window
workflow
Simon Willison
Sean Lynch (via Simon Willison) offers the sharpest framing yet of MCP's architectural value: it keeps auth flows out of the model's context window entirely—and potentially out of the harness. This matters for multi-tenant systems and any agent where credential exposure is a risk. Worth auditing whether your current tool integrations leak auth material into model-visible memory.
Link →
Hyper-Extract: text → knowledge graphs/hypergraphs with one CLI command
new tool
GitHub Trending
New GitHub Trending tool (individual developer, low star count) that converts unstructured text into structured knowledge graphs, hypergraphs, or spatio-temporal extractions using LLMs—one command, no custom prompting scaffolding required. If you're building RAG pipelines or entity extraction steps, this is worth a spike before writing bespoke extraction code.
Link →
Local voice agent floor: Qwen 9B barely holds agentic reasoning, 0.8B collapses
research to practice
r/LocalLLaMA
Controlled experiment stepped Qwen 3.5 from 9B down to 0.8B with identical prompts, tools, and environment for a voice agent. Agentic tool-use degraded non-linearly—9B was the practical floor for coherent multi-step chains; 4B was marginal; 2B and below broke consistently. Concrete data point for hardware budget decisions when speccing local agent deployments.
Link →
4x RTX 5060 Ti P2P: Qwen 3.6 27B FP8 at 262K context, 55 tok/s for ~$1,800
emerging signal
r/LocalLLaMA
Community benchmark: 4x RTX 5060 Ti (16GB each, NVLink P2P), sourced from Facebook Marketplace for ~$1,800 total, runs Qwen3.6-27B-FP8 at 55 tok/s with 262K context and BF16 KV cache. Concrete proof that sub-$2K consumer hardware now delivers long-context production-grade inference speeds. Reference configuration for anyone building a dedicated local inference box.
Link →
Radar
Talos: formal WASM verification for AI-written code (YC W26)
Cajal (YC W26) open-sourced a framework for formally verifying WebAssembly modules in Lean. As AI generates more production code, WASM is a natural verification boundary—compile AI-written code to WASM, verify before deploy. Too early for most builders but the problem it addresses will matter more every quarter.
Link →
LTX-2: open audio-video model with built-in LoRA trainer
Lightricks' LTX-2 landed on GitHub Trending—audio-driven video generation with an included LoRA fine-tuning pipeline. If video generation is on your roadmap, this is the open-source path with fine-tuning capability baked in rather than bolted on.
Link →
Datasette Apps: embed arbitrary HTML apps inside Datasette
New datasette-apps plugin lets you host custom HTML applications inside Datasette instances with per-resource ACL controls. Useful pattern for data-adjacent internal tooling that needs auth without spinning up a separate backend service.
Link →
Convergence Watch
glm-5.2
TRENDING
9 mentions across r/LocalLLaMA, GitHub Trending
Four consecutive days of multi-source coverage, and today marked a qualitative shift: GLM-5.2 both claimed #1 on AA-Briefcase (above GPT-5.5) and became locally runnable in llama.cpp simultaneously. The community distillation pipeline is forming around it (REAP50-GGUF variants appearing). This is the benchmark open-weights model for agentic work right now—if you haven't evaluated it, you're behind.
eagle3 speculative decoding
TRENDING
2 mentions across r/LocalLLaMA
Eagle3 was initially merged for general llama.cpp use (6/13–6/14), and is now rolling out to specific high-demand model families—Qwen 3.5 and 3.6 landed today. The pattern suggests progressive expansion to more model families. Eagle3 is becoming the default throughput layer for local llama.cpp inference; expect it to spread to GLM-5.2 GGUF next.