A 10x prefill breakthrough makes 128K context viable on consumer GPUs.
Top Signal
PFlash delivers 10x prefill speedup over llama.cpp at 128K context
new tool
r/LocalLLaMA
A new inference optimization called PFlash achieves 10x faster prefill than llama.cpp when processing 128K-token contexts on a single RTX 3090. Long-context prefill has been the bottleneck keeping local inference impractical for large documents and codebases, and this directly changes that calculus. The technique appears to be a custom flash-attention implementation optimized specifically for the prefill pass, separate from the decode path. Combined with today's other DFlash news (speculative decoding working on 8GB cards, Gemma 4 DFlash variants shipping), we're seeing a wave of inference-level optimizations that make running frontier-class open models locally increasingly viable. If you're building anything that processes long documents or codebases with local models, benchmark PFlash against your current llama.cpp setup; the gains at high context lengths are dramatic. A baseline prefill-timing sketch follows below.
Read more →
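PFlash's integration details aren't public yet, so the comparison has to start with a baseline. Here is a minimal sketch of timing llama.cpp prefill via the llama-cpp-python bindings, where the model path, offload settings, and synthetic prompt are placeholder assumptions:

```python
# Time llama.cpp prefill on a long context via llama-cpp-python.
# Generating a single token isolates prompt processing from decode.
import time

from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # placeholder: your GGUF model
    n_ctx=131072,               # 128K context window
    n_gpu_layers=-1,            # offload all layers to the GPU
    verbose=False,
)

prompt = "word " * 60000  # long synthetic prompt
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))

start = time.perf_counter()
llm(prompt, max_tokens=1)  # max_tokens=1 keeps decode cost negligible
elapsed = time.perf_counter() - start

print(f"prefill: {n_prompt} tokens in {elapsed:.1f}s "
      f"({n_prompt / elapsed:.0f} tok/s)")
```

Run it at a few context lengths to see how prefill throughput degrades as context grows; that curve is where PFlash's claimed gains would show up.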
Fast Signals
Browserbase ships Claude Agent SDK skills for web browsing
new tool
GitHub Trending
Browserbase released a skills package that gives Claude Code agents full browser automation via the official bb CLI. This is a clean integration point if you're building agents that need to interact with web UIs — scraping, testing, or multi-step browser workflows without rolling your own Playwright harness.
Link →
Pu.sh: full coding agent harness in 400 lines of shell
new tool
HN Show
A Show HN project that implements a complete coding agent in pure shell: no Python, no Node, no SDK dependencies. It grew out of pi-autoresearch experiments, with a self-imposed constraint of maximum portability. Worth studying as a reference for how little infrastructure you actually need to build a functional coding agent; the sketch below shows how small the core loop can be.
Link →
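Pu.sh's shell source is the real reference, but as a rough illustration of the core loop, here is a minimal agent loop in Python against an OpenAI-compatible local endpoint; the URL, model name, task, and step cap are all assumptions for illustration:

```python
# Minimal coding-agent loop (not Pu.sh's actual code): ask the model
# for a shell command, run it, feed the output back, repeat until done.
import subprocess

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

messages = [
    {"role": "system", "content": (
        "You are a coding agent. Reply with exactly one shell command "
        "to run next, or the single word DONE when the task is complete."
    )},
    {"role": "user", "content": "Create hello.py that prints 'hello'."},
]

for _ in range(10):  # hard cap on agent steps
    reply = client.chat.completions.create(
        model="local-model", messages=messages,
    ).choices[0].message.content.strip()
    if reply == "DONE":
        break
    result = subprocess.run(
        reply, shell=True, capture_output=True, text=True, timeout=60,
    )
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": result.stdout + result.stderr})
```

Everything else in a production harness (sandboxing, diff review, retries) is safety and ergonomics layered on top of this loop.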
DFlash speculative decoding runs 35B MoE on an 8GB RTX 2080
workflow
r/LocalLLaMA
A user got DFlash speculative decoding working with Qwen3.5-35B-A3B on just 8GB of VRAM via a pending llama.cpp PR. Meanwhile, a DFlash variant of Gemma 4 31B also dropped on HuggingFace. DFlash is rapidly becoming the go-to technique for squeezing large models onto consumer cards; the sketch below shows the general idea behind speculative decoding.
Link →
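The thread doesn't document DFlash's specific variant, but the general shape of speculative decoding is easy to show in miniature. A greedy-acceptance sketch using Hugging Face transformers, with gpt2 as the draft model and gpt2-large as the target (both purely illustrative stand-ins):

```python
# Greedy speculative decoding: a small draft model proposes k tokens,
# the large target model verifies them in one forward pass, and we keep
# the longest prefix where both agree. This illustrates the general
# technique, not DFlash's specific variant.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
k, rounds = 4, 8  # tokens drafted per round, number of rounds

with torch.no_grad():
    for _ in range(rounds):
        # 1. Draft model proposes k tokens greedily (cheap).
        proposal = draft.generate(
            ids, max_new_tokens=k, do_sample=False,
            pad_token_id=tok.eos_token_id,
        )
        # 2. Target scores the whole proposal in a single forward pass.
        preds = target(proposal).logits.argmax(-1)
        # 3. Accept drafted tokens until the target first disagrees.
        n = ids.shape[1]
        accepted = 0
        for i in range(proposal.shape[1] - n):
            # Target's prediction for position n+i sits at index n+i-1.
            if proposal[0, n + i] == preds[0, n + i - 1]:
                accepted += 1
            else:
                break
        # Keep accepted tokens plus one "free" token from the target.
        ids = torch.cat(
            [proposal[:, : n + accepted],
             preds[:, n + accepted - 1 : n + accepted]],
            dim=-1,
        )

print(tok.decode(ids[0]))
```

Every accepted draft token replaces a full decode step on the large model, which is where the speedup (and the consumer-hardware appeal) comes from.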
GPU cloud compute costs spike — H100s hitting $1K/hour on spot markets
platform change
r/LocalLLaMA
Reports from r/LocalLLaMA show H100, H200, and B200 instances exceeding $1,000/hour on Vast.ai and Mithril for sustained periods, with sub-B200 GPUs increasingly unavailable. If you're running cloud inference or training workloads, audit your compute costs now; the economics of local inference just got comparatively better, as the rough break-even sketch below illustrates.
Link →
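Back-of-the-envelope, the gap is stark. All numbers in this sketch are assumptions for illustration, not figures from the thread, and it deliberately ignores the large throughput difference between an H100 and a consumer card:

```python
# Rough break-even: reported cloud spot rate vs. an owned local GPU.
# Every constant below is an assumption, not a reported figure.
cloud_rate = 1000.0      # $/hour, the reported H100-class spot price
local_gpu_cost = 1500.0  # $, a used RTX 3090 (assumed)
power_draw_kw = 0.35     # kW under load (assumed)
power_price = 0.15       # $/kWh (assumed)

hourly_local = power_draw_kw * power_price  # marginal electricity cost
break_even_hours = local_gpu_cost / (cloud_rate - hourly_local)

print(f"local marginal cost: ${hourly_local:.2f}/hour")
print(f"hardware pays for itself after {break_even_hours * 60:.0f} minutes "
      f"at ${cloud_rate:.0f}/hour cloud rates")
```

At those spot prices the card amortizes in roughly 90 minutes of equivalent billing, even though the two setups are nowhere near equivalent in throughput.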
Intel auto-round: SOTA quantization across CPU/XPU/CUDA with vLLM support
new tool
r/LocalLLaMA
Intel's auto-round quantization algorithm claims SOTA accuracy at low bit-widths and works across CPU, XPU, and CUDA, with native vLLM, SGLang, and Transformers compatibility. If you're deploying quantized models and want a single quantization pipeline that targets multiple backends, it's worth evaluating against GPTQ/AWQ; the sketch below shows the basic flow.
Link →
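A minimal sketch of that pipeline, following auto-round's documented quickstart flow; the model name and bit settings are illustrative, and exact arguments may vary between versions:

```python
# Quantize a model with Intel auto-round, then export it in a format
# that vLLM/Transformers loaders can consume.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # small model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weights with group size 128: common low-bit settings.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()

# "auto_round" is Intel's native export; other export formats target
# existing GPTQ-style loaders.
autoround.save_quantized("./qwen-4bit", format="auto_round")
```

The single-pipeline appeal is that the same quantized artifact is meant to load across the CPU/XPU/CUDA backends listed above.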
Radar
GhostBox: ephemeral dev machines from GH Actions free tier
Wraps GitHub Actions into a CLI that provisions disposable VMs on demand — free cross-platform build/test environments. Clever abuse of Actions as an ephemeral compute layer for agent sandboxing or CI experiments.
Link →
Claude Opus 4.6/4.7 finetuning dataset: 8.7K chats
A synthetic reasoning dataset of 8.7K chats distilled from Claude Opus 4.6/4.7, released on HuggingFace. Useful for anyone finetuning smaller models to mimic frontier reasoning patterns.
Link →
Convergence Watch
qwen 3.6
TRENDING
5 mentions in r/LocalLLaMA
Seven consecutive days across three sources. Today's signal is practical adoption: users are running it as a daily coding driver in VSCode, generating SVGs in closed-loop workflows, and comparing it head-to-head with Gemma 4. The model has crossed from benchmarks into real production use.
dflash
TRENDING
3 mentions in r/LocalLLaMA
Three independent DFlash items today: PFlash (10x prefill), DFlash on 8GB VRAM, and Gemma 4 DFlash weights. This speculative decoding technique is becoming the default optimization path for running large models on consumer hardware. Watch for the pending llama.cpp PR to land.
claude code ecosystem
TRENDING
2 mentions across GitHub Trending, r/LocalLLaMA
Fifth consecutive day. Today's signal: Browserbase ships official Claude Agent SDK browser skills, and a finetuning dataset distilled from Opus 4.6/4.7 drops. The ecosystem continues to expand from tool into platform.