A llama.cpp fork speeds up inference on consumer GPUs by nearly 5x; prompt injection detection goes browser-native.
Top Signal
BeeLlama v0.2.0 hits 164 tok/s on RTX 3090 via DFlash — 4-5x stock speeds
new tool
r/LocalLLaMA
BeeLlama.cpp, an under-the-radar llama.cpp fork, ships v0.2.0 with a major DFlash update: Qwen3.6 27B reaches 164 tok/s (4.4x speedup) and Gemma 4 31B hits 177.8 tok/s (4.93x) on a single RTX 3090, with prompt processing speed near baseline. DFlash is a specialized decode-path optimization baked into the fork's architecture. The benchmarks are concrete, though not yet independently reproduced; the hardware is consumer-grade; and the fork is a near drop-in for existing llama.cpp workflows. Forks with this profile tend to reach mainstream tooling 3-6 months after first sighting. If you're self-hosting inference on consumer NVIDIA GPUs, benchmark this against your current stack before your next hardware decision. The repo is at github.com/Anbeeld/beellama.cpp. Start with the Q4 quant of your target model and compare tok/s directly.
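To make the "compare tok/s directly" step concrete, here is a minimal Python sketch of the arithmetic. It uses only the figures reported above; the function names are illustrative, and the stock-build number in the usage comment is a placeholder you would replace with your own measurement.

```python
# Reported figures from the post: (tok/s with DFlash, claimed speedup).
reported = {
    "Qwen3.6 27B": (164.0, 4.4),
    "Gemma 4 31B": (177.8, 4.93),
}

def implied_baseline(tok_s: float, speedup: float) -> float:
    """Stock tok/s implied by a reported throughput and speedup factor."""
    return tok_s / speedup

def measured_speedup(fork_tok_s: float, stock_tok_s: float) -> float:
    """Speedup from your own side-by-side runs on the same quant and prompt."""
    return fork_tok_s / stock_tok_s

# Baselines implied by the post, rounded to one decimal place.
baselines = {
    name: round(implied_baseline(tok_s, speedup), 1)
    for name, (tok_s, speedup) in reported.items()
}

for name, base in baselines.items():
    print(f"{name}: implied stock baseline of about {base} tok/s")

# Usage with your own numbers (placeholder values, not from the post):
# measured_speedup(fork_tok_s=150.0, stock_tok_s=40.0)
```

If your stock llama.cpp build already lands well above the implied baseline on the same card, expect a smaller speedup than the headline figure.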
Fast Signals
Browser-native prompt injection detector trained on DeepSeek v4 Flash
new tool
r/LocalLLaMA
A dev fine-tuned a prompt injection classifier using ml-intern and DeepSeek v4 Flash; it runs entirely in the browser, so there is no network round trip and no API cost. If you're building anything that pipes user-controlled text into an LLM, this is a deployable guardrail you can drop in today.
notebooklm-py: full programmatic Python API for Google NotebookLM
new tool
GitHub Trending
Unofficial Python package giving agents complete API access to NotebookLM — including capabilities the web UI doesn't expose — plus a Claude Code agentic skill. If you're building research pipelines or knowledge management workflows, this turns NotebookLM into a programmable backend rather than a manual tool.
Understand-Anything: any codebase → interactive, queryable knowledge graph
new tool
GitHub Trending
Open-source tool that ingests code and builds an explorable knowledge graph with Claude Code, Codex, Cursor, Copilot, and Gemini CLI integrations. Directly addresses the agent code-navigation problem at scale. Worth spinning up before a large refactor or onboarding push.
DeepSeek V4 Pro makes 75% price cut permanent after May 31
platform change
HN Front Page
DeepSeek confirmed the deepseek-v4-pro discount becomes permanent at 1/4 of original list price after the promo expires. No action required — just update your cost models if you're using it at scale.
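Updating a cost model for this change is a single multiplication. The sketch below assumes you track spend per million tokens; the list price and token volume in the example are hypothetical placeholders, not DeepSeek's actual pricing.

```python
# Permanent price is 1/4 of the original list price, per the announcement.
DISCOUNT_FACTOR = 0.25

def monthly_cost(tokens_millions: float, list_price_per_million: float) -> float:
    """Projected monthly spend at the discounted rate."""
    return tokens_millions * list_price_per_million * DISCOUNT_FACTOR

# Hypothetical inputs: 500M tokens per month at a $4.00 per million list price.
example = monthly_cost(tokens_millions=500, list_price_per_million=4.00)
print(f"Projected monthly spend: ${example:.2f}")
```

Substitute your real list price and volume; only the 0.25 factor comes from the announcement.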
Community fine-tune adds diarization + timestamps to Cohere Transcribe
new tool
r/LocalLLaMA
Cohere Transcribe is widely considered the strongest open-source STT model but ships without speaker diarization or timestamps. A community fine-tune adds both. Worth testing before paying for Whisper-based proprietary APIs if you're building transcription pipelines.
Kanbots: open-source Kanban desktop that runs parallel agents per card
new tool
HN Front Page
Open-source desktop Kanban app where every card spawns parallel AI agents. 143 HN upvotes and 85 comments signal genuine builder interest. It is early-stage but represents an emerging 'agentic project management' pattern; the architecture is more interesting than any single app.
Radar
dotnet/skills: Microsoft's official agent skills repo for .NET/C#
Microsoft launched an org-maintained GitHub repo of AI agent skills for .NET and C#. Skills-as-packages is becoming a first-class pattern in mainstream ecosystems — worth watching as a signal of where the ecosystem standardizes.
Qwen3.6-35B: 262k context on 8GB VRAM at 30 tok/s
A community Q4 quant achieves 262k context on a single RTX 3070 Ti at 30 tok/s — a hardware budget that previously couldn't handle these context lengths at useful speeds. Raises the floor for edge deployments.
Convergence Watch
qwen3.6
TRENDING
5 mentions across r/LocalLLaMA
Qwen3.6 has appeared across 4+ of the last 7 briefing days. Today's posts cover 262k context on 8GB VRAM, ByteShape quants beating Unsloth IQ by 30%, and 27B pure quants hitting 40 tok/s on 16GB. The community is rapidly mapping the practical deployment envelope — Qwen3.6-35B-A3B is solidifying as the default local coding/agent base for constrained hardware.
beellama.cpp
1 mention across r/LocalLLaMA
First sighting. The 4-5x throughput claims on concrete consumer hardware are notable — if reproducible across more configurations, this fork could become a preferred inference backend for NVIDIA consumer GPU deployments. Watch for community reproduction reports over the next week.