Nvidia ships an official Rust-to-CUDA compiler; TanStack npm packages compromised.
Top Signal
CUDA-oxide: Nvidia Ships Official Rust-to-CUDA Compiler
platform change
HN Front Page
Nvidia Labs released CUDA-oxide, a compiler that lets you write CUDA GPU kernels in Rust instead of C/C++. This isn't a community wrapper: it's an official Nvidia project with full toolchain support. For AI builders, this matters because custom inference kernels, quantization routines, and attention implementations can now be written in Rust, with its memory-safety guarantees, while still compiling to PTX, the same intermediate representation that nvcc targets. The immediate unlock: teams already using Rust for their serving infrastructure (axum, tonic) can extend that stack into GPU code without switching to C++. Early benchmarks shared in the HN thread report performance parity with nvcc output. If you maintain custom CUDA kernels for inference or training, evaluate whether migrating to Rust reduces your bug surface. If you're starting new GPU work, it is now a strong candidate for the default over raw CUDA C++.
Read more →
Fast Signals
TanStack NPM Packages Compromised — Check Your Lockfile
platform change
HN Front Page
TanStack Router (and potentially other TanStack packages) was compromised on npm. The HN thread (381 points, active discussion) reports the compromise as confirmed and under triage. If you use any TanStack packages (Router, Query, Table), audit your lockfile immediately and pin to the known-good versions listed in the GitHub issue thread.
Link →
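A minimal sketch for the audit step, assuming an npm `package-lock.json` at lockfileVersion 2 or 3, where installed packages sit under the `"packages"` map (pnpm and Yarn lockfiles use different formats). It only lists which `@tanstack/*` versions you have installed; it does not know which versions are compromised, so compare its output by hand against the versions named in the GitHub issue thread. The function name `tanstack_entries` is hypothetical.

```python
import json
import sys
from pathlib import Path


def tanstack_entries(lock: dict) -> list:
    """List (name, version, install_path) for every @tanstack/* package in an
    npm package-lock.json (lockfileVersion 2 or 3, which use the "packages" map)."""
    found = []
    for install_path, meta in lock.get("packages", {}).items():
        # Keys look like "node_modules/@tanstack/react-router", possibly nested
        # under another package's node_modules; the root project's key is "".
        if "node_modules/" not in install_path:
            continue
        name = install_path.rsplit("node_modules/", 1)[1]
        if name.startswith("@tanstack/"):
            found.append((name, meta.get("version", "?"), install_path))
    return sorted(found)


if __name__ == "__main__":
    lock_path = Path(sys.argv[1] if len(sys.argv) > 1 else "package-lock.json")
    if not lock_path.exists():
        print(f"no lockfile at {lock_path}")
    else:
        for name, version, path in tanstack_entries(json.loads(lock_path.read_text())):
            # Compare each version against the known-good list from the advisory.
            print(f"{name}@{version}  ({path})")
```

Nested entries matter: a compromised version can arrive transitively under another package's `node_modules`, so the sketch reports the full install path rather than just the top-level dependency.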
Every Way Local Models Break JSON — Repair Library from 288 Calls
new tool
r/LocalLLaMA
A developer ran structured output prompts through 288 model calls across Llama 3, Mistral, DeepSeek, Qwen, and others, cataloguing every JSON failure mode. The result is an open-source repair library. If you're doing structured output with local models (especially without grammar-constrained decoding), this is a drop-in safety net.
Link →
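To show what "repair" means in practice, here is a minimal sketch that handles three common failure modes: markdown fences around the JSON, chatty prose before or after the object, and trailing commas. This is not the linked library's API or its catalogue of failure modes; the function name `repair_json` is hypothetical, and the trailing-comma regex is deliberately naive (it can also alter string values that happen to contain `,}` or `,]`).

```python
import json
import re


def repair_json(text: str):
    """Best-effort parse of model output that should be a single JSON object.

    Raises json.JSONDecodeError if the repaired text still doesn't parse.
    """
    # 1. If the model wrapped its answer in a markdown code fence (three
    #    backticks, optionally tagged "json"), keep only the fenced body.
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # 2. Drop prose before the first "{" and after the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start : end + 1]
    # 3. Remove trailing commas such as {"a": 1,} or [1, 2,].
    #    Naive: this regex does not know about string literals.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)
```

A repair pass like this is a fallback; grammar-constrained decoding, where available, avoids most of these failures at generation time.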
TextWeb: Markdown Browser Replaces Vision Models for Agent Browsing
new tool
r/LocalLLaMA
Instead of screenshotting web pages and piping them through vision models, TextWeb renders pages as annotated markdown that LLMs can reason about natively — with full JS execution and interactive element annotation. Given that computer use is 45x more expensive than structured APIs (flagged last week), this is a practical cost-cutting alternative for web-browsing agents.
Link →
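The core idea, page-to-text with numbered handles on interactive elements, can be sketched with Python's standard `html.parser`. This is not TextWeb's implementation: it handles static HTML only (no JavaScript execution), annotates just links, buttons, and inputs, and the `[n]` output format and names like `to_annotated_markdown` are hypothetical choices for illustration.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}


class AnnotatingMarkdown(HTMLParser):
    """Flatten static HTML to text, tagging interactive elements with [n] ids."""

    def __init__(self):
        super().__init__()
        self.out = []        # text fragments in document order
        self.elements = []   # element i has id i + 1 in the output
        self._open = None    # the <a> or <button> currently being read
        self._skip_depth = 0

    def _register(self, info):
        self.elements.append(info)
        return len(self.elements)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in ("a", "button"):
            self._open = {"tag": tag, "href": a.get("href"), "text": []}
        elif tag == "input":
            name = a.get("name") or ""
            n = self._register({"tag": "input", "name": name})
            self.out.append(f" [input {name}] [{n}] ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth -= 1
        elif self._open is not None and tag == self._open["tag"]:
            text = " ".join("".join(self._open["text"]).split())
            info = {"tag": tag, "href": self._open["href"], "text": text}
            n = self._register(info)
            if tag == "a":
                self.out.append(f" [{text}]({info['href']}) [{n}] ")
            else:
                self.out.append(f" [button: {text}] [{n}] ")
            self._open = None

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script/style bodies
        if self._open is not None:
            self._open["text"].append(data)
        else:
            self.out.append(data)


def to_annotated_markdown(html):
    """Return (text for the LLM, list of elements it can refer to by id)."""
    parser = AnnotatingMarkdown()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.out).split())  # collapse whitespace
    return text, parser.elements


if __name__ == "__main__":
    page = '<p>Hello <a href="/docs">Docs</a></p><button>Sign in</button>'
    print(to_annotated_markdown(page)[0])
```

An agent can then answer "click [2]" and the harness maps the id back to `elements[1]`, which is far cheaper in tokens than sending a screenshot to a vision model.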
Shopify's River: Internal Coding Agent That Only Works in Public Slack
workflow
Simon Willison
Tobi Lütke describes Shopify's internal agent River, which refuses DMs and operates entirely in public Slack channels. The design choice forces transparency — every agent action is observable by the team. Worth studying as a deployment pattern if you're building internal coding agents: public-by-default constrains agent behavior and builds team trust simultaneously.
Link →
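The public-only rule is simple to enforce at the routing layer. The post does not describe River's code; this is a hypothetical sketch that assumes Slack Events API message payloads, which carry a `channel_type` of `"channel"` (public), `"group"` (private channel), `"im"` (DM), or `"mpim"` (group DM). The redirect text and the `#agent-requests` channel name are made up.

```python
# Slack message events report where a message was posted via "channel_type".
PUBLIC = "channel"
PRIVATE_OR_DIRECT = {"group", "im", "mpim"}

# Hypothetical reply; "#agent-requests" is an invented channel name.
REDIRECT_TEXT = (
    "I only work in public channels so the whole team can see what I do. "
    "Please ask me in #agent-requests."
)


def route_message(event):
    """Return (action, reply_text) for an incoming message event.

    action is "handle", "redirect", or "ignore"; reply_text is set only for "redirect".
    """
    if event.get("bot_id"):
        return ("ignore", None)  # never react to bot messages, including our own
    channel_type = event.get("channel_type")
    if channel_type == PUBLIC:
        return ("handle", None)  # public channel: act where everyone can observe
    if channel_type in PRIVATE_OR_DIRECT:
        return ("redirect", REDIRECT_TEXT)  # decline and point to a public channel
    return ("ignore", None)
```

Putting the check before any model call means the agent never even sees private requests, so the transparency guarantee does not depend on prompt instructions.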
Prompt Caching for RL Training Delivers 7.5x Speedup
research to practice
r/LocalLLaMA
A new technique applies prompt caching to reinforcement learning training loops, yielding 7.5x speedups on long-prompt/short-response workloads. If you're doing RLHF or GRPO-style training where the prompt is mostly static across rollouts, this directly reduces your compute bill.
Link →
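Why the gain concentrates on long-prompt/short-response workloads is visible from a back-of-envelope token count. This is not the cited technique's method and does not reproduce its 7.5x figure: it assumes a GRPO-style group of `rollouts` completions sharing one prompt, counts tokens only, and ignores that attention cost per token still grows with context length.

```python
def tokens_processed(prompt_len, response_len, rollouts, cached):
    """Tokens run through the model for one prompt's group of rollouts
    (simplified model: counts tokens, ignores per-token attention cost)."""
    if cached:
        # Encode the shared prompt once; every rollout reuses its KV cache.
        return prompt_len + rollouts * response_len
    # Without caching, every rollout re-encodes the full prompt.
    return rollouts * (prompt_len + response_len)


def estimated_speedup(prompt_len, response_len, rollouts):
    uncached = tokens_processed(prompt_len, response_len, rollouts, cached=False)
    cached = tokens_processed(prompt_len, response_len, rollouts, cached=True)
    return uncached / cached


if __name__ == "__main__":
    # Hypothetical group: 4,000-token prompt, 200-token responses, 8 rollouts.
    print(estimated_speedup(4000, 200, 8))  # 6.0
```

As the prompt grows relative to the responses, the ratio approaches the rollout count; with short prompts and long responses it falls toward 1, which is why the reported speedup is specific to long-prompt/short-response workloads.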
oMLX: Menu-Bar Inference Server with SSD Caching for Apple Silicon
new tool
GitHub Trending
A new inference server purpose-built for Apple Silicon, with continuous batching and an SSD-backed KV cache, managed from the macOS menu bar. Trending on GitHub. If you're running local models on a Mac for development, it is positioned as a more production-grade alternative to Ollama, with better memory management for long contexts.
Link →
Radar
Nemotron-3-Super: 500k Context on 48GB at 21 tok/s
A 64B-A12B MoE model tuned for coding handles 500k context on a single 48GB card at 21 tok/s. Worth watching if you need very long context for local agentic coding without multi-GPU setups.
Link →
Gemma 4 on WebGPU Controlling a Robot via WebSerial
Gemma 4 running fully offline in-browser via Transformers.js on WebGPU, controlling a Reachy Mini robot over WebSerial. Demonstrates the viability of browser-native LLM inference for edge robotics without any server.
Link →
LLM in the Shebang Line of a Script
You can put an LLM CLI tool in a script's shebang line, making the script itself the prompt. Niche but elegant pattern for self-documenting AI-powered shell scripts.
Link →
Convergence Watch
multi-token prediction
TRENDING
3 mentions on r/LocalLLaMA
MTP continues its 7-day streak across sources. Today: Unsloth adds MTP training support, llama.cpp b9109 ships preemptive MTP+mmproj fix. The stack is maturing from experimental to default — if you're fine-tuning or deploying local models, MTP support is becoming table stakes.
local agentic coding
TRENDING
3 mentions across HN Show, GitHub Trending, Simon Willison
Seven consecutive days across 3+ sources. Today's signals: adamsreview (multi-agent PR reviews for Claude Code), Shopify's River pattern, everything-claude-code harness. The pattern is shifting from 'can local models code?' to 'how do we orchestrate and review agent-written code?' — the tooling layer is where the action is now.