A $150 FPGA runs Qwen3-30B at 18 tokens/sec — inference hardware gets weird.
Top Signal
Hummingbird+ paper: $150 FPGAs run 30B MoE models at 18 t/s
research to practice
r/LocalLLaMA
A new paper on Hummingbird+ demonstrates Qwen3-30B-A3B Q4 generating at 18 tokens/sec on low-cost FPGAs with 24GB of memory, at an expected mass-production cost of $150. This matters because it opens a third hardware path for local inference beyond GPUs and NPUs, one that's an order of magnitude cheaper per unit. FPGAs have historically been too hard to program for ML workloads, but MoE architectures with small active parameter counts change the math: you only need to feed 3B active params through the accelerator per token. The paper is published through ACM, not a startup pitch, and the benchmarks are concrete. Bookmark this if you're planning any kind of fleet, edge, or embedded inference deployment.
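Rough arithmetic on why $150 hardware can keep up; a back-of-envelope sketch where the ~4.5-bit effective Q4 figure and the one-read-per-token assumption are ours, not the paper's:

```python
# Why 3B active params make cheap hardware plausible.
# Assumptions (not from the paper): Q4 averages ~4.5 bits/param
# including scales, and each decode step reads every active param once.
active_params = 3e9        # Qwen3-30B-A3B: ~3B active per token
bits_per_param = 4.5       # rough Q4 effective width
tokens_per_sec = 18        # reported generation speed

bytes_per_token = active_params * bits_per_param / 8
bandwidth_gbs = bytes_per_token * tokens_per_sec / 1e9
print(f"~{bytes_per_token / 1e9:.2f} GB/token -> ~{bandwidth_gbs:.0f} GB/s sustained")
# ~1.69 GB/token -> ~30 GB/s: commodity-DRAM territory, not HBM.
```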
Read more →
Fast Signals
AMD Strix Halo refresh ships with 192GB unified memory
platform change
r/LocalLLaMA
AMD's updated Strix Halo APU doubles unified memory to 192GB, enough to hold an unquantized 70B model or a quantized 128B model on a single chip. If you've been waiting for a consumer-tier box that can host Mistral Medium 3.5 or full Qwen3.6-72B locally, this is the inflection point. One caveat: separate benchmarks show Mistral Medium 3.5 128B is still painfully slow on the current Strix Halo (2 hours for 48k context); the memory is there, but bandwidth remains the bottleneck.
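Quick capacity math behind those fit claims; bit-widths are our assumptions, and KV cache plus runtime overhead are ignored:

```python
# Weight footprint only; KV cache and runtime overhead not counted.
def weight_gb(params_billions: float, bits_per_param: float) -> float:
    return params_billions * bits_per_param / 8  # 1e9 params * bits -> GB

print(f"70B  @ fp16: {weight_gb(70, 16):.0f} GB")    # 140 GB, fits in 192GB
print(f"128B @ ~Q4 : {weight_gb(128, 4.5):.0f} GB")  # ~72 GB, fits easily
print(f"128B @ fp16: {weight_gb(128, 16):.0f} GB")   # 256 GB, does not fit
```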
Link →
Apple SHARP 3D Gaussian splatting runs in-browser via ONNX Runtime
new tool
HN Front Page
A developer ported Apple's SHARP model (single-image to 3D Gaussian splat) to run entirely in the browser using ONNX Runtime Web — no server, no Python pipeline. This is a clean demo of the pattern: take a heavy PyTorch research model, export to ONNX, deploy client-side. If you're building any vision or 3D feature, this shows the ONNX-in-browser path is production-viable for non-trivial models.
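The server-side half of that pattern is a single export call. A minimal sketch with a stand-in model; SHARP's real architecture, input shapes, and opset requirements will differ:

```python
# Stand-in for a heavy research model; names and shapes are placeholders.
import torch
import torch.nn as nn

class TinyVisionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

model = TinyVisionModel().eval()
dummy = torch.randn(1, 3, 512, 512)             # example image input
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["image"], output_names=["out"],
    opset_version=17,
)
# Client side, onnxruntime-web loads the artifact with
# await ort.InferenceSession.create("model.onnx"); no Python needed.
```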
Link →
Voice Agents from Scratch: full local pipeline tutorial, no API keys
workflow
r/LocalLLaMA
A chapter-by-chapter repo walks through the complete real-time voice-agent pipeline: microphone capture → Whisper STT → local GGUF LLM → Kokoro TTS → speaker output. Everything runs locally with no API keys. If you've wanted to prototype a voice agent without cloud dependencies, this is the most complete open tutorial available right now.
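The shape of one conversational turn, sketched with stand-in libraries (faster-whisper and llama-cpp-python here; the repo's actual chapter code may use a different stack):

```python
# One voice-agent turn, shape only. Library choices are ours,
# not necessarily the tutorial's.
from faster_whisper import WhisperModel
from llama_cpp import Llama

stt = WhisperModel("base.en", device="cpu")        # Whisper STT
llm = Llama(model_path="model.gguf", n_ctx=4096)   # any local GGUF model

def voice_turn(wav_path: str) -> str:
    # 1. Speech -> text
    segments, _ = stt.transcribe(wav_path)
    user_text = " ".join(seg.text for seg in segments)
    # 2. Text -> reply from the local model
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": user_text}]
    )
    reply = out["choices"][0]["message"]["content"]
    # 3. Reply -> speech: feed `reply` to Kokoro TTS here, then play
    #    the returned samples through the speaker (e.g. sounddevice).
    return reply
```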
Link →
Intel and AMD unveil x86 AI Compute Extensions (ACE) joint spec
platform change
r/LocalLLaMA
Intel and AMD jointly announced ACE, a new x86 instruction set extension for CPU-based AI inference, developed under the x86 Ecosystem Advisory Group to prevent ISA fragmentation. This is a long-term signal: if ACE gets broad adoption, CPU-only inference becomes meaningfully faster without requiring GPUs. Watch for llama.cpp and ONNX Runtime to add ACE backends.
Link →
Qwen3-TTS ported to OpenVINO for Intel hardware inference
new tool
r/LocalLLaMA
A developer shipped a from-scratch OpenVINO implementation of Qwen3-TTS, enabling text-to-speech on Intel CPUs and XPUs without PyTorch. Originally merged into OpenArc in March, now released as standalone code. If you're deploying TTS on Intel hardware or building voice features without NVIDIA, this fills a real gap.
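For orientation, the generic OpenVINO loading pattern looks like this; the file name, input name, and shape below are placeholders, not the port's actual interface:

```python
# Generic OpenVINO inference pattern on Intel hardware.
import numpy as np
import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU'] on an Intel box

# Compile an exported IR (or ONNX) model for the CPU plugin.
compiled = core.compile_model("qwen3_tts.xml", "CPU")

token_ids = np.zeros((1, 32), dtype=np.int64)   # placeholder input
result = compiled({"input_ids": token_ids})     # keyed by input name
```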
Link →
Radar
Gemma 4 E2B produces clean JSON on 8GB Android
A developer reports that Gemma 4 E2B (2.4B params) running on a OnePlus phone with 8GB of RAM produces reliable structured JSON output, a key capability for on-device agentic apps. If you're building mobile features that need local structured extraction, this model punches above its weight.
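To see what "reliable structured JSON" means in practice, here's the schema-constrained pattern sketched with llama-cpp-python on desktop; the model path and schema are our placeholders, and the Android post uses an on-device runtime rather than this library:

```python
# Schema-constrained generation; desktop stand-in for the on-device case.
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e2b-q4.gguf", n_ctx=2048)
out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Extract name and city: 'Ana lives in Lisbon.'"}],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"name": {"type": "string"},
                           "city": {"type": "string"}},
            "required": ["name", "city"],
        },
    },
)
print(out["choices"][0]["message"]["content"])  # -> {"name": "Ana", ...}
```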
Link →
TALOS-V2: full GPT on FPGA at 50,000 tps
A student project runs Karpathy's MicroGPT (4,192 params) at 50K tokens/sec on an FPGA. It's toy-scale, but taken together with Hummingbird+ above, it shows FPGA inference drawing real attention from multiple independent teams. Worth watching as a hardware-diversification signal.
Link →
GPT 5.5 CoT leak shows compressed 'caveman' reasoning
A user captured GPT 5.5's chain-of-thought in Codex and found it written in abbreviated, compressed language, matching a technique r/LocalLLaMA proposed 5 months ago to cut reasoning-token costs. If you're building reasoning scaffolds, compressed internal monologue now looks like a validated production pattern.
Link →
Convergence Watch
qwen 3.6
TRENDING
8 mentions across r/LocalLLaMA, GitHub Trending
Seventh consecutive day dominating discussion. Today's signal: 27B vs 35B MoE debate crystallizing (35B preferred for multi-stage pipelines), fine-tunes emerging (Assistant_Pepe_32B), and function-calling benchmarks comparing Qwen to GLM and DeepSeek. The ecosystem around these models is maturing fast.
fpga inference
2 mentions across r/LocalLLaMA
Two independent FPGA inference projects surfaced the same day: Hummingbird+ running production-scale MoE models and TALOS-V2 as a from-scratch educational implementation. Not yet a trend, but the simultaneous appearance suggests growing interest in non-GPU inference hardware.
amd local inference
TRENDING
3 mentions across r/LocalLLaMA
Strix Halo 192GB refresh, Mistral Medium 3.5 benchmark data on current Halo, and x86 ACE extensions all point to AMD's growing relevance for local inference. The memory ceiling keeps rising while bandwidth remains the constraint; important context for anyone planning an AMD-based inference rig.
local agentic coding
TRENDING
3 mentions across r/LocalLLaMA
Strong sentiment shift: multiple posts from developers who previously dismissed local models now report competitive results with cloud coding agents. Taken with yesterday's 4-source convergence, this suggests the gap between local and cloud agentic coding is closing faster than expected.