Apple's AI stack is now Gemini underneath — CoreAI ships as the new builder surface.
Top Signal
Apple pivots AI stack to Gemini, ships CoreAI framework for developers
platform change
HN Front Page
Apple announced its new Apple Intelligence architecture runs on Google Gemini models under the hood, with a new CoreAI framework exposed to app developers. Three separate HN articles today confirmed the shift: the MacRumors architecture piece, the updated apple.com/apple-intelligence page, and the developer.apple.com/documentation/coreai/ reference. For builders: if you ship iOS/macOS apps, the model powering on-device AI just became Gemini — which means capabilities, tokenization behavior, and API surface all change. CoreAI is the new abstraction layer. Skim the developer docs now to understand what's exposed vs. what's private. If you're building cross-platform AI features, Gemini's behavior on Apple devices may diverge from what you test cloud-side — plan for that in your eval suite and add Apple-specific regression tests.
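One way to act on the eval-suite advice above is a small regression check that runs the same prompts through two backends and flags answers that diverge. This is only a sketch: the two backend callables, the normalization rule, and the sample prompts are hypothetical stand-ins, and nothing here calls CoreAI or Gemini APIs.

```python
from typing import Callable, Dict, List

Backend = Callable[[str], str]  # hypothetical: prompt in, completion text out


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def find_divergences(prompts: List[str], cloud: Backend, device: Backend) -> List[Dict[str, str]]:
    """Return one record per prompt whose normalized outputs differ between backends."""
    divergences = []
    for prompt in prompts:
        cloud_out, device_out = cloud(prompt), device(prompt)
        if normalize(cloud_out) != normalize(device_out):
            divergences.append({"prompt": prompt, "cloud": cloud_out, "device": device_out})
    return divergences


# Stand-in backends for illustration; replace with real calls to each deployment.
def fake_cloud(prompt: str) -> str:
    return {"2+2?": "4", "Capital of France?": "Paris"}[prompt]


def fake_device(prompt: str) -> str:
    return {"2+2?": "4", "Capital of France?": "paris, France"}[prompt]


if __name__ == "__main__":
    for record in find_divergences(["2+2?", "Capital of France?"], fake_cloud, fake_device):
        print(record)
```

Exact-match comparison after normalization is deliberately strict; for free-form generations you would likely swap in a task-specific scorer and keep the same loop.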
Read more →
Fast Signals
MiMo-V2.5-Pro claims 1,000 tok/s on 1T MoE from a single 8-GPU server
emerging signal
HN Front Page, r/LocalLLaMA
Xiaomi's MiMo team claims 1,000+ output tokens/sec on a 1-trillion-parameter MoE model using a single standard 8-GPU node, roughly 10x what most hosted APIs deliver per active user. If the numbers hold under independent replication, this resets batch inference cost expectations at frontier scale. Watch for community benchmarks.
Link →
Gemma 4 QAT: official Google weights broken — use Unsloth UD Q4_K_XL
workflow
r/LocalLLaMA
Community finding: Google's official Gemma 4 QAT GGUF has a quantization bug. llama-quantize was reportedly run without the --pure flag, so the token embedding was incorrectly quantized to Q6_K. The practical fix is immediate: switch to Unsloth's UD Q4_K_XL variant. If you downloaded the official QAT weights and see degraded output quality, this is the likely cause.
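If you want to check a downloaded file against this finding, a small helper can flag the reported symptom once you have the tensor names and quantization types in hand. This is a sketch under assumptions: the (name, type) pairs are expected to come from whatever GGUF inspection tool you use, `token_embd.weight` is assumed to be the embedding tensor name (llama.cpp's usual naming), and the sample listings are made up for illustration rather than read from a real file.

```python
from typing import Iterable, Optional, Tuple

EMBEDDING_TENSOR = "token_embd.weight"  # assumed name of the token embedding tensor
SUSPECT_TYPE = "Q6_K"                   # type the community report says was applied by mistake


def embedding_type(tensors: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the quantization type of the token embedding, or None if it is absent."""
    for name, quant_type in tensors:
        if name == EMBEDDING_TENSOR:
            return quant_type.upper()
    return None


def looks_affected(tensors: Iterable[Tuple[str, str]]) -> bool:
    """True when the embedding tensor carries the type described in the bug report."""
    return embedding_type(tensors) == SUSPECT_TYPE


# Made-up tensor listings, not taken from any real model file.
sample_bad = [("token_embd.weight", "Q6_K"), ("blk.0.attn_q.weight", "Q4_0")]
sample_ok = [("token_embd.weight", "Q4_0"), ("blk.0.attn_q.weight", "Q4_0")]

if __name__ == "__main__":
    print(looks_affected(sample_bad))
    print(looks_affected(sample_ok))
```

A positive result only means the file matches the symptom described in the thread; it does not by itself confirm degraded output quality.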
Link →
Luce Spark: 35B MoE on 16GB GPU with no offload speed penalty
new tool
r/LocalLLaMA
A community release claims that a 35B MoE model fits on a single 16GB GPU without the typical 2-4x throughput penalty from layer offloading. The approach is described as distinct from standard quantization, and benchmark screenshots have been posted. Worth testing if you're GPU-constrained and running MoE architectures locally.
Link →
llama.cpp PR #24269 adds video input to local multimodal stack
platform change
r/LocalLLaMA
Active maintainer ngxson opened PR #24269 to add video frame input support to llama.cpp's mtmd (multimodal) subsystem. Video-as-context in local inference has been a capability gap — if merged, it brings local vision pipelines to parity with hosted multimodal APIs. Watch main branch.
Link →
OpenEnv becomes open-source consortium for agent training environments
emerging signal
r/LocalLLaMA
OpenEnv, a framework for creating agentic execution environments (terminals, browsers, any surface an agent can interact with), is now jointly owned by Hugging Face, PyTorch, Prime Intellect, Unsloth, Modal, and Mercor. The infrastructure layer for training agents on real environments is going fully open-source, and OpenEnv is likely to become the standard toolkit for agent RL.
Link →
turbovec: TurboQuant-based vector index in Rust with Python bindings
new tool
GitHub Trending
New GitHub project applying Google's TurboQuant (quantization-aware vector compression) to vector search, built in Rust with Python bindings. If TurboQuant's accuracy-at-low-bitwidth holds for embedding spaces, this could meaningfully cut memory for large vector indices. Very early stage — the technique is the signal, not the maturity.
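To make the memory-versus-accuracy trade concrete, here is a sketch of plain uniform scalar quantization applied to one embedding. It is not TurboQuant and not turbovec's API; it is a generic stand-in technique, and the vector values are made up for illustration.

```python
from typing import List, Tuple


def quantize(vector: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Map each coordinate to an integer code in [0, 2**bits - 1] using min/max scaling."""
    lo, hi = min(vector), max(vector)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vector]
    return codes, lo, scale


def dequantize(codes: List[int], lo: float, scale: float) -> List[float]:
    """Reconstruct approximate coordinates from integer codes."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    embedding = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.07, 0.26]  # made-up values
    for bits in (8, 4, 2):
        codes, lo, scale = quantize(embedding, bits)
        restored = dequantize(codes, lo, scale)
        worst = max(abs(a - b) for a, b in zip(embedding, restored))
        # Size ratio ignores the two floats of per-vector metadata (lo, scale).
        print(f"{bits}-bit: max error {worst:.4f}, {bits / 32:.3f}x the float32 size")
```

The worst-case error per coordinate is about half a quantization step, which is why reconstruction quality drops quickly as the bit width shrinks under this naive scheme.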
Link →
Radar
NanoQuant: 0.5–2-bit dense model quantization
Community implementation of a paper enabling sub-2-bit quantization of dense transformers — 0.5 bits/weight means 16x memory reduction vs FP8. Current quality tradeoffs are unmapped. Bookmark for when you need to squeeze a model into genuinely impossible hardware constraints.
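The 16x figure follows directly from the bit widths (8 / 0.5 = 16). A short sketch of that arithmetic, assuming weight storage only (no quantization metadata, activations, or KV cache) and a hypothetical 70B-parameter model picked purely for illustration:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes), ignoring metadata overhead."""
    return num_params * bits_per_weight / 8 / 1e9


def reduction_factor(baseline_bits: float, target_bits: float) -> float:
    """How many times smaller the weights get when moving from baseline to target bit width."""
    return baseline_bits / target_bits


if __name__ == "__main__":
    params = 70e9  # hypothetical model size, not taken from the NanoQuant paper
    for bits in (8, 2, 1, 0.5):
        print(f"{bits} bits/weight: {weight_memory_gb(params, bits):.2f} GB")
    print(f"FP8 -> 0.5-bit reduction: {reduction_factor(8, 0.5):.0f}x")
```

Real formats carry scales, codebooks, or other side information, so the effective bits per weight and the achieved reduction will differ from this idealized estimate.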
Link →
datasette-agent-edit: agent makes targeted edits, not rewrites
Simon Willison's new alpha plugin lets an AI agent make in-place edits to existing text rather than regenerating it from scratch. The architectural pattern, editing the existing artifact rather than rewriting it, tends to be more accurate and cheaper than full regeneration; it is worth borrowing in any pipeline where you regenerate output that mostly hasn't changed.
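The edit-versus-rewrite pattern can be sketched without any plugin at all: the agent returns small (old, new) replacements and the pipeline applies them to the existing text. The function name and the exactly-one-match rule below are assumptions made for illustration, not the API of datasette-agent-edit.

```python
from typing import List, Tuple


def apply_edits(text: str, edits: List[Tuple[str, str]]) -> str:
    """Apply (old, new) replacements in order; each `old` must match exactly once."""
    for old, new in edits:
        count = text.count(old)
        if count != 1:
            raise ValueError(f"expected exactly one match for {old!r}, found {count}")
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    document = "Release 1.0 ships Monday. Docs are incomplete."
    edits = [("Monday", "Tuesday"), ("incomplete", "complete")]
    print(apply_edits(document, edits))
```

Requiring a unique match makes an ambiguous or stale edit fail loudly instead of silently changing the wrong span, and the model only has to emit the changed fragments rather than the whole document.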
Link →
Convergence Watch
gemma 4 qat
TRENDING
8 mentions across r/LocalLLaMA, HN Front Page
Day 4 of sustained coverage. Today's new findings: the official Google QAT weights have a quantization bug (actionable fix: use Unsloth UD Q4_K_XL), and QAT combined with MTP on a 3090 yields a 1.2–1.8x tokens-per-second (TPS) improvement. The community has reached consensus, and the story is now fully actionable. Convergence is likely decelerating.
apple intelligence
3 mentions across HN Front Page
Three distinct HN articles today confirm Apple Intelligence now runs on Gemini with a new CoreAI developer framework. Day 1 — full signal. Expect follow-on coverage of what CoreAI exposes to developers, and what Apple locks down, over the next week.
kvarn
TRENDING
4 mentions across r/LocalLLaMA, HN Front Page
Third consecutive day of KV cache compression coverage. Today adds ggerganov's llama.cpp PR #24277, which avoids KV cell copies, an orthogonal optimization. KVarN quantization and cell-copy avoidance could compound into meaningful long-context throughput gains. Watch for both PRs to merge.
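To see why KV cache work matters at long context, it helps to estimate the cache size. The sketch below uses the standard per-token accounting (keys plus values, per layer, per KV head); the model dimensions are hypothetical, and the lower-bit rows are a generic illustration, not KVarN's actual format.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: float) -> float:
    """Bytes for keys and values across all layers for one sequence."""
    values_per_token = 2 * layers * kv_heads * head_dim  # 2 = one key + one value vector
    return values_per_token * seq_len * bits_per_value / 8


if __name__ == "__main__":
    # Hypothetical model shape chosen for illustration only.
    shape = dict(layers=32, kv_heads=8, head_dim=128)
    for bits in (16, 8, 4):
        gib = kv_cache_bytes(seq_len=131072, bits_per_value=bits, **shape) / 2**30
        print(f"{bits}-bit cache at 128K tokens: {gib:.1f} GiB")
```

The estimate covers a single sequence and ignores any per-block scaling metadata a quantized cache would add.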
llama.cpp
5 mentions across r/LocalLLaMA
Active PR week: Gemma 4 MTP support merged; the E2B/E4B MTP, video input, and KV cell-copy avoidance PRs are open. If you run llama.cpp for local inference, pulling main this week picks up these improvements as they merge, across speculative decoding, multimodal input, and cache efficiency.