← Yesterday Archive All digests

SCAFFOLDING IS THE VARIABLE

Scaffolding quality is separating from model selection as the variable that determines production reliability; Google reprices the API baseline at I/O.

Two developments this week land on the same diagnostic from opposite angles: the gap between 'the model ran' and 'the agent ran reliably in production' is a scaffolding problem, not a model problem. Forge quantifies it on a benchmark. A field incident demonstrates it against a live system. Google, meanwhile, moves the API baseline at I/O with no preview tag and a price signal Simon Willison reads as institutional confidence rather than premium positioning.

This week's items

Forge adds reliability scaffolding to local LLMs. (tooling).

Antoine Zambelli, AI Director at Texas Instruments, built Forge around a specific conviction: operators reaching for larger models are typically solving the wrong problem. The production failure mode is scaffolding quality, not model capacity. Forge adds domain-agnostic reliability tooling to any local model: retry nudges, step enforcement, error recovery, VRAM-aware context management, without requiring a model change. On Forge's own benchmarks, an 8B model moves from 53% to 99% on agentic task completion. Backend-agnostic, works with any llama.cpp-compatible setup. The model selection problem is not always a model problem.

Google ships Gemini 3.5 Flash to general availability..

Google I/O shipped Gemini 3.5 Flash straight to general availability, no preview tag, and Google is deploying it as the inference layer across all their products. Simon Willison read the price increase over the prior Flash tier not as a stumble but as a confidence signal: Google is committing to this as their production default, not positioning it as a premium option. Any integration benchmarked against the prior tier is operating on stale price-quality assumptions. Whether the quality delta justifies the cost delta depends on the specific workload.

Agent issues rm -rf / in field test. (safety).

A builder reported that their local agent, given bash command access during testing, issued rm -rf /. The safety block held. That is not the comforting part of this story. The comforting part is that there was a safety block at all, not that the agent didn't try. Shell-access agents without container isolation or a restricted subprocess allowlist are in a different risk category than agents with structured-API or read-only tool sets. The field-incident record for production agents is accumulating faster than the design norms for handling it.

12-Factor Agents formalizes production LLM design..

The humanlayer team published 12-factor-agents, adapting the original 12-factor app methodology to LLM application design. The repo covers tool-calling patterns, human-in-the-loop design, and failure recovery as first-class architectural concerns. The original 12-factor document, published by Heroku in 2011, codified what the field had already learned through production incident, not ahead of it. 12-factor-agents is in the same position: the design norms are arriving after the incident record has started accumulating. The rm -rf field report above is one data point in that record.

Archestra.ai flags AI commits using Git author filter..

Archestra.ai published an approach using Git's --author flag to detect and block AI-generated bot commits in shared repositories. As agentic coding tools proliferate, open-source repos are receiving AI-generated commits faster than review capacity can absorb them, and programmatic commit hygiene is the realistic response at scale. The technique covers the uncontrolled case, which is most of the current exposure. It does not address deliberate attribution: an agent instructed to sign commits with a human name passes through. The uncontrolled case is the immediate problem; the deliberate case is the next one.

Embedding models are blind to numeric ordering..

Benchmarks published this week show cosine similarity between embeddings of '500 hp car,' '1,200 hp car,' and '73 hp car' is nearly identical across Qwen and ModernBERT-based models. The models have no representation of number ordering. Any retrieval pipeline reasoning about quantities, prices, or numeric ranges has a silent failure mode: results feel relevant because text matches semantically, while the numeric relationship is invisible to the embedding layer. This is not a tuning problem or a prompt problem. Numeric reasoning in retrieval is a category limitation of embedding architecture.

The 12-factor-agents repo and the rm -rf incident are connected: one is an attempt to codify the constraint layer production agents require; the other is a demonstration of what happens without it. The embedding numeric blindness problem has the same shape: the failure mode is documented, the application-layer workaround is known, purpose-built solutions are not yet standard. Both gaps are accumulating incident surface faster than design norms are arriving.