The Brief, Monday, May 25, 2026

A paper published this week on arxiv (2605.06445) documents a failure mode that most builders who have shipped a coding agent already recognize from production: LLM agents progressively drop constraints stated early in their context as sessions grow longer. The researchers call it "constraint decay," and they demonstrate it is a structural attention and retrieval failure, not random hallucination. Agents comply with schema restrictions, authentication requirements, and business rules at the start of a session. They follow the spec. Then, as context depth increases, they begin silently violating the same rules they initially obeyed. The violations are predictable in their structure, reproducible across model families, and invisible to the user until something breaks downstream.

The paper's proposed mitigations are as interesting as the failure mode itself, because they reveal where the real problem lives. Re-inject critical constraints at regular context intervals, not just at session start. Decompose long generation tasks into bounded subtasks with explicit constraint re-statements at each boundary. Add lightweight post-generation constraint verification. Every proposed fix operates at the orchestration layer. The model understood the constraints and complied with them initially; the system surrounding the model failed to sustain that compliance across a long session. The problem is architectural through and through, and the solutions are too.

That distinction between model capability and system architecture showed up independently in three places over a single weekend. Epoch AI published data this week showing that memory now accounts for nearly two-thirds of AI chip component costs, a structural shift from the compute-dominated bill-of-materials configurations that defined the training era. For the workloads builders care about most right now (MoE inference, long-context applications, KV cache-heavy agent pipelines), the cost bottleneck has already migrated from compute to memory bandwidth. The infrastructure economics have moved ahead of the popular conversation about them. Every dollar spent on KV cache compression or quantization is hitting the single largest line item on the infrastructure bill, and the returns on that compression compound as memory's cost share continues to grow.

Then there are the tools, which make the pattern concrete at the practitioner level. A developer shipped hipEngine this week: hand-written RDNA3 GPU kernels built specifically to maximize Qwen 3.6 MoE throughput on AMD's Strix Halo and 7900 XTX hardware. This is custom low-level inference work, outside llama.cpp, that exists because generic backends leave measurable performance on the table for specific silicon. The same weekend, DeepSeek released Reasonix, a coding agent architected from the ground up around prefix caching mechanics, structuring every prompt to maximize cache reuse and minimize marginal token cost. Both projects share a premise with the constraint decay paper and the Epoch AI data: the model is already good enough, and the engineering problem is everything surrounding it. Three independent signals pointing at the same underlying dynamic. The pattern is specific enough to name: architecture season. The phase when raw capability has crossed a sufficiency threshold and the returns shift to engineering around the platform's constraints.

Platform transitions follow this arc reliably, and watching prior cycles helps calibrate the shape of this one. I was tracking the web development industry in 2005 when the AJAX wave hit: Google Maps, Gmail's dynamic interface, the entire generation of Web 2.0 products that redefined what a browser-based application could feel like. The creative explosion came from developers who had internalized that browsers were capable enough and shifted their engineering energy to architecting around the constraints. Single-threaded execution, same-origin policy, stateless HTTP: these became engineering parameters to optimize around, not limitations worth complaining about. The mobile cycle followed the same arc within a few years. Once smartphones crossed a capability threshold around 2012, the products that defined the era (Uber, Instagram, WhatsApp) were each architectural bets on sufficient hardware. They engineered aggressively around battery life, intermittent connectivity, and small-screen interaction models.

LLMs are crossing the same capability-sufficiency threshold. The frontier models available today, including open-weight models running on consumer GPUs, handle a wide range of production tasks competently. The constraint decay paper makes this inflection concrete: the models the researchers tested understood the schema restrictions and business rules and complied with them at session start. The failure emerged from how the surrounding system managed attention and retrieval over a growing context window. The fix is a set of architectural patterns (constraint re-injection, task decomposition, post-generation verification) that work regardless of which model sits at the center. When both the failure mode and the solution live in the orchestration layer, the model has completed its transition from product to platform. The engineering energy shifts to everything around it, which is exactly the transition architecture season describes.

The builders who internalize this shift early accumulate compounding advantages. The hipEngine developer chose to write custom kernels for specific AMD silicon rather than waiting for generic backends to optimize their dispatch paths. The Reasonix team at DeepSeek architected their entire coding agent around cache economics rather than assuming future model generations would solve their cost structure. These are infrastructure investments whose returns are immediate. And because infrastructure expertise compounds through institutional learning, through kernel-level familiarity with specific hardware, through optimized prompt structures tuned to specific caching behavior, the advantages become increasingly expensive for late movers to replicate. Model access democratizes on a predictable cadence: weights get released, APIs get cheaper, fine-tuning tools improve. Architecture expertise follows a different curve entirely.

The question of where durable value accrues in the current AI cycle sharpens considerably under this lens. If the model is the platform and architecture is the differentiator, then proprietary orchestration layers, hardware-specific inference paths, and cache-aware prompt structures become competitive advantages that survive model upgrades intact. A better base model raises every product's ceiling by the same increment. Better architecture around the current model raises one ceiling selectively and cumulatively. The asymmetry is structural: model capability is a public good that improves on a shared schedule, while infrastructure expertise is a private good that compounds on an individual one.

Qwen 3.6 MoE tells the story at practitioner scale. The model has dominated local inference discussion for three consecutive days, and the conversation has already shifted from benchmark comparisons to hardware-specific kernel optimization and production stability testing. Developers are surfacing multi-token prediction bugs in tool-calling pipelines, filing llama.cpp issues, building custom RDNA3 inference paths, and debating VRAM allocation strategies across hardware configurations. The model shipped and the architecture work started the same day. The community has moved past capability evaluation and into infrastructure engineering. That sequence, capability sufficient and engineering focus migrating to everything around it, is architecture season at the community level.

ARCHITECTURE SEASON