Simon Willison's live experiment at the AI Engineer World's Fair this week produced a workflow worth documenting. The setup: define a measurable metric for your agent, build a small eval set of known-good examples, and run DSPy's MIPROv2 optimizer against it. The optimizer rewrites your system prompt automatically, with no manual iteration required. Willison applied this to Datasette's SQL-generation agent and published the full research notebook on GitHub. The signal is not the tool but the posture shift it represents: most teams still tune prompts by feel, while DSPy treats the problem as optimization. The minimum requirements are modest. Any binary success signal will do: a query that executes, structured output that validates, or a test that passes.
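To make "any binary success signal" concrete, here is a minimal sketch of a query-execution metric for a SQL-generation agent. It uses only Python's standard `sqlite3` module. The schema, the two eval examples, and the names `make_db`, `sql_metric`, and `score` are hypothetical, invented for illustration; they are not taken from Willison's notebook. DSPy itself is not called here, because the source does not show the exact optimizer invocation.

```python
import sqlite3

def make_db():
    """Build a fresh in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
        INSERT INTO orders (customer, total)
        VALUES ('ada', 10.0), ('ada', 5.5), ('bob', 7.0);
    """)
    return conn

# Hypothetical eval set: questions paired with known-good SQL.
EVAL_SET = [
    {"question": "How many orders are there?",
     "expected_sql": "SELECT COUNT(*) FROM orders"},
    {"question": "What is ada's total spend?",
     "expected_sql": "SELECT SUM(total) FROM orders WHERE customer = 'ada'"},
]

def run_query(sql):
    """Run sql against a fresh database; return sorted rows, or None on failure."""
    conn = make_db()  # fresh copy, so a destructive query cannot affect later runs
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except (sqlite3.Error, sqlite3.Warning):
        return None
    finally:
        conn.close()

def sql_metric(example, predicted_sql):
    """Binary signal: True only if the prediction runs and matches the gold rows."""
    expected = run_query(example["expected_sql"])
    actual = run_query(predicted_sql)
    return actual is not None and actual == expected

def score(predict):
    """Fraction of eval examples that a question -> SQL function gets right."""
    hits = sum(sql_metric(ex, predict(ex["question"])) for ex in EVAL_SET)
    return hits / len(EVAL_SET)
```

A function shaped like `score` is the number an optimizer would try to push up. The comparison is on result rows rather than SQL text, so two differently written queries that return the same rows both count as a pass.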
A research team's proof-of-concept AI worm, detailed in "AI Agents Enable Adaptive Computer Worms" from CleverHans Lab at the University of Toronto, self-replicates using only local open-weight models, with LLMs handling the propagation logic. No cloud API dependency, no rate limits, no API key revocation as a containment lever. Anyone who has shipped an agent pipeline consuming untrusted email has been running this threat model in their head for months. Now it has a citation. The structural implication for production pipelines: agent systems that ingest external content (emails, files, web pages, tool outputs) now have a fully offline attack surface to account for. The same week, Tencent open-sourced CubeSandbox, a set of production-grade isolated sandboxes designed for agents executing untrusted code, with instant startup and full lifecycle control built for scale. The threat and one production-ready containment architecture arrived together.
Same pattern, different week.
Community developer @xenovacom published WebGPU kernels pushing Gemma 4 31B to 255 tokens per second in-browser. Browser inference has been a demonstrated capability for two years, but latency kept it impractical for real interactive use. At 255 tok/s, that constraint is gone. Privacy-preserving, zero-API-cost, fully client-side inference moves from theoretical architecture to a production conversation.
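As a rough back-of-the-envelope check on what 255 tok/s means for interactivity, the sketch below converts response length into decode time. Only the 255 tok/s figure comes from the report above. The response lengths are hypothetical, and the calculation assumes a constant decode rate while ignoring model load and prompt-processing time.

```python
DECODE_TOK_PER_S = 255  # throughput figure reported above

def decode_seconds(num_tokens, tok_per_s=DECODE_TOK_PER_S):
    """Time to generate num_tokens at a constant decode rate (prefill not included)."""
    return num_tokens / tok_per_s

# Hypothetical response lengths: a short answer, a paragraph, a long reply.
for n in (50, 300, 1000):
    print(f"{n:>5} tokens -> {decode_seconds(n):.2f} s")
```

Under those assumptions, a 300-token reply takes a little over a second to generate.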
At AI Engineer, Geoffrey Litt named a problem that production agent builders keep running into without clear language for it. His framing: effective human-AI collaboration requires agents to make their decisions legible enough for humans to meaningfully participate in the pipeline. He called it "understand to participate," laid out in his own post-conference writeup and picked up in Simon Willison's conference notes. The architectural read is direct: agents that produce correct outputs but cannot be diagnosed when they fail accumulate a debt that compounds as pipelines grow more complex. Whether legibility tooling arrives as a standard layer or each team rebuilds it from scratch is still open; for now, "understand to participate" has no standard implementation. The spec exists; the tooling doesn't.
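Since no standard implementation exists, the following is only one hypothetical shape a legibility layer could take: a decision log the agent appends to as it works, which a human can read back when something fails. The names `Decision` and `DecisionLog`, their fields, and the sample entry are all assumptions made for illustration, not something Litt proposed.

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    """One recorded choice: what was decided, why, and what was passed over."""
    step: str
    chosen: str
    rationale: str
    alternatives: list = field(default_factory=list)

@dataclass
class DecisionLog:
    """Append-only record of an agent run (hypothetical structure)."""
    decisions: list = field(default_factory=list)

    def record(self, step, chosen, rationale, alternatives=()):
        self.decisions.append(Decision(step, chosen, rationale, list(alternatives)))

    def explain(self):
        """Render the run as numbered, human-readable lines."""
        lines = []
        for i, d in enumerate(self.decisions, 1):
            line = f"{i}. [{d.step}] chose {d.chosen!r} because {d.rationale}"
            if d.alternatives:
                line += f" (rejected: {', '.join(d.alternatives)})"
            lines.append(line)
        return "\n".join(lines)

# Example entry, as a SQL-generation agent might record it.
log = DecisionLog()
log.record("pick_table", "orders", "the question mentions purchases", ["customers"])
print(log.explain())
```

The design choice worth noting is that the rationale and the rejected alternatives are captured at decision time. Reconstructing them after a failure is exactly the diagnosis problem described above.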