BUILDER SIGNAL BRIEF

Monday, May 25, 2026

Prompt injection exfiltrates files from Copilot Cowork. Any RAG tool that combines file access, external content ingestion, and outbound requests shares this attack surface.

Top Signal

Copilot Cowork exfiltrates files via prompt injection: PoC live (platform change)

HN Front Page

PromptArmor demonstrated a working exploit against Microsoft Copilot Cowork: malicious content embedded in a document instructs Copilot to send file contents to attacker-controlled endpoints. The chain is pure indirect prompt injection. A user opens a weaponized document, Copilot processes it, the embedded instructions override system behavior, and files leave silently, with no further user interaction required. The attack surface is any AI assistant with (1) file access, (2) external content ingestion, and (3) outbound request capability, which describes most enterprise RAG deployments. Actionable now: audit every AI feature you've shipped for this triad. Mitigations include restricting the outbound URLs the model can invoke, treating all ingested content as untrusted input regardless of source, and validating outputs before any action is executed. Use the PromptArmor writeup as a threat-modeling template; the attack pattern generalizes beyond Copilot to any document-aware AI.
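
The first mitigation listed above, restricting outbound URLs, can be sketched as a host allowlist check that runs before any model-initiated request is sent. This is a minimal illustration, not PromptArmor's or Microsoft's implementation: `ALLOWED_HOSTS`, `is_allowed_url`, and the example hostnames are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the only hosts the assistant may contact.
ALLOWED_HOSTS = frozenset({"api.internal.example.com", "docs.example.com"})


def is_allowed_url(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Return True only for HTTPS URLs whose exact host is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL: deny by default
    if parts.scheme != "https":
        return False
    # Reject embedded credentials such as https://docs.example.com@attacker.test/
    if parts.username is not None or parts.password is not None:
        return False
    host = (parts.hostname or "").lower().rstrip(".")
    # Exact match only, so docs.example.com.attacker.test does not pass.
    return host in allowed_hosts
```

A deny-by-default check like this would sit in the tool-execution layer, outside the model, so that injected instructions cannot talk their way around it. It does not cover exfiltration through an allowlisted host that itself accepts attacker-supplied credentials or uploads, so it complements rather than replaces the other two mitigations.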

Fast Signals

NuExtract3: self-hostable 4B VLM for structured doc extraction (new tool)

r/LocalLLaMA

NuMind released NuExtract3, an open-weight 4B vision-language model purpose-built for structured JSON extraction from PDFs, images, Markdown, and OCR'd documents. Self-hostable alternative to GPT-4V for document parsing pipelines — drop it in where you're paying API costs for invoice, form, or report extraction workflows.

Full attention → sparse in 100 steps: cheap long-context adaptation (research to practice)

r/LocalLLaMA

New paper shows pre-trained full-attention models can be converted to sparse attention in under 100 training steps with minimal accuracy loss — no full re-pretraining required. Practical implication: adapt existing base models for efficient long-context inference at a fraction of the usual compute. Watch for llama.cpp and vLLM integrations as this matures.
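
To see why sparse attention matters at long context, it helps to count how many query-key pairs each scheme scores. The item above does not say which sparsity pattern the paper uses, so the sketch below assumes a causal sliding window purely for illustration; the function names and the window size are hypothetical.

```python
def full_causal_pairs(n: int) -> int:
    """Query-key pairs scored by full causal attention over n tokens."""
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Pairs scored when each token attends to at most `window` recent tokens, itself included."""
    if n <= window:
        return full_causal_pairs(n)
    # The first `window` tokens see all of their predecessors; the rest see exactly `window` keys.
    return full_causal_pairs(window) + (n - window) * window


if __name__ == "__main__":
    n, window = 131_072, 4_096  # assumed context length and window size
    ratio = full_causal_pairs(n) / sliding_window_pairs(n, window)
    print(f"full attention scores about {ratio:.1f}x more pairs")
```

Full attention grows quadratically with context length while the windowed count grows linearly, which is the source of the inference savings the item describes.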

earendil-works/pi: coding agent CLI + unified LLM API in one toolkit (new tool)

GitHub Trending

pi is a trending GitHub toolkit bundling a coding agent CLI, unified multi-provider LLM API, TUI and web UI libraries, a Slack bot, and vLLM pod support. If you're assembling agent infrastructure from scratch, benchmark this as a framework baseline before building from parts.

OSCAR RotationZoo: 2-bit KV cache quant without accuracy collapse (research to practice)

r/LocalLLaMA

OSCAR applies offline spectral covariance-aware rotations to enable 2-bit KV cache quantization — more aggressive than standard Q4/Q5 KV approaches. If you're serving long-context models at scale where KV cache memory is your bottleneck, this is the current research frontier to track before productionization.
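
To put the memory stakes in numbers, raw KV cache size scales linearly with bits per value. The arithmetic sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 131,072-token context) that is not taken from the OSCAR paper, and it ignores quantization scales and other metadata.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: int) -> int:
    """Raw KV cache size in bytes for one sequence (no scales or metadata)."""
    values = 2 * layers * kv_heads * head_dim * seq_len  # 2 = one K and one V tensor per layer
    return values * bits_per_value // 8


if __name__ == "__main__":
    GIB = 2 ** 30
    # Hypothetical model shape, chosen only to make the arithmetic concrete.
    shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=131_072)
    for bits in (16, 4, 2):
        size = kv_cache_bytes(bits_per_value=bits, **shape)
        print(f"{bits:>2}-bit KV cache: {size / GIB:.0f} GiB per sequence")
```

Under these assumptions a single sequence needs 16 GiB at FP16, 4 GiB at 4 bits, and 2 GiB at 2 bits, before any per-group scale overhead.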

llama.cpp split mode tensor crash fix imminent: 35% TG speedup unlocked (platform change)

r/LocalLLaMA

Split mode tensor delivers ~35% throughput gain over layer split for multi-GPU setups but currently crashes every 90-120 min due to VRAM exhaustion. A fix PR appears imminent. Multi-GPU llama.cpp operators: watch the PR tracker and prep to enable split mode the moment it lands.

ThriftAttention: selective FP4 precision for long-context attention (research to practice)

r/LocalLLaMA

ThriftAttention selectively applies FP4 precision only to attention heads where low precision costs least accuracy in long-context inference. Paired with OSCAR above, this represents an emerging two-part toolkit for extreme KV compression — read both papers together for a complete long-context optimization picture.

Radar

cmux: macOS terminal built for AI coding agents

Ghostty-based macOS terminal with vertical tabs and per-agent notifications, purpose-built for running multiple AI coding agents in parallel. Early project, but it signals a UX category forming around terminals that treat agents as first-class session types.

MiMo-V2.5-coder: Xiaomi iterates on local coding model

MiMo-V2.5-coder appeared on r/LocalLLaMA. MiMo V2 showed competitive coding results, and Xiaomi has iterated quickly. The initial post includes no benchmarks, so watch for community evals this week.

anthropics/knowledge-work-plugins: official Claude Cowork plugin repo

Anthropic published an open-source repo of knowledge-worker plugins for Claude Cowork on the same day a file-exfiltration vulnerability dropped. If you're building on Cowork's plugin API, study this repo to understand the intended security model before deploying.

Convergence Watch

qwen3.6

5 mentions, all on r/LocalLLaMA

Qwen3.6 35B A3B has dominated r/LocalLLaMA for 5+ consecutive days, with source counts climbing 4→5→7 over the past three days and 5 today. Community consensus is solidifying: it's the current best local model for agentic tasks. A V100 cluster is already achieving 1000+ TPS on Qwen3.6 27B. If you haven't evaluated it as your default local agent backbone, this is the week to do it.
