TL;DR
Token-heavy coding agents are running into hard budget walls, pushing teams toward cheaper models, token-efficient routing, and more serious infra around retrieval and memory.
At the same time, local/browser inference, MCP-style protocols, sandboxing, and a wave of security incidents are turning agents into real software systems where architecture matters more than picking a single 'best' model.
Report
Token spend is blowing up faster than productivity, and it's starting to kill high-profile internal deployments of coding agents.
At the same time, a cheap model tier plus local/browser inference is turning 'which frontier API?' into a second-order question compared to cost, memory, and security architecture.
Enterprises are starting to say the quiet part out loud: internal AI coding tools are costing more than engineers. Microsoft is canceling most internal Claude Code licenses and Anthropic usage after calling token-based billing unsustainable.
Salesforce alone expects to spend $300M on Anthropic tokens this year for workloads where AI handles roughly one-third to one-half of the work.
Token volume has grown about 17,000x in four years, and Uber’s COO says their AI token budget ran out early without measurable productivity gains.
Community threads describe 'tokenmaxxing' startups with seven-figure monthly token burn and increasing investor pushback on justifying those bills.
A new cheap-model tier is emerging: DeepSeek V4 Pro's 75% API price cut became permanent after a trial period. Analyses put DeepSeek roughly 11.5x cheaper than GPT-5.5 on a per-token basis while still landing on the intelligence-vs-cost Pareto frontier.
On the open-weight side, Qwen 3.7 Max is reported on par with GPT-5.4 and above Gemini 3.5 Flash for many coding tasks, while Kimi K2.6 tops a 3D design leaderboard at about one-tenth the cost of Gemini Flash 3.6.
Cursor’s Composer 2.5 is marketed as roughly an order of magnitude cheaper than both Opus 4.7 and GPT-5.5 for similar coding workloads.
Meanwhile OpenRouter says it routes about 25 trillion tokens weekly across a roster of frontier and low-cost models, normalizing multi-model backends.
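To make the multi-model idea concrete, here is a minimal sketch of cost-aware routing: send each task to the cheapest tier that is trusted with its difficulty. The tier names, prices, and the 1-to-5 difficulty scale are all hypothetical; the prices are chosen only to mirror the roughly 11.5x gap quoted above and do not reflect any provider's actual rates or OpenRouter's routing logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_million_tokens: float  # hypothetical blended price
    max_difficulty: int            # highest task difficulty (1-5) this tier is trusted with


# Hypothetical tiers and prices, for illustration only.
TIERS = [
    ModelTier("cheap-model", 1.0, 2),
    ModelTier("mid-model", 4.0, 4),
    ModelTier("frontier-model", 11.5, 5),
]


def pick_tier(difficulty: int, tiers=TIERS) -> ModelTier:
    """Return the cheapest tier trusted with the given difficulty."""
    eligible = [t for t in tiers if t.max_difficulty >= difficulty]
    if not eligible:
        raise ValueError(f"no tier handles difficulty {difficulty}")
    return min(eligible, key=lambda t: t.usd_per_million_tokens)


def estimate_cost(tokens: int, tier: ModelTier) -> float:
    """Estimated spend in USD for a given token count on a tier."""
    return tokens / 1_000_000 * tier.usd_per_million_tokens


if __name__ == "__main__":
    for difficulty in (1, 3, 5):
        tier = pick_tier(difficulty)
        print(difficulty, tier.name, estimate_cost(2_000_000, tier))
```

In practice the hard part is the difficulty estimate itself, which this sketch takes as given.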
Browser-native AI is moving past demos as PrismML’s Binary and Ternary Bonsai Image 4B models bring 1-bit and ternary text-to-image diffusion into ~3GB WebGPU packages.
The Local Ghost library runs Qwen2.5 fully offline in the browser using WebGPU, while llama.cpp has been adding WebGPU support for about 18 months.
Real-time audio models like LFM2.5-Audio-1.5B and video captioning models such as LFM2.5-VL-1.6B are also running client-side without server dependencies, though users still report compatibility and performance gaps across devices.
On the self-hosted side, AMD-centric Vulkan stacks report roughly 20% speed gains over ROCm and can make RX 7900-class GPUs outperform older NVIDIA 3090 cards for local LLM inference.
Developers are sharing dual-GPU setups and llama.cpp/vLLM configs that revive older cards for local agents, while GPU prices for cards like the 3090 have begun to fall from recent peaks.
Agent infrastructure is being formalized: the new AVE standard defines vulnerability classes specifically for AI agents, and by 2026 more than 30 CVEs had already been assigned to MCP infrastructure.
MCP itself now runs on over 10,000 servers and has a stateless protocol release candidate that removes handshakes, while NSA advisories warn about its cyber-risk surface.
Sandboxing is rapidly becoming default, with Runtime’s sandboxed coding agents, Gemini Managed Agents executing code in a secure Linux sandbox via one API call, and Edge.js running Node workloads inside WebAssembly sandboxes.
At the same time, supply-chain and runtime failures are piling up. They range from GitHub’s breach of roughly 3,800 repos via a malicious VS Code extension and the separate 'Megalodon' compromise of thousands more repositories, to ComfyUI custom nodes that can execute arbitrary Python, to a Starlette auth bypass that affected FastAPI, vLLM, LiteLLM and OpenAI shims.
Dataset poisoning and authentication bypasses are also in play: a poisoned Hugging Face dataset stayed live for six months, and an AWS API Gateway bug let a trailing slash bypass JWT authentication on protected endpoints.
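The trailing-slash class of bug usually comes from matching the raw request path against a protected-route list. A minimal sketch of the defensive pattern follows: normalize the path first, then decide whether auth is required. The route prefixes and function names are hypothetical, and this is not a description of the AWS fix; it also ignores percent-encoding and case differences, which a real gateway would need to handle.

```python
import posixpath

# Hypothetical protected route prefixes, for illustration only.
PROTECTED_PREFIXES = ("/admin", "/api/private")


def normalize_path(raw_path: str) -> str:
    """Drop the query string, collapse duplicate slashes, resolve '.' and '..',
    and strip any trailing slash."""
    path_only = raw_path.split("?", 1)[0]
    # Strip leading slashes first: POSIX normpath preserves a leading '//'.
    return posixpath.normpath("/" + path_only.lstrip("/"))


def requires_auth(raw_path: str) -> bool:
    """True if the normalized path is a protected prefix or lives under one."""
    path = normalize_path(raw_path)
    return any(
        path == prefix or path.startswith(prefix + "/")
        for prefix in PROTECTED_PREFIXES
    )


if __name__ == "__main__":
    for candidate in ("/admin", "/admin/", "//admin/users/", "/administrator", "/public/"):
        print(candidate, requires_auth(candidate))
```

The key design choice is that the auth decision and the router see the same normalized path, so `/admin` and `/admin/` cannot diverge.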
RAG practitioners report that about 60% of failures come from retrieval, not generation, with garbage documents driving hallucinations even when the underlying models are strong.
Teams are experimenting with persistent KV caches instead of traditional chunking, salience-weighted memory retrieval to pack more useful context per prompt, and knowledge-graph-based stores that require continuous indexing.
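As a rough illustration of salience-weighted retrieval, the sketch below ranks stored memories by relevance times salience per token and greedily packs them into a fixed token budget. The `Memory` fields, the scoring formula, and the greedy packing are assumptions made for this example, not any specific team's method; relevance and salience are taken as precomputed scores between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    relevance: float  # similarity to the current query, 0..1 (assumed precomputed)
    salience: float   # importance weight, 0..1 (assumed precomputed)
    tokens: int       # token count of the text, must be > 0


def pack_context(memories, token_budget):
    """Greedily pick memories with the best (relevance * salience) per token
    until the token budget is exhausted."""
    ranked = sorted(
        memories,
        key=lambda m: m.relevance * m.salience / m.tokens,
        reverse=True,
    )
    chosen, used = [], 0
    for memory in ranked:
        if used + memory.tokens <= token_budget:
            chosen.append(memory)
            used += memory.tokens
    return chosen


if __name__ == "__main__":
    store = [
        Memory("user prefers Rust for services", 0.9, 0.9, 100),
        Memory("user said hello on Monday", 0.9, 0.1, 50),
        Memory("deploy target is ARM64", 0.5, 0.8, 50),
    ]
    for memory in pack_context(store, token_budget=150):
        print(memory.text)
```

Greedy packing is not optimal in general (this is a knapsack problem), but it is simple and easy to debug, which matters given the complaints about opaque memory behavior.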
Production agents are hitting memory and state walls rather than model limits, from Slack bots suffering retrieval decay and context loss over time, to Hermes agents where self-reinforcing memory errors accumulate and users ask for faster local retrievers.
In response, some stacks are turning to explicit memory primitives: LangGraph used for durable cross-session memory with TTL-based thread deletion, SQLite-backed memories via tools like SafeDB MCP and Claude Code, and local-first timelines like ScreenMind built on Gemma-powered indexing.
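For a sense of what an explicit memory primitive can look like, here is a minimal sketch of a SQLite-backed store with TTL-based expiry, using only Python's standard library. The table schema and function names are hypothetical and are not the API of LangGraph, SafeDB MCP, or Claude Code; the optional `now` parameter exists only to make the behavior easy to test.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the memory store. Hypothetical schema for illustration."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memories ("
        " thread_id TEXT NOT NULL,"
        " content TEXT NOT NULL,"
        " expires_at REAL NOT NULL)"
    )
    return conn


def remember(conn, thread_id, content, ttl_seconds, now=None):
    """Store one memory that expires ttl_seconds after 'now'."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT INTO memories VALUES (?, ?, ?)",
        (thread_id, content, now + ttl_seconds),
    )
    conn.commit()


def recall(conn, thread_id, now=None):
    """Return unexpired memories for a thread, oldest first."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT content FROM memories"
        " WHERE thread_id = ? AND expires_at > ? ORDER BY rowid",
        (thread_id, now),
    ).fetchall()
    return [row[0] for row in rows]


def purge_expired(conn, now=None):
    """Delete expired rows and return how many were removed."""
    now = time.time() if now is None else now
    cursor = conn.execute("DELETE FROM memories WHERE expires_at <= ?", (now,))
    conn.commit()
    return cursor.rowcount


if __name__ == "__main__":
    conn = open_store()
    remember(conn, "thread-1", "user prefers tabs", ttl_seconds=3600)
    print(recall(conn, "thread-1"))
```

Because the state is a plain table, it can be inspected with ordinary SQL when an agent's memory goes wrong.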
Meanwhile, people testing consumer tools note that ChatGPT-style personal memories tend to stay shallow, remembering isolated facts but not a user’s reasoning process, which aligns with broader concerns about opaque, hard-to-debug agent memory structures.
What This Means
AI engineering conversations are shifting from model worship toward infra questions (tokens, graphs, sandboxes, and memory layouts) as costs and failures hit real systems. For builders of agents and RAG stacks, the interesting story is increasingly how these low-level choices, rather than a single 'best model,' shape what ships.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.