TL;DR
This month was less a 'new GPT moment' and more a stress test of the whole stack: OpenAI kept its lead even as a Pentagon deal fueled a visible ethics‑driven migration toward Claude.
At the same time, Chinese/open models like Qwen, GLM‑5, and Kimi quietly hit frontier‑grade benchmarks while agent frameworks, KV‑cache hacks, and WebMCP‑style tool protocols emerged as the most unstable—and exploitable—parts of the ecosystem.
Key Events
Report
The strangest pattern this month: the labs selling themselves hardest as 'safe' and 'responsible' are now writing classified strike software and, in one case, powering a 150GB government data heist.
At the same time, an 'open' agent framework just leapfrogged React on GitHub while shipping with thousands of known vulnerabilities and a named attack class.
OpenAI still owns consumer mindshare—around 900 million weekly actives and 50 million paying subscribers—even as a Pentagon deal triggered a 295% spike in ChatGPT uninstalls and a loud 'Cancel ChatGPT' wave.
That backlash has real teeth at the edges: Claude Cowork hit #1 on the U.S. App Store, Claude topped free charts in the U.S. and Canada, and users explicitly cite Anthropic’s Pentagon stance and data‑handling as reasons for jumping ship.
But Anthropic is also running custom Claude models for the Pentagon that sit 1–2 generations ahead of its consumer releases, reportedly used in strikes on Iran and cleared for classified work, just as 300+ Google/OpenAI staff protest military AI and the Pentagon explores stripping safety features via the Defense Production Act.
GLM‑5 (744B params, AA Index 50) and Kimi K2.5 (50.2 on Humanity’s Last Exam at about $0.28 per task) now sit within single‑digit benchmark points of leading proprietary models while running on commodity NVIDIA Blackwell.
Qwen 3.5‑35B‑A3B reportedly beats GPT‑OSS‑120B on coding at a third of the size, runs beyond 1M tokens of context on a 32GB GPU, and hits roughly 2,000 tokens per second on dual‑3090 rigs, while smaller 0.8B–9B variants dominate Hugging Face charts and run on 5GB RAM.
The flip side is governance and stability: Qwen's larger models show notable hallucinations and odd zero‑shot drops, and key staff such as Junyang Lin have left; DeepSeek is racing from v3 to v4 in four months with open weights on Chinese chips amid bias‑gap and data‑theft allegations; and Google's Nano Banana 2 now leads text‑to‑image from a closed U.S. stack.
OpenClaw’s promise of local personal agents rocketed it to roughly 246k GitHub stars—above React—by making it trivial to orchestrate email, scheduling, and even multi‑device control.
Security audits then found over 2,000 vulnerabilities (10 critical) and a 'ClawJacked' technique where hostile sites hijack installs, while broader scans show 80% of agent repos with vulns and 38% with critical ones, usually lacking basic human‑oversight gates.
Underneath that, the tool layer is equally porous: 41% of official MCP servers ship without auth, honeypots like HoneyMCP already exist to catch rogue probes, and teams rely on observability tools such as LangSmith that both cost real money and complicate data privacy for production traces.
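The human‑oversight gates the audits say are missing are cheap to add in principle. A minimal sketch follows; the `gated` wrapper, the `DANGEROUS` set, and the tool names are all illustrative assumptions, not the API of OpenClaw, MCP, or any framework named above:

```python
# Minimal human-in-the-loop gate for agent tool calls.
# Illustrative only: tool names and the approval callback are assumptions,
# not taken from any framework mentioned in this report.

DANGEROUS = {"send_email", "delete_file", "run_shell"}  # tools needing sign-off

def gated(tool_name, fn, approve):
    """Wrap a tool so dangerous calls require explicit operator approval.

    `approve` is any callable taking a description string and returning a
    bool -- a CLI prompt, a chat-ops ping, or a policy engine.
    """
    def wrapper(*args, **kwargs):
        if tool_name in DANGEROUS:
            if not approve(f"agent wants {tool_name}{args}"):
                raise PermissionError(f"{tool_name} blocked by operator")
        return fn(*args, **kwargs)
    return wrapper

# Usage: with a deny-all policy, the risky call raises instead of running.
send = gated("send_email", lambda to, body: f"sent to {to}",
             approve=lambda msg: False)
```

The point is less the ten lines than where they sit: the scans above suggest most agent repos ship nothing equivalent between the model and side‑effecting tools.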
Qwen3.5‑35B‑A3B plus KV‑cache engineering shows how far raw context is stretching: about 74.7 tokens per second with a q8_0/bf16 cache, million‑token contexts on a single 32GB GPU, and roughly 2,000 tokens per second on dual‑3090 setups.
Developers are also running into the edge cases—slowdowns from aggressive KV cache clearing, fp8 caches corrupting outputs until switched to bf16, and unpredictable behavior during context switches—so the 'infinite history' illusion rests on brittle internals.
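The fp8‑versus‑bf16 failure mode is, at bottom, precision arithmetic. The toy emulation below (not real inference kernels; both quantizers are crude approximations, and it assumes the 'fp8' in those reports behaves like e4m3) shows why a format with an 8‑bit exponent survives the heavy‑tailed values attention caches hold, while a 4‑bit‑exponent format clips and coarsens them:

```python
import numpy as np

def to_bf16(x):
    """Emulate bfloat16 storage: keep float32's 8-bit exponent and
    truncate the mantissa to 7 bits (drop the low 16 bits)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def to_fp8_e4m3(x):
    """Crudely emulate fp8 e4m3: clamp to the format's max finite value
    (+/-448) and round to ~4 significant bits (1 implicit + 3 stored)."""
    x = np.clip(np.asarray(x, dtype=np.float32), -448.0, 448.0)
    m, e = np.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    return np.ldexp(np.round(m * 16) / 16, e).astype(np.float32)

# Heavy-tailed sample standing in for attention key/value magnitudes.
rng = np.random.default_rng(0)
kv = (rng.normal(size=20_000) * np.exp(rng.normal(0, 3, 20_000))).astype(np.float32)

err_fp8 = np.mean(np.abs(to_fp8_e4m3(kv) - kv))
err_bf16 = np.mean(np.abs(to_bf16(kv) - kv))
print(f"mean abs error  fp8-ish: {err_fp8:.4g}  bf16-ish: {err_bf16:.4g}")
```

In llama.cpp‑style runtimes the same trade‑off surfaces as cache‑type flags (e.g. `--cache-type-k q8_0`), which is consistent with reports that switching the cache dtype, not the weights, cleared the corruption.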
Meanwhile Claude now remembers across sessions and can import ChatGPT and Gemini histories in roughly 60 seconds while cutting Claude Code's memory usage 40×. Redis‑backed Memento MCP offers fragment‑based long‑term agent memory, and offline RAG tools like ConceptLens move knowledge graphs to the laptop. WebMCP lets websites register structured tools for agents, even as people immediately worry about dark patterns and insecure internal APIs.
What This Means
Power is drifting toward whoever can safely wire long‑lived agents into messy real‑world systems—defense clouds, Chinese/open‑weight stacks, browser‑level tool protocols—while the underlying security and reliability story is clearly nowhere near as mature as the marketing.
On Watch
Interesting
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.