TL;DR
This month’s real story is in the plumbing: decoding hacks, mega‑clusters, cheap Qwen/DeepSeek‑class challengers, and long‑memory agents quietly changed what the frontier feels like, while the AGI discourse mostly stayed vibes‑only. Training is concentrating into power‑plant‑scale facilities even as local inference on consumer GPUs and Apple Silicon gets fast and cheap enough to be genuinely useful.
The decisive edge is drifting toward runtimes, retrieval/memory, and security posture, not whose base model has the flashiest benchmark chart.
Key Events
Report
Most of the interesting progress this month is not new frontier models; it is the stack quietly removing its own bottlenecks. Decoding tricks, mega‑clusters, and agents with memory are doing more work than another benchmark tweet.
On one end, frontier labs are locking in power‑plant‑scale clusters: Anthropic leased the entire Colossus 1 facility from SpaceX, with over 220,000 NVIDIA GPUs and about 300MW of power, in a multi‑year deal.
A separate $16B AI data center in Michigan points the same way, as does data‑center construction spending, which has now surpassed office construction.
GPU scarcity shows up downstream, with rental prices for NVIDIA B200s jumping 114% in six weeks as demand for AI compute spikes. At the other end of the barbell, a used Tesla P100 with 16GB VRAM sells for around $70 and is still considered viable for hosting LLMs, while NVFP4 builds tuned for the RTX 5090, such as LTX 2.3 and Qwen 3.6 35B, are reported to run 200k‑token contexts on a single card.
Reports of 75× performance gains on DeepSeek V4 in two weeks from AMD’s ROCm stack, plus GB300 NVL72 systems running 2.7× faster than GB200 in practice, underline how much of the remaining gap is now software and system design, not just silicon.
Multi‑Token Prediction turned several models from merely usable into genuinely snappy, with Qwen 3.6 27B getting about 2.5× faster inference and 80+ tokens per second on consumer GPUs, and Gemma 4 seeing up to 3× token‑per‑second gains.
llama.cpp’s beta MTP support lets Gemma 4 26B draft tokens roughly 40% faster, and these day‑0 MTP releases landed simultaneously in the transformers, MLX, and vLLM runtimes.
DFlash and speculative decoding push further: Gemma 4 26B has been clocked at around 600 tok/s on an RTX 5090 via speculative decoding, and DFlash‑based setups report up to 8.5× decoding speedups in other contexts.
The same speculative‑decoding idea is now wired into RL training, with reports of 2.5× faster end‑to‑end RL at 235B scale without changing model behavior.
Users are also hitting the sharp edges: DFlash degrades on very long contexts beyond ~20k tokens, MTP can hurt creative tasks, and these tricks add VRAM overhead and finicky model‑config requirements.
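The draft‑and‑verify loop behind speculative decoding can be sketched with toy stand‑ins. This is a minimal greedy sketch under stated assumptions, not the implementation used by llama.cpp, vLLM, DFlash, or any other runtime: `target_next` and `draft_next` are hypothetical deterministic functions standing in for an expensive model and a cheap drafter, and the bonus token that real runtimes emit after a fully accepted draft is omitted for brevity.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    # Hypothetical "expensive" model: a deterministic toy rule.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    # Hypothetical "cheap" drafter: agrees with the target except when the
    # context length is a multiple of 4 (a toy shortcut, not a real model).
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(next_fn: NextFn, prompt: List[Token], n_new: int) -> List[Token]:
    # Baseline: one model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_fn(out))
    return out


def speculative_decode(
    target: NextFn, draft: NextFn, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        # 1) The drafter proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks every drafted position. A real runtime scores
        #    them in one batched forward pass, so we count one pass here.
        verify_passes += 1
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # Keep the accepted prefix, then the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out, verify_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    reference = greedy_decode(target_next, prompt, 20)
    tokens, passes = speculative_decode(target_next, draft_next, prompt, 20)
    print("identical output:", tokens == reference)
    print("target passes:", passes, "vs 20 for plain greedy decoding")
```

Because every emitted token is one the target would have chosen itself, the output matches plain greedy decoding; the speedup comes only from needing fewer target passes, and it shrinks as the drafter's guesses get worse.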
DeepSeek V4 Pro now matches GPT‑5.2 on the FoodTruck Bench while being around 17× cheaper, and users call it the best open‑source coding model, outperforming Opus 4.7 and GPT‑5.5 on their workloads.
Qwen 3.6 27B is reported to beat Codex GPT‑5.5 and Claude Opus 4.7 on certain coding tasks, while still running at 54–135 tokens per second on commodity GPUs and even fitting into 12GB VRAM for fast local use.
Kimi K2.6 is roughly five times cheaper than Opus 4.7 while scoring competitively on debate and coding benchmarks, and GLM‑5.1 has been floated as a potential Claude killer for coding with continuous‑operation agents.
Even the incumbents are repositioning: GPT‑5.5 is estimated to be 4–5× cheaper than Claude Mythos at comparable capability, while its Instant variant cuts hallucinated claims by 52.5% on high‑stakes prompts.
The net effect on the ground is that users see Codex overtaking Claude Code in downloads and reliability, DeepSeek and Qwen displacing GPT/Claude for day‑to‑day coding, and many now treating premium frontier models as an exception, not the default.
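To make the "N× cheaper" figures above easier to compare, the short sketch below converts each ratio into the share of spend it would save. The ratios are the ones quoted in this report; the helper name `savings_from_ratio` is an illustrative assumption, and the calculation presumes identical token usage across models, which real workloads may not match.

```python
def savings_from_ratio(times_cheaper: float) -> float:
    """Fraction of spend saved if a model is `times_cheaper` times cheaper."""
    if times_cheaper <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 - 1.0 / times_cheaper


# Ratios as quoted in the report (not independently verified).
quoted_ratios = {
    "DeepSeek V4 Pro vs GPT-5.2": 17.0,
    "Kimi K2.6 vs Opus 4.7": 5.0,
    "GPT-5.5 vs Claude Mythos (low estimate)": 4.0,
    "GPT-5.5 vs Claude Mythos (high estimate)": 5.0,
}

if __name__ == "__main__":
    for label, ratio in quoted_ratios.items():
        print(f"{label}: {ratio:g}x cheaper = {savings_from_ratio(ratio):.1%} saved")
```

Past roughly 5× the extra savings flatten out: moving from 5× to 17× cheaper only lifts the saving from 80% to about 94%, which is one reason quality and reliability differences start to dominate the choice.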
Hermes Agent processed about 271 billion tokens and became the most‑used agent on OpenRouter over the last day, ahead of Claude Code and OpenClaw, with nearly 1,000 contributors extending its behavior.
Its 2.0 memory system adds long‑term recall via a knowledge‑graph‑style installer, mirroring a broader push toward persistent memory brokers and agentic vector databases for cross‑session context.
LangGraph is emerging as the runtime spine for this world, adding node‑level error handlers, checkpointing with rollbacks, and delta‑style storage channels under LangChain and other agent stacks.
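The general pattern of checkpointing with rollback plus node‑level error handlers can be illustrated without any framework. The sketch below is standard‑library Python only and is explicitly not LangGraph's API: `CheckpointedRunner`, the node functions, and the handler are all hypothetical names chosen for illustration.

```python
import copy
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]
Node = Callable[[State], State]
ErrorHandler = Callable[[State, Exception], State]


class CheckpointedRunner:
    """Runs nodes in order, snapshotting state before each one."""

    def __init__(self) -> None:
        self.checkpoints: List[State] = []

    def run(
        self,
        state: State,
        nodes: List[Tuple[str, Node]],
        handlers: Optional[Dict[str, ErrorHandler]] = None,
    ) -> State:
        handlers = handlers or {}
        for name, node in nodes:
            # Checkpoint: keep an independent copy of the state so far.
            self.checkpoints.append(copy.deepcopy(state))
            try:
                state = node(copy.deepcopy(state))
            except Exception as exc:
                handler = handlers.get(name)
                if handler is None:
                    raise  # no node-level handler registered: propagate
                # Rollback: hand the handler the last good snapshot, so any
                # partial writes made by the failed node are discarded.
                state = handler(copy.deepcopy(self.checkpoints[-1]), exc)
        return state


def fetch(state: State) -> State:
    state["docs"] = ["a", "b"]
    return state


def summarize(state: State) -> State:
    state["partial"] = "half-written"  # partial write before the failure
    raise RuntimeError("model timeout")


def on_summarize_error(state: State, exc: Exception) -> State:
    state["summary"] = f"fallback ({exc})"
    return state


if __name__ == "__main__":
    runner = CheckpointedRunner()
    result = runner.run(
        {"query": "mtp"},
        [("fetch", fetch), ("summarize", summarize)],
        {"summarize": on_summarize_error},
    )
    print(result)
```

Snapshotting before each node is the simplest design; storing only deltas between snapshots, as the delta‑style channels mentioned above suggest, trades that simplicity for lower storage cost on large states.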
At the protocol layer, MCP standardizes how models discover tools, authentication, and memory, from n8n workflows built from plain‑language descriptions to Exa MCP for people/company data and Cloudflare‑hosted memory servers with semantic search.
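Under the hood, MCP messages are JSON‑RPC 2.0, with tool discovery and invocation exposed as the `tools/list` and `tools/call` methods. The sketch below only builds the request payloads and sends nothing over the wire; the tool name `search_memory` and its arguments are hypothetical, and the initialization handshake and transport are left out.

```python
import json
from itertools import count
from typing import Any, Dict, Optional

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def jsonrpc_request(method: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request object as a plain dict."""
    message: Dict[str, Any] = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
    }
    if params is not None:
        message["params"] = params
    return message


# Ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool by name. "search_memory" and its arguments are
# hypothetical; a real server advertises its own names and schemas.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "search_memory", "arguments": {"query": "decoding speedups", "limit": 3}},
)

if __name__ == "__main__":
    print(json.dumps(list_tools, indent=2))
    print(json.dumps(call_tool, indent=2))
```

A client would normally call `tools/list` first and validate arguments against the schema each tool advertises before issuing `tools/call`.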
All of this is landing in a hostile environment where attackers have already poisoned Hugging Face and ClawHub with over 575 malicious skills, Chrome is silently pushing a 4GB Gemini Nano model to browsers, and even Edge stores passwords in cleartext memory, turning the AI stack itself into an attack surface.
WAN 2.2 remains the day‑to‑day SOTA for human‑centric video, with creators praising its handling of complex anatomy, prompt adherence, and character consistency despite GPU demands and clip‑length issues.
Kling 3.0 and Bach‑1.0 Preview push the other frontier, replacing green screens and props with 4K AI VFX that can drop some production costs from around $100,000 to $5 while delivering micro‑textures and crisp reflections.
Seedance 2.0 leans into narrative, powering nearly 50,000 AI microdramas on Douyin in a month, offering near‑infinite video length, one‑click cinemagraphs, and up to 90% cost reductions for film scenes.
On the tooling side, ComfyUI’s custom node packs give 72 building blocks for masking, segmentation, and inpainting, while workflows like SDXL epicrealism plus face inpainting still dominate precise edits for Netflix‑grade work where inpainting can represent half the pipeline.
Forge Neo is absorbing users from A1111 with better performance on tasks like Anima and easier installs, even as some missing samplers, ControlNet quirks, and model regression reports keep ComfyUI the preferred playground for power users chasing maximum control.
What This Means
The center of gravity is sliding away from single closed models toward a stack where decoding tricks, mega‑clusters, cheap challengers, and long‑memory agents together shape real capability, from MTP/DFlash speedups and Colossus‑scale clusters to DeepSeek/Qwen price–performance and Hermes‑style agents with persistent memory. Progress over this period looked less like one headline model drop and more like a mesh of runtimes, infrastructure, and workflows in video, retrieval, and local inference quietly redefining what state of the art means in practice.
On Watch
Interesting
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.
Sources