Chinese and open models now beat the US labs on coding benchmarks, but serious users still live inside Claude, Cursor, and a few routing setups while everything underneath—agents, infra, security, and costs—looks much shakier than the marketing. The real game is shifting from single-model IQ to who can orchestrate multi-model systems, tame inference bills, and keep this increasingly AI-soaked stack from falling over.
It feels less like a clean AGI turning point and more like a messy, multipolar software era where infrastructure and trust decide who actually wins.
Key Events
/Kimi K2.6 launched as an open coding model, hitting 58.6 on SWE‑Bench Pro and pricing at $0.95/M input and $4/M output tokens.
/GLM‑5.1 was released with 744B parameters and reported SWE‑Bench Pro wins over Claude Opus 4.6 and GPT‑5.4.
/Anthropic expanded its Amazon deal to secure up to 5 gigawatts of compute as Opus 4.7 increased token usage and topped the LLM Debate Benchmark.
/GitHub paused new signups for Copilot Pro to protect reliability and announced a shift to token‑based billing for Copilot.
/AI dev platform Lovable exposed all projects created before Nov 2025 via broken object‑level authorization, affecting every authenticated user.
Report
Benchmarks now crown Kimi K2.6 and GLM‑5.1 as coding SOTA, while the tools serious devs reach for are still Claude, Cursor, and a handful of routers.
Underneath that leaderboard, agent swarms, security, and compute economics all look far more fragile than the launch threads admit.
multipolar coding sota versus actual workflows
Kimi K2.6 posts a 58.6 SWE‑Bench Pro score, beating Claude Opus 4.6, GPT‑5.4, and Gemini 3.1 Pro on standard coding benchmarks. GLM‑5.1 arrives with 744B parameters and reports stronger SWE‑Bench Pro performance than Opus 4.6 and GPT‑5.4.
It also advertises decoding at 45 tokens per second and prefill at 1350 tokens per second, positioning it as a fast, server‑side coder. On paper this makes coding SOTA look Chinese‑ and open‑tilted, especially with Moonshot's Kimi pitched as GPT‑5.4‑level coding at roughly 65–76 percent below Opus 4.7's cost.
In practice, power users still report preferring Claude (often via Claude Code or Cursor) for deep debugging and multi‑file work, saying Kimi’s real‑world coding feels only marginally better than Opus 4.6 and GLM struggles with reasoning.
Even inside Google, staff reportedly use Claude daily, Gemini 3.1 Pro trails Kimi on SWE‑Bench Pro, and Antigravity draws complaints about lag, outages, and restrictive limits; the fallout has reportedly prompted a DeepMind strike team, with Sergey Brin involved, to rescue the coding stack.
agent swarms that mostly fail at the glue
Kimi K2.6 isn’t just a benchmark model; it can execute more than 4,000 tool calls from a single prompt across multiple languages and sustain long‑horizon runs without human intervention.
It has also been demoed as a swarm controller for roughly 300 parallel sub‑agents. Claude Code already orchestrates cheaper models like Qwen 3.6 as subagents, reportedly cutting Opus token usage by around 30× per task while keeping a high‑IQ controller in charge.
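The controller/subagent split described above reduces to a simple routing pattern: one expensive, high‑IQ model plans and reviews while a cheap model grinds through the mechanical steps. A minimal sketch, with stubbed model calls standing in for real LLM APIs (the function names and call shapes are illustrative, not Claude Code's actual interface):

```python
# Hypothetical controller/subagent split: an expensive "controller" model
# plans and reviews, while a cheap "subagent" model does bulk execution.
# Both model calls are stubs; a real system would hit an LLM API here.

def controller_model(prompt: str) -> str:
    # Stub for an expensive frontier model (planning / review).
    return f"[plan] {prompt}"

def subagent_model(prompt: str) -> str:
    # Stub for a cheap open or local model (mechanical work).
    return f"[done] {prompt}"

def run_task(task: str, steps: list) -> dict:
    plan = controller_model(task)                    # one expensive call
    results = [subagent_model(s) for s in steps]     # many cheap calls
    review = controller_model(f"review: {results}")  # one expensive call
    return {"plan": plan, "results": results, "review": review,
            "expensive_calls": 2, "cheap_calls": len(steps)}

out = run_task("refactor module", ["rename vars", "add tests", "format"])
```

The economics follow directly: expensive-model usage stays constant per task while cheap-model usage scales with the number of steps, which is the shape of the reported ~30× Opus token savings.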
LangGraph is held up as the production‑ready alternative, with multi‑agent screening pipelines and richer failure‑recovery than demo graphs, but it still sits on top of the same brittle orchestration patterns.
The LangChain community reports that about 70 percent of failures in LangChain‑based multi‑agent systems come from orchestration bugs rather than model errors, and debugging pain pushes teams back to plain Python or TypeScript.
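The retreat to plain code usually looks like a hand-rolled loop with explicit retries and an error trail, which is trivially easy to step through in a debugger compared with framework graph state. A minimal sketch, assuming the step functions are stand-ins for agent or tool calls:

```python
# Hand-rolled orchestration: run pipeline steps in order, retry each a few
# times, and keep an explicit error trail instead of burying failures
# inside a framework. The step callables are illustrative stand-ins.

def run_pipeline(steps, max_retries=2):
    results, errors = [], []
    for name, fn in steps:
        for attempt in range(max_retries + 1):
            try:
                results.append((name, fn()))
                break
            except Exception as exc:
                errors.append((name, attempt, str(exc)))
        else:
            results.append((name, None))  # step exhausted its retries
    return results, errors

calls = {"n": 0}
def flaky():
    # Fails once, then succeeds, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "ok"

results, errors = run_pipeline([("fetch", lambda: "data"), ("flaky", flaky)])
```

Everything here is inspectable with a print statement or a breakpoint, which is the whole appeal when a framework's orchestration layer is where 70 percent of the failures live.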
Security layers like Vaultak now sit in front of LangChain agents to monitor actions and roll back policies, while OpenClaw’s agentic automation—despite strong Kimi scores on its ClawMark benchmark and big cost savings—still gets labeled toy‑phase and raises alarms about arbitrary code execution and potential illegal workflows.
compute bets and rising inference bills
Anthropic just locked in up to 5 gigawatts of compute from Amazon to train and serve Claude, staking a power‑plant‑scale bet on frontier models.
At the same time, users see Opus 4.7 consuming noticeably more tokens for both text and images than Opus 4.6, with reports of higher costs and some regressions in hallucinations and accuracy.
Inference bills are already nearing about 10 percent of engineering headcount costs in some teams, so rising token usage lands as a material line item rather than background noise.
Against that backdrop, Moonshot's Kimi line is marketed as GPT‑5.4‑level coding at roughly 65–76 percent lower cost than Opus 4.7, while Kimi K2.6 and GLM‑5.1 match or beat Opus 4.6 on SWE‑Bench Pro at much cheaper price points.
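At K2.6's quoted prices ($0.95 per million input tokens, $4 per million output tokens), the arithmetic behind these comparisons is simple. A sketch, where the token counts and the 70 percent savings midpoint are illustrative assumptions rather than measured figures:

```python
# Per-task cost at Kimi K2.6's quoted prices ($0.95/M input, $4/M output).
# Token counts below are illustrative, not measured usage.

IN_PRICE = 0.95 / 1_000_000   # dollars per input token
OUT_PRICE = 4.00 / 1_000_000  # dollars per output token

def task_cost(input_tokens, output_tokens):
    return input_tokens * IN_PRICE + output_tokens * OUT_PRICE

k26 = task_cost(200_000, 50_000)  # e.g. a long agentic run

# The claimed 65-76% savings implies the same task on Opus 4.7 would cost
# roughly k26 / (1 - saving); 0.70 is an assumed midpoint.
implied_opus = k26 / (1 - 0.70)
```

At these assumed volumes a single long run costs about $0.39 on K2.6 versus an implied ~$1.30 on Opus, which is why rising per-task token usage on Opus 4.7 reads as a material line item.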
The Sanja Fidler conversation adds a further wrinkle by arguing that transformers on today's digital silicon may be nearing their limit for symbolic language, making the current compute arms race look more like an efficiency contest than a guaranteed path to dramatically new capabilities.
local and open quietly eating into the cloud
Local stacks are no longer toys: developers report Sonnet‑class performance on Macs with mid‑range RAM, using models like Qwen 3.6 and other local LLMs for serious workflows.
LM Studio users are running Qwen3.5‑0.8B at roughly 193 tokens per second on a Mac, showing how far tiny local models have come. The same ecosystem plugs Qwen 3.6 in as a Claude Code subagent, reportedly cutting Opus token usage by around 30× per task by offloading grunt work.
Llama.cpp continues to anchor fast local inference, with reports of around 43 tokens per second on a 5090 GPU and successful deployments like a 5G fault‑diagnosis RAG built on Llama 3.2 3B. Ollama and LM Studio often get first‑class integration in open‑source tools despite llama.cpp’s speed, while users complain about Ollama’s performance gaps, varying memory usage across quantized models, and fiddly features like enabling vision in Qwen GGUF.
On the edge of this trend, an AI drug‑discovery platform now runs entirely on Apple Silicon, generating candidates in about seven seconds, and TRELLIS.2 was ported to Apple Silicon via PyTorch MPS, signaling that serious ML workloads are moving off NVIDIA‑only infrastructure.
platform ai and trust signals melting down
GitHub has reoriented its homepage around AI and collaboration while pausing new signups for Copilot Pro to preserve reliability and moving Copilot toward token‑based billing.
Yet only about 1 percent of AI‑generated repositories pass production‑readiness checks, GitHub stars are widely described as gamed and meaningless, and privacy worries over training on private repos push some teams toward self‑hosted alternatives even as Copilot’s inline UX stays strong.
At the same time, Lovable’s broken object‑level authorization exposed all projects created before November 2025 to any authenticated user, and the EU’s official age‑checking app was hacked in about two minutes.
Vercel added to the list with a breach triggered by an employee mistake and a ransom demand around two million dollars, underlining how immature a lot of AI‑centric web infra still is.
Higher up the stack, ChatGPT and Codex outages, Claude onboarding downtime, and concerns that heavy use of chatbots like ChatGPT and Grok may erode critical thinking land awkwardly next to Hyatt’s ChatGPT Enterprise rollout, half of employed Americans using AI at work, and Deezer’s finding that roughly 44 percent of daily uploads are AI‑generated songs.
What This Means
Coding SOTA is fragmenting into a multipolar, often Chinese‑tilted leaderboard while real workflows consolidate around a few orchestrators, routers, and increasingly capable local runtimes running on shaky economics and security. The center of gravity is drifting from single‑model IQ to whoever can tame orchestration, infra, and trust before transformer scaling and the current compute binge hit their natural limits.
On Watch
/Anthropic’s Mythos model is already in use at the NSA despite being blacklisted as a supply‑chain risk, with internal claims it could replace junior engineers and regulators eyeing it for banking exposure, so any concrete capability or policy leak could rapidly change the risk calculus.
/The push toward Physical AI and spatial intelligence, with arguments that transformers on current digital silicon are nearing their symbolic language ceiling, hints at a possible medium‑term pivot in both architectures and hardware substrates.
/Router layers like OpenRouter, already handling over 70 trillion tokens per month and letting users prioritize or blacklist models for privacy and performance, could harden into critical infra if routed model quality converges.
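The prioritize/blacklist behavior attributed to router layers reduces to a small selection rule: walk an ordered preference list and skip anything blacklisted or currently unavailable. A sketch of that rule (the model names are illustrative, and this is not OpenRouter's actual API):

```python
# Model routing sketch: pick the first preferred model that is neither
# blacklisted nor unavailable. Names are illustrative, not real config.

def pick_model(preferences, blacklist, available):
    for model in preferences:
        if model in blacklist:
            continue  # excluded by user policy (privacy, quality, etc.)
        if model in available:
            return model
    raise LookupError("no routable model")

choice = pick_model(
    preferences=["local/qwen", "kimi-k2.6", "claude-opus"],
    blacklist={"local/qwen"},            # e.g. excluded by policy
    available={"kimi-k2.6", "claude-opus"},
)
```

If routed model quality converges, this thin selection layer, plus billing and uptime, is the whole product, which is what makes routers plausible critical infrastructure.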
Interesting
/Users have noted that Qwen 3.5B found multiple bugs that Claude Opus 4.7 could not detect, showcasing its debugging capabilities.
/China's domestic chips are projected to capture 41% of the AI server market by 2025, indicating a shift in the global AI hardware landscape.
/A new reasoning model, Chaperone-Thinking-LQ-1.0, has been open-sourced and achieves 84% on MedQA with a reduced model size, showcasing advancements in AI reasoning capabilities.
/Kimi K2.6 autonomously optimized an 8-year-old financial matching engine, showcasing AI's potential in software maintenance.
/The processing of over 70 trillion tokens per month on platforms like OpenRouter indicates a massive demand for AI solutions, necessitating robust infrastructure.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.