TL;DR
The big moves this month weren’t new IQ points; they were in the plumbing: harnesses, routers, compression tricks, and local rigs that decide how far you can actually push the models you already have.
Open weights and local-first stacks are eating more of the coding and reasoning workload just as token economics and real security failures start to bite, so power is drifting away from any single frontier API toward whoever controls the infrastructure around it.
Key Events
Report
Everyone’s obsessing over frontier model IQ, but the sharpest moves this month were in the infrastructure that decides how, where, and at what price that IQ actually runs.
AGI talk kept getting louder while control planes, compression tricks, and security failures quietly redefined the real constraint surface.
LangChain users estimate that about 70% of failures in their systems come from agent orchestration bugs rather than from bad LLM answers. Google tested 180 different agent setups and found that multi-agent configurations made performance on sequential tasks about 70% worse on average, even with strong base models like Gemini 3.1.
OpenClaw blew up as the canonical overpowered harness: a personal agent with full local system access, 250K GitHub stars, and skills hand-authored by humans. It was also tagged a security nightmare, with privilege escalation and sandbox escape findings.
Anthropic responded by banning OpenClaw from normal Claude quotas, then cutting off third-party harnesses from Claude subscriptions starting April 4, pushing people toward its own Managed Agents runtime instead.
At the same time, LangGraph shipped 8-node StateGraphs that parse gnarly government PDFs and a memory firewall that intercepts about 90.5% of poisoning attempts, so the hard problems are increasingly about state machines and guardrails rather than raw model quality.
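The pattern those numbers point at can be sketched in a few lines: orchestration as an explicit state machine with a guardrail node in front of memory, rather than free-form LLM calls. This is a toy illustration of the idea, assuming a linear two-node pipeline and a keyword-based firewall rule; it is not LangGraph's actual StateGraph API or the shipped memory firewall.

```python
# Illustrative sketch: agent orchestration as an explicit pipeline of nodes,
# with a "memory firewall" node that runs before anything touches state.
# Node names and the firewall heuristic are invented for this example.

def firewall(state):
    # Drop memory entries that look like injected instructions.
    banned = ("ignore previous", "exfiltrate")
    state["memory"] = [m for m in state["memory"]
                       if not any(b in m.lower() for b in banned)]
    return state

def parse(state):
    # Stand-in for a real document-parsing node.
    state["parsed"] = state["doc"].strip().split("\n")
    return state

PIPELINE = [firewall, parse]  # a minimal linear "graph"

def run(state):
    for node in PIPELINE:
        state = node(state)
    return state

out = run({"doc": "line1\nline2",
           "memory": ["note", "IGNORE PREVIOUS rules"]})
# the poisoned memory entry is filtered before the parse node sees it
```

The point of the shape, per the orchestration-failure numbers above, is that the bugs live in these transitions and guards, not in the model call itself.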
GLM‑5.1 landed as an MIT-licensed open-weight MoE with 744B total parameters, about 40B active, and a 200K context window, aimed squarely at coding and agents.
It scored 58.4 on SWE‑Bench Pro, topping that benchmark at launch and roughly matching Claude Opus 4.6 at about one-third of the price.
Kimi K2.6 posted a 58.6 SWE‑Bench Pro score, beating Opus 4.6 and GPT‑5.4 on that metric while being around 76% cheaper than Opus 4.7 per input token and released as open source.
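Those pricing ratios are the real story, so here is the arithmetic made explicit. Only the ratios come from the report (GLM‑5.1 at roughly one-third of Opus pricing, Kimi K2.6 about 76% cheaper per input token); the Opus baseline price and the 2M-token task size are illustrative placeholders.

```python
# Back-of-envelope cost comparison. Baseline price is an assumption,
# not a real Anthropic list price; the ratios are from the report.

OPUS_INPUT = 15.0                     # assumed $ per million input tokens
glm_input = OPUS_INPUT / 3            # ~one-third of Opus pricing
kimi_input = OPUS_INPUT * (1 - 0.76)  # ~76% cheaper per input token

def task_cost(price_per_m, tokens):
    """Dollar cost of a run at a given per-million-token price."""
    return price_per_m * tokens / 1_000_000

# a hypothetical agent run consuming 2M input tokens:
costs = {name: task_cost(p, 2_000_000)
         for name, p in [("opus", OPUS_INPUT),
                         ("glm", glm_input),
                         ("kimi", kimi_input)]}
# opus ≈ $30, glm ≈ $10, kimi ≈ $7.20 for the same run
```

At agent-scale token volumes, those ratios compound fast, which is why the benchmark-adjacent open models are pulling workloads even when they are not strictly the best scorers.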
Alibaba’s Qwen3.6‑35B‑A3B is another Apache-licensed sparse MoE with 3B active parameters, while Qwen‑3.6‑Plus became the first model to process over 1 trillion tokens in a single day.
Despite the benchmark wins, users complain that GLM‑5.1 is slow, with tight parallelism limits, and that Kimi K2.6 underperforms on messy real-world tasks; meanwhile, OpenAI’s share of gen‑AI web traffic is shrinking while Gemini’s has climbed from 6% to 25.46% over the past year.
Muse Spark from Meta Superintelligence scored 52 on the Artificial Analysis Intelligence Index, just behind Gemini 3.1 Pro and GPT‑5.4, while using less than a tenth of the compute of Llama 4 Maverick; it shipped, however, with neither open weights nor a general API.
Local-first stacks quietly got a huge upgrade: Google’s TurboQuant compressed KV caches by at least 6x and sped up decoding by up to 8x with no reported accuracy loss, enabling big models like Qwen 3.5 to run on standard hardware.
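To make the compression claim concrete, here is a generic quantization sketch of the kind of trick involved: storing cache values as small integers plus a per-block scale instead of full floats. This is not TurboQuant's actual algorithm (which reportedly reaches ≥6x; plain float32→int8 gives 4x before overhead), just the baseline idea.

```python
# Generic symmetric per-block quantization: floats -> int8 + one scale.
# Illustrative only; TurboQuant's actual method is not public in the report.

def quantize_block(values, bits=8):
    qmax = 2 ** (bits - 1) - 1               # 127 for int8
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [round(v / scale) for v in values]   # small ints, 1 byte each
    return q, scale

def dequantize_block(q, scale):
    return [x * scale for x in q]

# a toy slice of a KV cache:
kv = [0.12, -0.5, 0.33, 0.9, -0.07, 0.41]
q, s = quantize_block(kv)
recovered = dequantize_block(q, s)
err = max(abs(a - b) for a, b in zip(kv, recovered))
# worst-case error is about half the scale step -- small relative to the values
```

Shrinking the KV cache matters more than shrinking weights for long contexts, because the cache grows with every token generated; that is why a cache-side trick unlocks big models on standard hardware.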
On Apple Silicon, MLX plus DFlash doubled Qwen 3.5‑27B generation speed on an M5 Max, with one user hitting 72 tokens per second from Qwen3‑Coder‑Next on a MacBook Pro with 128GB unified memory.
A separate experiment ran a 397B-parameter MoE model by streaming its 209GB of weights from SSD in real time on a MacBook with 24–48GB RAM, sustaining about 1.77 tokens per second.
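The SSD-streaming trick rests on a simple mechanism: memory-map the weight file and read only the slice a given layer (or active expert) needs, when it needs it. The sketch below shows that mechanism with Python's stdlib `mmap` and `struct`; the file layout and layer sizes are invented for the example and are not the experiment's actual format.

```python
import mmap
import os
import struct
import tempfile

# Illustrative sketch: fetch one "layer" of float32 weights from disk on
# demand instead of loading the whole file into RAM. Layout is hypothetical.

def write_weights(path, layers):
    """Write layers back-to-back as raw float32."""
    with open(path, "wb") as f:
        for layer in layers:
            f.write(struct.pack(f"{len(layer)}f", *layer))

def read_layer(path, layer_idx, layer_len):
    """Map the file and decode just one layer's bytes."""
    offset = layer_idx * layer_len * 4       # float32 = 4 bytes
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        data = struct.unpack_from(f"{layer_len}f", mm, offset)
        mm.close()
    return list(data)

layers = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
write_weights(path, layers)
layer1 = read_layer(path, 1, 2)              # only 8 bytes actually decoded
```

For a sparse MoE only a few experts fire per token, so only a sliver of the 209GB has to come off the SSD per step; the ~1.77 tokens/second figure is the price of that disk round-trip.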
Meanwhile, GPU prices are expected to rise significantly by early 2026, with customers already paying around $14 per hour for AWS GPU spot instances, and users are urging each other to buy consumer GPUs to escape volatile cloud pricing.
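The buy-a-GPU argument is a break-even calculation, so it is worth running the numbers. The $14/hour spot figure is from the report; the consumer card price is an illustrative assumption.

```python
# Back-of-envelope: hours of cloud spot usage that pay for a consumer GPU.
# Card price is an assumed figure, not a quote.

SPOT_PER_HOUR = 14.0    # reported AWS GPU spot price
CARD_PRICE = 2000.0     # assumed high-end consumer GPU cost

breakeven_hours = CARD_PRICE / SPOT_PER_HOUR
# ~143 hours: less than a week of continuous use covers the card
```

The comparison ignores electricity, depreciation, and the fact that a consumer card is not an H100-class part, but even with generous fudge factors the break-even horizon is short enough to explain the stampede.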
While labs talk about AGI, the economics shifted: ChatGPT’s new Pro tier starts at $100 a month with roughly 5–10× the usage of Plus, and OpenAI is rolling ads into free and Go tiers based on prompt relevance.
The usage numbers back that up: Qwen‑3.6‑Plus’s trillion-token day, Meta’s 60 trillion tokens logged internally in 30 days, and researchers openly calling this a compute capacity trap, with current usage heavily subsidized.
In parallel, the attack surface of this ecosystem was laid bare when the LiteLLM PyPI package—pulled in by about 97 million monthly downloads—was compromised in versions 1.82.7 and 1.82.8, exfiltrating SSH keys and cloud credentials from over 1,000 environments within three hours.
The same threat actor had previously hidden malware in Telnyx packages using WAV steganography; axios on npm, with over 100 million weekly downloads, carried install-time malware; and Vercel’s OAuth breach exposed environment variables for hosted apps.
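The standard defense against a swapped release like the LiteLLM incident is to pin not just versions but artifact hashes, so a tampered tarball fails verification before it ever runs. The sketch below shows the core check with stdlib `hashlib`; the filename and "known-good" bytes are invented for the example (this is the idea behind pip's `--require-hashes` mode, not its implementation).

```python
import hashlib

# Illustrative hash-pinning check. The pinned digest here is computed from
# the fake payload in this example, not from any real package artifact.

PINNED = {
    "litellm-1.82.6.tar.gz":
        hashlib.sha256(b"known-good release bytes").hexdigest(),
}

def verify(name, payload):
    """Accept an artifact only if its sha256 matches the pinned digest."""
    digest = hashlib.sha256(payload).hexdigest()
    return digest == PINNED.get(name)

ok = verify("litellm-1.82.6.tar.gz", b"known-good release bytes")
bad = verify("litellm-1.82.6.tar.gz", b"tampered release bytes")
# ok is True; bad is False -- the compromised artifact never installs
```

Hash pinning would not have stopped the initial compromise, but it stops the compromised 1.82.7/1.82.8 artifacts from silently replacing a vetted build in CI.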
On the application side, Claude Code’s 512,000‑line source leaked via an npm source‑map file, OpenClaw’s audits found privilege escalation and sandbox escapes in a tool with full desktop access, and MCP is wiring agents straight into internal systems via 177,000 registered tools.
Regulators and institutions are responding with bans and constraints—Health NZ told staff to stop using ChatGPT for clinical notes, and Wikipedia now formally prohibits AI‑generated article text—treating these systems as too unpredictable to trust unguarded.
What This Means
Model IQ is no longer the main variable; control planes, cost structures, and security posture are. The pattern across stacks is that power is moving from single frontier APIs toward whoever owns the routers, harnesses, and local compute.
On Watch
Interesting
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.
Sources