Claude just bought itself a country’s worth of GPUs, GPT‑5.5 quietly became the sensible default brain, and Tencent’s Hy3 preview model is suddenly hoovering up tokens on OpenRouter. At the same time, Qwen/Gemma/DeepSeek plus aggressive inference tricks have made open/local stacks fast and cheap enough that security, observability, and platform shenanigans (Chrome’s Gemini Nano, Ollama’s Bleeding Llama) now matter more than raw model size.
The frontier is multipolar; the real fight is over who controls the pipes and the logs, not just the weights.
Key Events
/Anthropic partnered with SpaceX to access over 220,000 NVIDIA GPUs for Claude via the Colossus 1 supercluster.
/Claude Code doubled its 5‑hour rate limits for Pro, Max, and Team plans, easing previous usage bottlenecks.
/Tencent’s Hy3 preview model processed 3.66T tokens on OpenRouter, topping the weekly leaderboard.
/Google Chrome silently installed a roughly 4GB Gemini Nano model on users’ machines, raising privacy and EU legal concerns.
/Ollama disclosed a critical unauthenticated memory leak vulnerability dubbed “Bleeding Llama,” affecting local LLM deployments.
Report
Everyone is staring at GPT‑5.5 benchmarks while the real story is that Claude and a Chinese preview model just hijacked the compute and usage charts.
Underneath, open and local stacks quietly solved speed, and the weak links are now security, observability, and the platforms you thought were boring.
three poles, not two: claude, gpt‑5.5, hy3
Anthropic didn’t just raise limits; it locked in a sovereign‑scale cluster by partnering with SpaceX’s Colossus 1, gaining access to over 220,000 NVIDIA GPUs for Claude.
That extra headroom immediately showed up as doubled 5‑hour rate limits and removed peak‑hour throttling for Claude Code’s Pro, Max, and Team plans, fixing one of users’ loudest complaints.
OpenAI counter‑programmed on the quality axis instead of the hardware axis, with GPT‑5.5 Instant cutting hallucinated claims by 52.5% on high‑stakes prompts relative to its predecessor.
GPT‑5.5 simultaneously leads at least one marketplace in both usage and earnings and is reported to be roughly 4–5x cheaper than Claude Mythos for comparable capability.
The curveball is Tencent’s Hy3 preview model, which just processed 3.66T tokens on OpenRouter and grabbed the top leaderboard slot, turning what looked like a duopoly into a three‑pole race in actual usage.
the open/local triad finally looks like a stack
On the open/local side, Qwen 3.6 27B is now effectively a coding and agent workhorse, with Multi‑Token Prediction delivering roughly 2.5x faster inference on supported setups.
The same model has demonstrated context windows up to 262k tokens on 48GB GPUs, which was frontier‑only territory not long ago. Benchmarks and user reports consistently show Qwen 3.6 beating Gemma 4 on coding and agentic tasks, while Gemma is preferred for planning, language nuance, and emotional tone.
Gemma 4’s design—multi‑token prediction drafters plus decoupled attention in the 26B variant—lets it feel “bigger than it is,” helping models like Gemma‑4‑31B land high on code leaderboards despite struggling with tool calls and intricate coding.
DeepSeek V4 rounds out the stack as a terminal‑first coding agent that many users treat as a cheaper GPT‑3.5‑class model, with the company reportedly nearing a $45B valuation in its first fundraising round.
throughput is becoming a software problem
Inference speed is starting to look like a software problem, not a hardware ceiling: the open‑source GB10 Solution Atlas engine pushes Qwen‑class 35B models past 100 tokens per second in FP8 while avoiding the PyTorch stack entirely.
Multi‑Token Prediction does similar magic on mainstream stacks, giving Qwen 3.6 roughly a 2.5x decoding speedup on GPUs like the V100 when enabled.
The catch is that MTP still needs around 3GB of extra VRAM headroom and doesn’t reduce core model compute, so it’s great for wall‑clock latency but less of a FLOPs savings hack than people assume. vLLM 0.20.0 arrived with day‑0 MTP support for Gemma 4 and turnkey Docker images, while AMD’s MI355x on SGLang reportedly delivered more than a 10x throughput jump per GPU since launch.
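To make the MTP tradeoff concrete, here is a rough back-of-the-envelope sketch in Python. It only restates the two figures quoted above (roughly 2.5x faster decoding, roughly 3GB of extra VRAM) as defaults; the function name, the dataclass, and the example numbers (a 48GB card, 40GB already in use, 20 tokens per second baseline) are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class MtpEstimate:
    fits_in_vram: bool        # True if current usage plus MTP overhead fits on the card
    seconds_baseline: float   # estimated decode time without MTP
    seconds_with_mtp: float   # estimated decode time with MTP enabled


def estimate_mtp(
    total_vram_gb: float,
    used_vram_gb: float,
    baseline_tok_per_s: float,
    output_tokens: int,
    mtp_speedup: float = 2.5,        # rough speedup figure quoted in the text
    mtp_extra_vram_gb: float = 3.0,  # rough VRAM headroom quoted in the text
) -> MtpEstimate:
    """Estimate wall-clock decode time and VRAM fit with and without MTP."""
    fits = used_vram_gb + mtp_extra_vram_gb <= total_vram_gb
    baseline = output_tokens / baseline_tok_per_s
    with_mtp = output_tokens / (baseline_tok_per_s * mtp_speedup)
    return MtpEstimate(fits, baseline, with_mtp)


if __name__ == "__main__":
    # Hypothetical setup: 48GB card, 40GB in use, 20 tok/s, 1000 output tokens.
    print(estimate_mtp(48, 40, 20, 1000))
```

Note what the sketch leaves out: it models latency and memory only. Per the caveat above, the core model compute is not reduced, so a lower wall-clock number here should not be read as a matching drop in FLOPs or energy.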
On the training side, NVIDIA and Unsloth’s recipe of packed‑sequence metadata caching and better MoE routing is good for about a 25% LLM fine‑tuning speedup, which compounds with all these inference tricks.
agents are mostly rag with a gpu addiction
The agent ecosystem looks enormous at first glance: LangChain has already crossed 1B downloads despite being only a few years old. Major clouds have converged on AG‑UI as a shared frontend for agents, and LangSmith now tracks around 300M agent runs per month at Clay.
Under the hood, research using CrewAI and LangGraph shows these agents often burn far more compute than simple chatbots for relatively modest gains, which is why tools like Shadow now exist just to regression‑test their behavior.
A lot of what’s marketed as “autonomous agents” is effectively RAG with extra ceremony: benchmarks and practitioner writeups bluntly describe many agents as glorified retrieval wrappers around LLMs.
Teams are discovering observability the hard way, retrofitting logging and metrics after agent incidents, hence projects like MetaLens for Metabase‑based debugging and the broader push to treat observability, drift, and performance as first‑class design constraints instead of afterthoughts.
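As a minimal sketch of what "observability as a design constraint" can look like in practice, the Python snippet below wraps each agent step so that every call emits one structured log record with its latency and outcome. It uses only the standard library; the `observed` decorator, the logger name, and the `retrieve` stand-in are hypothetical names for illustration and are not part of LangSmith, MetaLens, or any framework mentioned above.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("agent.observability")


def observed(step_name):
    """Wrap an agent step so every call emits one structured log record."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            record = {"step": step_name, "status": "ok"}
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                # Record the failure type, then let the error propagate.
                record["status"] = "error"
                record["error"] = type(exc).__name__
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                record["latency_ms"] = round(elapsed_ms, 2)
                logger.info(json.dumps(record))
        return wrapper
    return decorator


@observed("retrieve")
def retrieve(query):
    # Stand-in for a real retrieval call; returns a canned document.
    return [f"doc about {query}"]
```

Because the record is written in a `finally` block, failed steps are logged as well as successful ones, which is usually the data teams find missing after an incident.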
platform creep and security go from vibes to exploits
The browser is now an LLM runtime whether you asked for it or not: Chrome has been caught silently dropping a roughly 4GB Gemini Nano model onto users’ machines, triggering EU legal questions about consent and data use.
On the “local is safer” side, Ollama just shipped a critical unauthenticated memory‑leak bug nicknamed Bleeding Llama, which can expose sensitive data from local LLM sessions if unpatched.
Developer tools aren’t better: VS Code’s Copilot integration started auto‑adding itself as a commit co‑author, while users report growing distrust of Microsoft’s Copilot and Windows 11 changes in general.
At the application layer, people are accidentally leaking API keys through tools like Cursor and Copilot, while persistent‑memory agents have been shown vulnerable to deliberate memory‑poisoning attacks and data exfiltration.
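One cheap guard against the accidental key leaks described above is to scan text for key-shaped strings before it is committed or pasted into an assistant. The Python sketch below is a minimal illustration: both regular expressions are hypothetical patterns chosen for the example, not the official key format of any provider, and a real setup would more likely rely on a dedicated secret scanner.

```python
import re

# Hypothetical patterns for illustration only; real key formats vary by provider.
SECRET_PATTERNS = {
    "generic_sk_key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}\b"),
    "assigned_api_key": re.compile(
        r"(?i)\bapi[_-]?key\b\s*[:=]\s*['\"][^'\"]{16,}['\"]"
    ),
}


def find_secrets(text):
    """Return (line_number, pattern_name) pairs for lines that look like secrets."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

A pre-commit hook could call `find_secrets` on each staged file and refuse the commit when the returned list is non-empty. Pattern matching like this will miss keys in unfamiliar formats and can flag harmless strings, so it reduces the risk without eliminating it.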
Even core model training has security and ethics overhangs, with lawsuits alleging large‑scale scraping of protected books to train models like Meta’s, and users expressing fresh concern about how much control they really have over where their data lands.
What This Means
The frontier story this month is less “one model to rule them all” and more a three‑way tension between hyperscale closed labs, a suddenly serious China stack, and open/local systems that have quietly fixed throughput while ops, security, and observability lag behind.
On Watch
/IBM, Cleveland Clinic, and RIKEN’s quantum hardware simulation of a 12,635‑atom protein complex is an early signal that quantum+AI workflows may move from demo to domain tool faster than expected.
/OpenAI’s Multipath Reliable Connection (MRC) protocol for large AI training clusters could become the de facto networking substrate for multi‑rack training if it spreads beyond the Open Compute Project niche.
/Apple’s planned Siri revamp that lets users pick from external AI services will test whether it becomes an AI router for other labs’ models or quietly launches a credible first‑party stack at iOS scale.
Interesting
/GPT-5.5 is considered the strongest model out of the box, but performs similarly to GPT-5.4 when given specific skills.
/The paper titled "Thinking with Visual Primitives" aims to enhance spatial reasoning in multimodal models, showcasing DeepSeek's commitment to advancing AI capabilities.
/DeepSeek is recognized for its cost-effectiveness and strong performance in coding tasks, often rivaling GPT-4, appealing particularly to students.
/Anthropic's partnership with SpaceX includes a $200 billion commitment to Google Cloud over five years, with Google investing $40 billion in Anthropic.
/OpenAI has introduced a new networking protocol called MRC for large-scale AI training clusters, now available through the Open Compute Project.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.