There isn’t a single ‘best’ model anymore: Gemini owns math and memory, Cursor/Qwen/GLM own coding, and cheap Chinese models quietly own a lot of the tokens. Agents are now good enough to ship 295k‑line apps and bad enough to wipe production databases, just as compute, data, and alignment all start to feel like real constraints.
The interesting work has moved from picking one big model to orchestrating a messy stack of specialized models, local runtimes, and safety tooling.
Key Events
/Gemini 3.2 Flash became the only model reported to solve IMO 2025 Problem 6 and scored 96.4% on LongMemEval.
/Cursor shipped Composer 2.5 and an in‑house model that outperforms Claude Opus 4.7 and GPT‑5.5 on coding benchmarks.
/Qwen 3.7 launched, while Qwen 35B A3B surpassed Gemma4 26B on coding tasks.
/Anthropic agreed to acquire @stainlessapi and will discontinue its popular SDK generator.
/OpenAI shut down its hosted fine‑tuning service, stranding startups that depended on it for model customization.
Report
Everyone is still arguing about which model is "ahead"; this month’s data says that question stopped making sense. The real split is between stacks that exploit a messy portfolio of specialized models, local runtimes, and agents, and stacks that still pretend one frontier API can do everything.
the end of the 'one best model' myth
Gemini 3.2 Flash is currently the only model reported to solve IMO 2025 Problem 6, staking out the extreme‑reasoning niche. It also hit 96.4% on the LongMemEval conversational memory benchmark, so long‑horizon dialogue looks like a Gemini specialty rather than a generic LLM feature.
On coding, Cursor’s new in‑house model is reported to outperform Claude Opus 4.7 and GPT‑5.5 on benchmarks, with Composer 2.5 marketed as its most powerful long‑running model so far.
Open‑weight and small models are not just toys: GLM 5.1 plus Bitloops scored 88 on SWE‑bench Verified, and GPT‑5.4 nano reached 76.4% on SWE‑bench.
Qwen 35B A3B now beats Gemma4 26B on coding tasks, making "best model" a workload‑dependent statement rather than a leaderboard position.
agents that ship features and drop tables
Agentic coding is now producing entire products: one dev reports building a 295k‑line platform in a month with Cursor, with first drafts arriving about 4× faster than before.
Composer 2.5’s auto mode is reportedly good enough for everyday coding when paired with Claude Code, and Codex’s new /goal command lets it grind on long‑running objectives without constant babysitting.
Smaller backends are also viable, with a 4B‑parameter coding agent scoring 87% on benchmarks and tools like Bitloops giving open‑weight models frontier‑adjacent coding performance.
The failure modes are correspondingly bigger: a Cursor agent wired through MCP deleted a Railway production database in nine seconds, and Copilot Cowork has already raised alarms over potential file exfiltration.
Debugging talk has shifted from stack traces to system design, with tools like RAG Debugger, Armorer, and LangSmith focusing on observability, run records, and pipeline evaluation for agents rather than just raw model outputs.
alignment is tightening while an uncensored market blooms
Mainstream APIs are quietly getting stricter: newer ChatGPT and Claude versions are reported to refuse more content and prepend longer ethical disclaimers than earlier releases.
Regulated users are leaning into this, as seen in 30‑plus open‑source PII models for redacting clinical discharge summaries crossing a million downloads in 20 days.
Medical folks are already pointing out that AI in medicine is likely to fail on calibration before eloquence, which makes polished mis‑calibration a concrete safety problem rather than a hypothetical.
In parallel, the uncensored segment is formalizing itself, with models like Gemma‑4‑Gembrain‑31B‑it‑uncensored‑heretic boasting a refusal rate of just 13 out of 100 and communities openly discussing LTX 2.3 for adult content workflows.
Agent platforms such as OpenClaw sit uneasily in the middle, drawing 370k GitHub stars while users simultaneously call out the need for stronger moderation of agent‑generated content.
open weights, data scarcity, and the slow squeeze
Qwen is the current open‑weight flagship: 3.7 just landed, 35B A3B is beating Gemma4 26B on coding, and 3.6 27B can hit roughly 1260 tok/s prefill on a single RTX 3090 via MTP.
At the same time, users are openly worried that the Qwen team may stop releasing large open models, echoing a broader fear that big labs will pull back open weights once their ecosystems dominate.
On the data side, a new multilingual corpus of 9.8 million CC0 documents landed just as more websites block scrapers, making high‑quality open datasets feel like an appreciating asset rather than an infinite free good.
Labs are also leaning harder on synthetic data and targeted sets like the Slop Bucket dataset of undesirable actions, blurring the line between training on the world and training on previous models’ judgments.
Commenters still talk as if open datasets and community fine‑tuning will inevitably sustain the local LLM ecosystem, but that optimism is increasingly at odds with these supply‑side signals.
compute is bending, not breaking
The hardware story is bifurcating: H100s remain expensive and hard to access on‑demand, while China’s LineShine supercomputer sidesteps US GPU bans with 2.4 million Armv9 cores delivering 1.54 exaflops.
At the other extreme, Tether fine‑tuned a 13B‑parameter model directly on an iPhone 16, and users report Qwen 3.6 running locally at roughly 2× speed on only 18GB of RAM.
Inference optimizations like MTP now routinely yield around 2× speedups, with Qwen 3.6 27B decoding near 73 tok/s on a single RTX 3090 and similar gains on Strix Halo and A10G. That’s making 20–30B local models feel "fast enough" for agents and chat even as cards like the RX 6800 XT see no benefit and many devs still describe local setups as operationally painful.
Meanwhile, cheap non‑US APIs are becoming the token workhorses—Step 3.5 Flash, MiniMax M2.5, and Ling‑2.6 already account for about 3.15 trillion tokens on OpenRouter, and DeepSeek V4 advertises useful performance at roughly $1 per month.
What This Means
The center of gravity has shifted from chasing a single frontier model to orchestrating a heterogeneous mess of specialized models, agents, and runtimes under tightening safety, data, and hardware constraints. The consensus story of "we’re early" misses that coding, inference, and open‑weight infrastructure are already in late‑stage optimization fights while alignment, governance, and product UX are still in their experimental phase.
On Watch
/Anthropic’s acquisition of @stainlessapi and the shutdown of its SDK generator has sparked calls to open‑source the tool, setting up a test case for how much control labs exert over the emerging MCP/tooling ecosystem.
/AI21 Labs laid off 110 people and is pivoting from selling general language models to AI agents, an early sign that some foundation‑model bets are being restructured around higher‑level workflows.
/Public sentiment is flashing warning lights, with backlash against AI being "forced" into daily life culminating in incidents like Eric Schmidt getting a hostile reception at a graduation speech over his AI advocacy.
Interesting
/Users have noted that Qwen models, particularly 3.6 and 3.5, excel in visual tasks, outperforming competitors in image understanding.
/Elmo, an open-source tool, allows users to scrape AI responses and evaluate prompts against various models accessed via OpenRouter, enhancing user interaction.
/A recent paper emphasizes efficient training methods for imperfect models, focusing on low Lipschitz constants for stability.
/Many failures in multi-agent systems stem from assumption propagation failures rather than hallucinations, highlighting a critical area for improvement.
/The self-evolving nature of the flux-genotype AI kernel represents a significant advancement in AI technology, allowing for dynamic model adaptation.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.
/Gemini 3.2 Flash became the only model reported to solve IMO 2025 Problem 6 and scored 96.4% on LongMemEval.
/Cursor shipped Composer 2.5 and an in‑house model that outperforms Claude Opus 4.7 and GPT‑5.5 on coding benchmarks.
/Qwen 3.7 launched, while Qwen 35B A3B surpassed Gemma4 26B on coding tasks.
/Anthropic agreed to acquire @stainlessapi and will discontinue its popular SDK generator.
/OpenAI shut down its hosted fine‑tuning service, stranding startups that depended on it for model customization.
On Watch
/Anthropic’s acquisition of @stainlessapi and the shutdown of its SDK generator has sparked calls to open‑source the tool, setting up a test case for how much control labs exert over the emerging MCP/tooling ecosystem.
/AI21 Labs laid off 110 people and is pivoting from selling general language models to AI agents, an early sign that some foundation‑model bets are being restructured around higher‑level workflows.
/Public sentiment is flashing warning lights, with backlash against AI being "forced" into daily life culminating in incidents like Eric Schmidt getting a hostile reception at a graduation speech over his AI advocacy.
Interesting
/Users have noted that Qwen models, particularly 3.6 and 3.5, excel in visual tasks, outperforming competitors in image understanding.
/Elmo, an open-source tool, allows users to scrape AI responses and evaluate prompts against various models accessed via OpenRouter, enhancing user interaction.
/A recent paper emphasizes efficient training methods for imperfect models, focusing on low Lipschitz constants for stability.
/Many failures in multi-agent systems stem from assumption propagation failures rather than hallucinations, highlighting a critical area for improvement.
/The self-evolving nature of the flux-genotype AI kernel represents a significant advancement in AI technology, allowing for dynamic model adaptation.