There’s no single 'best model' anymore: GPT‑5.4, Gemini 3.x, Claude, DeepSeek, GLM‑5 and Qwen 3.5 all dominate different slices of the frontier, while open weights creep close enough that you can run near‑frontier intelligence at home if you can afford the memory. Coding, video, and even research are already machine‑first with humans in the verification loop, and the real choke points are shifting from model size to memory, evaluation, security, and politics.
In other words, the hard part is no longer getting the AI to do things – it’s deciding which of those things to trust and institutionalize.
Key Events
/Claude became the number one app on the U.S. App Store as 1.5M users left ChatGPT.
/OpenAI rolled out GPT‑5.4 with a 1M‑token context window across ChatGPT and Codex.
/Google’s Gemini 3.1 Pro and Gemini 3 Deep Think set new highs on ARC‑AGI‑2, at 77.1% and 84.6% respectively.
/DeepSeek V4 was announced as a multimodal, 1M‑context model projected to score 83.7% on SWE‑Bench.
/Zhipu AI launched GLM‑5, a 744B‑parameter open‑weight model scoring 50 on the Artificial Analysis Intelligence Index.
Report
The frontier no longer has a single winner; it looks more like a messy Pareto front than a podium. At the same time, models and GPUs are outrunning human verification, memory, and safety engineering by a wide margin.
the fractured frontier
GPT‑5.4 currently tops LiveBench and adds a 1M‑token context window to ChatGPT and Codex. Google’s Gemini 3.1 Pro and Gemini 3 Deep Think now dominate ARC‑AGI‑2, with Pro at 77.1% and Deep Think at 84.6%, a new record for the benchmark.
On coding‑heavy evals, DeepSeek V4 is reported at 83.7% on SWE‑Bench and MiniMax M2.5 at 80.2% on SWE‑Bench Verified, while Chinese open weights like GLM‑5 and Kimi K2.5 sit within single digits of closed models on aggregate intelligence indices.
ARC‑AGI‑1 is already saturated above 95%, and Confluence has pushed ARC‑AGI‑2 to 97.9%. Even François Chollet is simultaneously predicting AGI by 2030 and insisting benchmarks won’t define it, so 'frontier' now depends entirely on which leaderboard slice you pick.
near‑frontier goes local into a memory crunch
GLM‑5 is a 744B‑parameter open‑weight model trained on 28.5T tokens and designed explicitly for long‑horizon agentic coding and systems work, yet it’s available under open terms and even runs via NVIDIA NIM for free.
Alibaba’s Qwen 3.5 line has a 4B variant reported as roughly GPT‑4o‑class and a 397B multimodal MoE flagship, with open weights and FP8 or MXFP4 quantizations aimed at local deployment.
MXFP4 and related 4‑bit formats are hitting near‑bf16 quality while users report ~2K tokens/second on dual‑3090 rigs, and llama.cpp plus vLLM are pushing everything from MiniMax 2.5 to Qwen 3.5 onto single‑GPU and NVMe‑offloaded setups.
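The appeal of MXFP4‑style formats is easy to see in miniature: a block of weights shares one power‑of‑two scale, and each element is stored as a 4‑bit float. The sketch below is illustrative only, not a bit‑accurate implementation of the OCP MX spec; the grid of FP4 (E2M1) magnitudes and the block size are the real format's, but packing and rounding details are simplified assumptions.

```python
import math

# Representable magnitudes of FP4 (E2M1), the element format used by
# MXFP4-style microscaling. Each block of values shares one
# power-of-two scale; elements are stored as 4-bit codes.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block of floats to (shared_scale, fp4_values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Pick a power-of-two scale so the largest element maps near 6.0,
    # the top of the FP4 magnitude grid.
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    quant = []
    for x in block:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        quant.append(math.copysign(mag, x))
    return scale, quant

def dequantize_block(scale, quant):
    """Reconstruct approximate floats from the shared scale and codes."""
    return [scale * q for q in quant]
```

Because both the scale and the grid are powers‑of‑two friendly, dequantization is a cheap multiply, which is why these formats reconstruct close to bf16 quality at a quarter of the memory footprint.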
All of this collides with a worsening memory shock: DRAM has moved to 'hourly pricing', RAM is forecast to be constrained through 2028, and vendors say memory now eats roughly a third of a PC’s bill of materials just as AI workloads demand ever more of it.
code and research are already machine‑first, human‑verification‑limited
Karpathy notes that around December, coding agents crossed a reliability threshold where they can handle long, multi‑step tasks autonomously, and his autoresearch loop now runs on the order of 100 experiments overnight on a single GPU.
Shopify’s Tobi Lutke reports a 53% performance gain on the Liquid codebase from that system, while Spotify says its best developers haven’t written a line of code since December because AI took over implementation.
Anthropic says 70–90% of the code for its future models is already produced by Claude, and internally about 80% of shipped code is attributed to Claude Code, mirroring patterns where Fortune‑500 engineers increasingly do review rather than writing.
At the same time, Claude Code has wiped production databases and 2.5 years of records; Amazon tied AWS outages to 'Gen‑AI assisted changes'; controlled studies show AI‑assisted developers scoring 17% lower on comprehension with 30% higher defect risk in bad codebases; and teams report 'AI brain fry' from triaging machine‑written diffs.
the agent stack is sliding from protocols to dumb pipes (and getting burned)
Developers are quietly abandoning ornate tool protocols in favor of primitives models can actually use: CLIs wrapped as tools are reported to cut token costs by ~94% versus MCP servers, and formal tests show MCP‑style calls can be up to 32× more expensive than equivalent command‑line workflows.
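The 'CLI wrapped as a tool' pattern is genuinely simple, which is most of its appeal. A minimal sketch, with a hypothetical `run_cli_tool` helper and an assumed allow‑list, might look like this: the model emits one command string, and the agent returns truncated stdout/stderr directly rather than routing the call through a tool‑protocol server with per‑call schema negotiation.

```python
import json
import shlex
import subprocess

# Hypothetical allow-list; a real deployment would scope this per task.
ALLOWED = {"ls", "cat", "grep", "git", "rg"}

def run_cli_tool(command, timeout=30):
    """Run an allow-listed shell command and return a compact JSON result."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        return json.dumps({"error": f"command not allowed: {command}"})
    proc = subprocess.run(argv, capture_output=True, text=True,
                          timeout=timeout)
    return json.dumps({
        "exit_code": proc.returncode,
        # Truncate output: capping what flows back into context is
        # where most of the token savings over protocol servers come from.
        "stdout": proc.stdout[-4000:],
        "stderr": proc.stderr[-2000:],
    })
```

The allow‑list and output truncation are doing the safety and cost work here; the protocol itself is just `subprocess`.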
Perplexity’s CTO has publicly pivoted away from MCP toward classic APIs and CLIs just as Google ships an MCP‑enabled CLI, highlighting the split between protocol maximalists and people who just want bash‑shaped tools that work.
Security scans of AI‑agent repos found vulnerabilities in 80% of projects, 38% of them critical, and 41% of official MCP servers expose tools with no authentication at all, which is exactly how you end up with agents quietly installing malware or exfiltrating data.
Meanwhile OpenClaw has more GitHub stars than React and the Linux kernel, its most‑downloaded marketplace skill is malware, roughly 15% of community skills contain malicious instructions, and over 220,000 agent instances were found exposed on the public internet without auth even as China and others move to restrict its use in government.
video, multimodal reasoning, and the verification ceiling
Seedance 2.0 can turn a single photo and prompt into a three‑minute cinematic one‑take in about two hours for roughly six dollars, and Jia Zhangke used it to create a feature‑length 'first AI movie' in three days.
Chinese studios are already using Seedance 2.0 to produce whole TV series while Disney and Paramount fire off cease‑and‑desist letters over trademarked characters, and the API launch is delayed explicitly over deepfake and copyright concerns.
Kling 3.0 tops text‑to‑video leaderboards with 15‑second 1080p plus 4K output and motion‑control, LTX‑2.3 can do 30‑second 480×832 clips on a single 12GB card, and NotebookLM has jumped from research summarization to generating cinematic videos and entire slide decks from your notes.
In parallel, DeepMind’s Aletheia has solved six open research‑level math problems, Confluence has pushed ARC‑AGI‑2 to 97.9%, and auto‑research loops now mutate and re‑train models unsupervised. The marginal cost of 'measurable execution' is collapsing toward zero, but human verification bandwidth and institutional risk tolerance have not moved.
What This Means
The center of gravity has shifted from 'can the model do X' to 'how do we verify, govern, and afford the flood of things models are already doing,' with memory, evaluation, and security now harder bottlenecks than raw parameters or FLOPs.
On Watch
/WebMCP’s early‑preview integration into Chrome, backed by Google, Microsoft and W3C, could turn websites into first‑class tool surfaces for agents if security and monetization concerns don’t choke adoption.
/DeepSeek V4’s choice to optimize for Huawei and Cambricon while blocking Nvidia and AMD hints at a hardware stack that fractures along geopolitical lines rather than purely technical ones.
/The AI‑driven RAM crisis and 'hourly pricing' for DRAM may start killing entire product lines and smaller device makers before AGI arrives, reshaping who can practically run near‑frontier models locally.
Interesting
/DeepSeek V4 is optimized for Huawei and Cambricon chips, marking a shift away from reliance on NVIDIA hardware.
/Anthropic's Claude models for the Pentagon are reportedly 1-2 generations ahead of consumer versions, indicating a significant technological gap.
/Industrial-scale distillation attacks on AI models have been identified, involving over 24,000 fraudulent accounts.
/4% of public GitHub commits are authored by Claude Code, with projections suggesting this could exceed 20% by 2026, indicating a growing reliance on AI in coding.
/A study found that AI benchmarks neglect 92% of the US labor market, focusing primarily on coding.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.