Enterprises just discovered that throwing 100k-token prompts at everything is wildly expensive, right as DeepSeek kicks off a token price war and local/browser runtimes become genuinely useful. The biggest capability jumps this round came from meta-optimizers and orchestration, not new base models, while 'agents' quietly turned into a security and infra problem.
The interesting battle now is less about whose model is smartest and more about who can run many of them cheaply, safely, and everywhere at once.
Key Events
/DeepSeek permanently cut V4 Pro API prices by 75% to $0.435 per 1M input tokens and $0.87 per 1M output tokens, about 11.5x cheaper than GPT‑5.5.
/DeepSeek is raising $10.29B to scale open-source AI models while China restricts overseas travel for its AI talent.
/Microsoft began canceling internal Claude Code licenses and reportedly halted AGI projects because token-based costs became unsustainable.
/Anthropic agreed to pay SpaceX about $1.25B per month for AI compute capacity starting in mid‑2026.
/An OpenAI model autonomously solved an Erdős discrete-geometry problem, disproving a central conjecture and finding a new, better family of constructions.
Report
Everyone is arguing about AGI timelines while the first large-scale experiment in 'AI eats software engineering' is quietly failing its cost-benefit test.
At the same time, tokens are being commoditized by players like DeepSeek, and meta-optimizers are squeezing nearly 3x performance gains out of existing models without new base weights.
the tokenmaxxing crash
Quarterly token volume is up ~17,000x in four years while prices collapsed, creating a culture of tokenmaxxing as a proxy for progress.
Now the bill has arrived: Microsoft is canceling internal Claude Code licenses, citing unsustainable token-based costs, and has reportedly halted AGI projects as well.
Uber’s COO reports no measurable productivity gains from AI even though the company burned through its AI budget early, and says the tools are pricier than engineers.
Salesforce still plans to spend around $300M on Anthropic tokens this year with AI doing only 30–50% of its workload, while some startups blow $1.3M a month on tokens.
Median coding-agent requests already stuff in 96k tokens—longer than The Great Gatsby—so the default workflow is literally to overcontext everything.
deepseek and the commoditization of tokens
Into that mess walks DeepSeek V4 Pro, permanently cutting prices by 75% to about $0.435/M input and $0.87/M output tokens—roughly 11.5× cheaper than GPT‑5.5.
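To make these prices concrete, the arithmetic can be sketched as below. The prices, the 75% cut, the 11.5× ratio, and the 96k-token median request all come from the text. Two things are assumptions made only for illustration: that the 11.5× ratio applies to the input price, and that a request produces 2,000 output tokens.

```python
# Back-of-envelope token pricing using the figures quoted in the text.
# Prices are USD per 1M tokens, after the 75% cut.
INPUT_PRICE = 0.435
OUTPUT_PRICE = 0.87
CUT_FRACTION = 0.75
GPT_RATIO = 11.5               # assumption: the quoted ratio applies to input price
MEDIAN_INPUT_TOKENS = 96_000   # median coding-agent request size from the text
ASSUMED_OUTPUT_TOKENS = 2_000  # hypothetical: the text gives no output size


def price_before_cut(price_after: float, cut_fraction: float) -> float:
    """Invert a percentage cut: price_after = price_before * (1 - cut_fraction)."""
    return price_after / (1.0 - cut_fraction)


def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in USD of one request, given per-1M-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


pre_cut_input = price_before_cut(INPUT_PRICE, CUT_FRACTION)
implied_gpt_input = INPUT_PRICE * GPT_RATIO
cost_per_request = request_cost(
    MEDIAN_INPUT_TOKENS, ASSUMED_OUTPUT_TOKENS, INPUT_PRICE, OUTPUT_PRICE
)

print(f"Input price before the cut: ${pre_cut_input:.2f} per 1M tokens")
print(f"Implied GPT-5.5 input price: ${implied_gpt_input:.2f} per 1M tokens")
print(f"One median-sized request: ${cost_per_request:.4f}")
```

Under these assumptions, a median-sized request costs a little over four cents, and nearly all of that is input context rather than generated output.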
DeepSeek is simultaneously raising $10.29B specifically to scale open-source-style models rather than a closed SaaS platform, a very different capital story from OpenAI.
China is now restricting overseas travel for DeepSeek and other AI talent, effectively treating its model weights and people as strategic assets.
OpenRouter and similar brokers are already routing huge volumes—25T tokens/week—through cheaper models like DeepSeek after its cuts, making per-token price a first-class competitive lever.
The background hum in forums is that AI vendors aren’t trustworthy and token pricing is opaque, which makes a permanently cheap, open-leaning player look less like a discount and more like a wedge.
meta-optimizers are the real 'new models'
A single PyTorch-based 'universal optimizer' almost tripled Gemini Flash’s ARC‑AGI score from 32.5% to 89.5% while cutting cloud costs by ~40% across six tasks, without a new base model.
On the code side, GPT‑5.5 hits 70% on the new DeepSWE benchmark, which involves editing ~668 lines across seven files per task—real engineering, not toy LeetCode.
SWEBench Pro now looks artificially harsh for GPT‑5.5 because 68.5% of its 'failures' came from broken tests, implying an effective score closer to 86.7%.
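One way to read that re-scoring is sketched below. It assumes the "effective score" counts every failure caused by a broken test as a pass, which the text does not state. The raw score is not reported here, so the sketch backs it out from the two quoted figures rather than asserting it. A different method, such as dropping the broken tasks from the denominator, would give a different number.

```python
def effective_pass_rate(raw_pass_rate: float, broken_share_of_failures: float) -> float:
    """Re-score by treating failures caused by broken tests as passes (assumed method)."""
    failures = 1.0 - raw_pass_rate
    return raw_pass_rate + broken_share_of_failures * failures


def implied_raw_pass_rate(effective_rate: float, broken_share_of_failures: float) -> float:
    """Invert effective_pass_rate to recover the raw score it would imply."""
    return (effective_rate - broken_share_of_failures) / (1.0 - broken_share_of_failures)


BROKEN_SHARE = 0.685    # share of failures attributed to broken tests (from the text)
EFFECTIVE_SCORE = 0.867  # effective score quoted in the text

raw = implied_raw_pass_rate(EFFECTIVE_SCORE, BROKEN_SHARE)
print(f"Implied raw score: {raw:.1%}")
print(f"Round trip back to effective: {effective_pass_rate(raw, BROKEN_SHARE):.1%}")
```

Under this reading, the quoted figures imply a raw score in the high 50s, so the broken tests would account for a gap of roughly 29 percentage points.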
Meanwhile an OpenAI general-purpose model just disproved a long-standing Erdős conjecture in discrete geometry and discovered a new, better family of constructions, a qualitative shift from 'autocomplete for math proofs.'
The pattern is that orchestration, evaluation, and targeted fine-tuning are driving the biggest step-changes, while the marketing still talks like it’s all about bigger base models.
local-first is no longer cosplay
On the hardware fringe, AMD + Vulkan is quietly becoming a serious LLM platform: users report ~20% speedups over ROCm and RX 7900 cards outpacing Nvidia 3090s for local inference.
A dual-RTX 3060 setup is decoding Qwen 3.6‑27B at 30–50 tokens/s, while a single 3090 can push the same model to ~164 tokens/s with the right configuration.
llama.cpp keeps squeezing more from this hardware, with Multi-Token Prediction updates yielding up to 7× faster generation on Qwen 27B-class models and BeeLlama hitting ~178 tokens/s on a 3090.
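To put those decode rates in wall-clock terms, here is a minimal sketch that converts tokens per second into the time needed to generate one answer. The rates are the ones quoted above. The 1,000-token answer length is a hypothetical chosen for illustration, and the calculation ignores prompt processing, which adds further latency on long contexts.

```python
# Decode rates quoted in the text, in tokens per second.
RATES_TOKENS_PER_S = {
    "dual RTX 3060 (low end)": 30,
    "dual RTX 3060 (high end)": 50,
    "single RTX 3090": 164,
    "BeeLlama on a 3090": 178,
}

ANSWER_TOKENS = 1_000  # hypothetical answer length, not from the text


def decode_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate (prefill not included)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


for setup, rate in RATES_TOKENS_PER_S.items():
    seconds = decode_seconds(ANSWER_TOKENS, rate)
    print(f"{setup}: {seconds:.1f} s for {ANSWER_TOKENS} tokens")
```

At these rates the same answer takes tens of seconds on the dual-3060 setup and around six seconds on a tuned 3090.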
In parallel, WebGPU is turning browsers into runtime environments: PrismML’s 3GB Bonsai Image 4B text-to-image models run fully client-side with 1‑bit/ternary weights, and Local Ghost serves Qwen2.5 offline in-browser.
All this is happening against a backdrop where GPUs are still painfully expensive and many users on lower-end hardware treat lightweight quantized models as the only way open source is actually usable.
agents are now a security problem with a UI
The most interesting agent work right now reads like security engineering, not UX: Runtime’s YC-funded sandboxed coding agents, Gemini Managed Agents’ Linux sandbox, and Edge.js’s Node-in-WASM all treat code execution as the primitive.
Protocol-wise, MCP is standardizing how models talk to tools and data across 10,000+ servers and is now moving to a stateless design, yet scans already find ~15.3% of public servers vulnerable, a rate serious enough to warrant NSA warnings.
RAG pipelines still fail mainly on retrieval—about 60% of breakdowns—yet those same pipelines are being wired directly into agent toolchains.
The surrounding software supply chain is porous: a poisoned VS Code extension exfiltrated ~3,800 private GitHub repos, the Megalodon campaign hit 5.5k more, and a Hugging Face dataset stayed poisoned for six months.
In other words, 'autonomous agent' increasingly means 'scriptable front-end on your production environment,' with an attack surface expanding faster than most security teams are staffed.
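What "code execution as the primitive" means in practice can be sketched in a few lines. The example below is an illustration only, not how Runtime, Gemini Managed Agents, or Edge.js work. It runs model-generated Python in a separate process with a timeout, a throwaway working directory, and an allowlisted environment so API keys are not inherited. The function name `run_untrusted` and the allowlist are hypothetical. This is process-level hygiene rather than a real sandbox: the child can still reach the network and the filesystem, which is why production systems reach for containers, VMs, or WASM.

```python
import os
import subprocess
import sys
import tempfile

# Environment variables the child may inherit. Everything else, including
# any API keys or tokens in the parent environment, is dropped.
ALLOWED_ENV = ("PATH", "SYSTEMROOT", "LD_LIBRARY_PATH")


def run_untrusted(code: str, timeout_s: float = 5.0) -> dict:
    """Run a Python snippet in a child process with basic guard rails.

    NOT a security boundary: the child can still reach the network and
    read any file the current user can read.
    """
    env = {k: os.environ[k] for k in ALLOWED_ENV if k in os.environ}
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=workdir,          # scratch directory, deleted afterwards
                env=env,              # allowlisted variables only
                capture_output=True,
                text=True,
                timeout=timeout_s,    # child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return {"timed_out": True, "returncode": None, "stdout": "", "stderr": ""}
    return {
        "timed_out": False,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


result = run_untrusted("print(6 * 7)")
print(result["stdout"].strip(), result["timed_out"])
```

The design choice worth noting is that the caller only ever sees a structured result (exit code, captured output, timeout flag), which is the same contract a stronger sandbox would expose.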
What This Means
The center of gravity is drifting away from 'one big frontier model' toward a messy stack of cheaper tokens, local runtimes, aggressive optimizers, and heavily sandboxed agents. The interesting part is that the real constraint now looks less like 'can models do it?' and more like 'can anyone afford, secure, and orchestrate them at scale?'
On Watch
/Open document-understanding is heating up fast: the 4B-parameter NuExtract3 VLM (Apache‑2.0) plus Unsiloed Parser v3.1’s 88.0 score on olmOCR‑Bench suggest PDFs and invoices might soon be effectively 'solved' for agents.
/Heretic can strip guardrails from Meta’s Llama 3.3 in under 10 minutes, spawning 3,500+ decensored variants, which is a volatile mix with MCP-connected agents and already-leaky supply chains.
/State-backed compute bets—Anthropic’s $1.25B/month SpaceX contract and the U.S. DoC’s $2B quantum-equity program—hint that core AI infrastructure may start to resemble regulated utilities more than ordinary cloud products.
Interesting
/DeepMind's research has enabled LLMs to solve nine open Erdős problems and prove 44 OEIS conjectures through formal proofs.
/The ARC Prize 2026 competition saw Tufa Labs achieve a high score of only 1.17%, highlighting the challenges in AGI development.
/A 6-person team is developing task-specific AI models that are reported to be 4-8 times faster than existing models from OpenAI or Anthropic.
/The Behavioral Credibility Trilemma indicates that no reinforcement learning policy can achieve maximum helpfulness, optimal calibration, and full autonomy simultaneously under certain conditions.
/Open-source LLMs are facing challenges with long reasoning jailbreaks, indicating vulnerabilities even with defenses in place.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.