Benchmarks like ARC‑AGI‑3 say today’s LLMs are nowhere near general intelligence even as executives declare AGI basically done, while the real action is shifting to Chinese/open models and increasingly serious local inference stacks. At the same time, Anthropic’s Mythos looks like a genuine exploit-finding monster just as its Claude Code ecosystem leaks and throttles, and the AI tooling supply chain (LiteLLM, Trivy, telnyx) got hit with sophisticated compromises.
The frontier now looks less like one lab’s roadmap and more like a messy, multi-polar arms race across models, infra, and security holes.
Key Events
/Claude Code leaked in full, exposing 512k lines of source via an npm map file, and Anthropic issued takedowns for 8,000+ copies.
/Claude Mythos hit 93.9% on SWE-bench Verified and uncovered thousands of zero-days, including a 27-year OpenBSD bug.
/Google released Gemma 4 under Apache 2.0 with 31B/26B variants, 256K context, and over 10M downloads in week one.
/A supply-chain attack backdoored LiteLLM and Trivy, exfiltrating SSH keys and cloud credentials from tens of thousands of environments.
/OpenAI shut down Sora after costs hit about $1M/day, and Disney exited a planned $1B partnership.
Report
ARC‑AGI‑3 just told us frontier models are basically toddlers at interactive abstraction, even as NVIDIA and OpenAI executives publicly declare AGI here or imminent.
Meanwhile the strongest progress signal isn’t another closed US model, it’s the mix of Chinese/open stacks like Qwen and GLM‑5.1, Google’s Gemma 4, and brutal infra hacks like TurboQuant and vLLM quietly rewriting who actually controls capability and cost.
the agi benchmark crash‑test
ARC‑AGI‑3 is 135 interactive reasoning environments where humans score 100% but frontier models max out around 0.37%, far below even basic competence.
The benchmark was explicitly designed to test how systems learn new tasks rather than regurgitate training data, after earlier ARC versions were partially ‘cheated’ by memorization.
Seed IQ then landed 95% Relative Human Action Efficiency on ARC‑AGI‑3 at launch—taking ~1.026 actions per human action—showing specialized systems can nail the test even as general LLMs flail.
Yet NVIDIA’s Jensen Huang is already claiming AGI is achieved, OpenAI’s president says we’re 70–80% there within a couple of years, and major forecasters cluster around 2027–28 AGI timelines despite this gap.
That optimism coexists with projections of 9.3M US jobs displaced within 2–5 years and half of entry‑level white‑collar roles evaporating, even while today’s models can’t consistently solve ARC‑style puzzles.
the new frontier map: china + open models
On coding and agents, the sharpest blades now come from Chinese and open labs: GLM‑5.1 scores 58.4 on SWE‑Bench Pro, beating Opus 4.6 and GPT‑5.4 while shipping as MIT‑licensed open weights.
MiniMax M2.7 matches GPT‑5.4 and Opus 4.6 on core tasks at roughly 20× lower cost and 2–4× higher speed, while also hitting 91% on MMLU.
Alibaba’s Qwen‑3.6‑Plus just became the first model on OpenRouter to chew through over 1 trillion tokens in a single day, and it already outperforms Claude Opus on key coding benchmarks like SWE‑Bench and terminal‑bench.
Gemma 4, Google’s Apache‑licensed family, adds a Western open pole with 31B and 26B variants that run locally, handle 256K context, and beat GPT‑5.2/Qwen‑3.5 on several benchmarks despite being 10× smaller than some peers.
Stack these with DeepSeek V3 matching Claude Sonnet on most coding tasks and Xiaomi’s MiMo‑V2‑Pro ranking #3 globally on agent benchmarks at a fraction of Opus pricing, and ‘frontier’ stops being synonymous with US closed models.
local inference’s quiet land grab
TurboQuant and related tricks are turning laptops into credible inference servers: Google’s algorithm cuts KV‑cache memory by at least 6× and can yield up to 8× speedups while preserving accuracy.
Users are already running 100K‑token conversations on MacBooks and 72K‑context Llama‑70B on dual consumer GPUs purely via KV‑cache compression techniques.
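The core idea behind these KV‑cache compression wins can be illustrated with a minimal sketch: store each attention head's keys and values as int8 plus a per‑head scale instead of float32. This is a generic round‑to‑nearest quantization toy, not the actual TurboQuant algorithm, and the tensor shapes are illustrative.

```python
import numpy as np

# Toy per-head int8 quantization of a KV-cache tensor.
# Shapes and method are illustrative only -- real systems (TurboQuant etc.)
# use more sophisticated schemes and often 4-bit packing for ~8x savings.

def quantize_kv(kv: np.ndarray):
    """Quantize a float32 KV tensor (heads, seq_len, head_dim) to int8."""
    scale = np.abs(kv).max(axis=(1, 2), keepdims=True) / 127.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(8, 1024, 64).astype(np.float32)
q, scale = quantize_kv(kv)

ratio = kv.nbytes / q.nbytes                      # int8 vs float32 storage
err = np.abs(dequantize_kv(q, scale) - kv).max()  # worst-case rounding error
print(f"compression: {ratio:.0f}x, max abs error: {err:.4f}")
```

Even this naive scheme gives 4× memory savings with small reconstruction error; packing to 4 bits doubles that, which is where the 6–8× figures above come from.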
Gemma 4’s multimodal 31B variant runs on machines with as little as 6GB of RAM, and even on iPhone‑class devices at ~40 tokens/second, while MiniMax M2.7 and Qwen 3.5 variants hit >90% MMLU locally on Apple Silicon.
Frameworks like MLX, Lemonade, Ensu, GAIA, Ollama, and vLLM now wrap this into one‑click local servers, even as heavy users complain about burning $2K/month on Claude tokens and agents silently racking up hundreds of dollars in API calls.
The result is a bifurcation where labs like Qwen push to 1T tokens/day in the cloud while a growing fraction of serious workloads quietly move to hybrid or fully local stacks to escape pure API economics.
anthropic: mythos on offense, claude code on fire
Claude Mythos is the first model that genuinely looks ‘new’ for software security: it scores 93.9% on SWE‑Bench Verified, finds zero‑days 100× more often than Opus 4.6, and unearthed a 27‑year‑old OpenBSD bug.
Security researchers say they’ve found more bugs with Mythos in weeks than in their entire careers, and Goldman Sachs plus the US Treasury are lining up to use it for defensive scanning.
At the same time, Anthropic accidentally leaked the entire 512,000‑line Claude Code CLI via an npm map file, then fired off over 8,000 copyright takedowns while AMD’s AI director publicly called Claude ‘regressed’ and untrustworthy for complex engineering.
Claude Code users report hitting usage limits far faster than expected, getting locked out for hours, and seeing uptime slip below 99%, just as Anthropic bans OpenClaw access and moves those workflows onto higher‑priced per‑task models.
Mythos, plus Project Glasswing and Palantir’s Maven AI becoming a $13B ‘program of record’, shows AI is now central to cyber offense and defense, but Anthropic’s own leak and throttling underline how brittle the surrounding tooling still is.
ai’s supply chain is now the primary attack surface
TeamPCP’s campaign against the AI tooling stack is a full dress rehearsal for LLM‑era supply‑chain attacks. After compromising Trivy’s CI pipeline, attackers pushed a tainted 0.69.4 release, and the same campaign shipped backdoored LiteLLM versions 1.82.7 and 1.82.8 to PyPI, exfiltrating SSH keys, AWS credentials, and Kubernetes secrets from an estimated 47,000 users and over 1,000 cloud environments.
The same crew slipped steganographic malware into telnyx 4.87.1/4.87.2 via WAV files, again harvesting credentials from a package downloaded roughly 30,000 times a day.
LiteLLM alone had ~97M monthly downloads and sits underneath higher‑level agent frameworks like DSPy and CrewAI, so a 40‑minute compromise cascaded into hundreds of thousands of machines across 36 cloud environments.
What previously looked like ‘just another LLM proxy’ or ‘just a scanner’ is now functionally operating as critical security infrastructure, because a single poisoned dependency can end up owning agents, CI, and production keys in one move.
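The basic defense against this tainted‑release pattern is digest pinning: record an artifact's hash before the compromise window and refuse anything that no longer matches (this is what pip's `--require-hashes` mode enforces). A minimal sketch of the check itself, with purely illustrative byte strings rather than real package data:

```python
import hashlib

# Sketch of digest-pinned artifact verification. The "artifact" bytes here
# are stand-ins; in practice the pin comes from a lockfile recorded at a
# known-good point in time.

def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    """Return True only if the artifact matches the pinned digest."""
    return hashlib.sha256(data).hexdigest() == pinned_sha256

good = b"wheel contents as originally published"
pin = hashlib.sha256(good).hexdigest()  # recorded at pin time

print(verify_artifact(good, pin))                    # untouched release
print(verify_artifact(good + b"\x00backdoor", pin))  # tainted re-release
```

A re‑published release with injected code changes the digest, so the tainted Trivy/LiteLLM builds described above would fail this check on any machine that pinned hashes before the compromise.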
What This Means
The capability frontier is fragmenting and drifting toward open, Chinese, and local stacks just as the security and benchmark reality is loudly contradicting AGI marketing. The next few years look less like a single AGI race and more like an uneven arms race between infra‑obsessed labs, brittle agent/tool ecosystems, and attackers who now understand exactly where to poke.
On Watch
/Palantir’s Maven AI is now a formal Pentagon program of record with funding targeting $13B and is moving from target ID to strategy and strike assistance, pushing AI much deeper into the command chain.
/Meta’s Muse Spark debuts as a closed personal‑superintelligence model with multimodal reasoning and native multi‑agent orchestration, ranking just behind Gemini 3.1 Pro and GPT‑5.4 on the AA Index and hinting at ‘agents as product’ rather than tools.
/The MCP ecosystem is exploding in tools and downloads, but early data show 28% timeout failure rates and 32× higher token costs than equivalent CLIs, plus almost no guidance in 98% of tool descriptions—a fragile foundation for large‑scale agents.
Interesting
/Zhipu AI's GLM-OCR model scored 94.62 on OmniDocBench V1.5 with only 0.9B parameters, a notable win for efficient model design.
/The open-source community has developed models that produce video outputs of similar quality to Sora's, indicating a shift toward more accessible local generation.
/WebGPU in browsers has outperformed PyTorch on datacenter GPUs in certain benchmarks, indicating potential for browser-based solutions.
/The Kimi K2.5 model, with 1 trillion parameters, runs locally on a MacBook Pro with M2 Max and 96GB memory.
/FlexTensor allows running large models like Llama-3.1-405B on a single 180GB GPU by utilizing host RAM as an extension of GPU memory.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.