This month was peak dissonance: CEOs declared AGI “done” while ARC‑AGI‑3 showed frontier models at basically 0% on the one benchmark designed to catch real generalization. At the same time, Sora’s shutdown, LiteLLM’s supply‑chain compromise, and Siri’s pivot into a multi‑model router made it clear that economics, security, and orchestration layers now matter more than any one model drop.
Under the hype, compression tricks, local hardware, and a fast-maturing Chinese/open stack are quietly pulling power away from single cloud APIs toward a messy, multi-polar AI ecosystem.
Key Events
/OpenAI is shutting down Sora after incurring roughly $15M in daily operational costs.
/ARC-AGI-3 launched as an unsaturated agentic intelligence benchmark where current frontier models score below 1%.
/The ARC Prize 2026 opened with $2M in rewards tied to performance on ARC-AGI-3.
/Compromised LiteLLM releases briefly shipped credential-stealing malware to a package with 97M monthly downloads.
/TurboQuant was introduced as a KV-cache compression method that cuts memory use about 6x without accuracy loss.
Report
Jensen Huang says we’ve already hit AGI. At the same time, the new ARC‑AGI‑3 benchmark shows frontier models scoring under 1% on tasks every human can solve on first contact.
the agi we allegedly already have
Jensen isn’t alone: the person who coined “AGI” now argues we’ve achieved it as originally envisioned. ARC‑AGI‑3 is explicitly designed to test learning in novel environments rather than regurgitating training data.
On that benchmark, current frontier models sit below 1% while humans are at 100%. Meanwhile, some commenters argue that once AGI is genuinely reached, ASI will follow extremely quickly with almost no experiential buffer in between.
Others counter that today’s AGI declarations are mostly marketing spin, accusing industry leaders of redefining the term to boost stock prices and product hype.
video generation: sora’s faceplant vs workflow-native upstarts
OpenAI’s move to shut down Sora came after it reportedly burned about $15M per day in operational costs while failing to build a sticky user base.
The fallout was harsh enough that Disney walked away from a planned $1B investment deal with OpenAI. Meanwhile, Dreamina Seedance 2.0 is rolling into CapCut and ComfyUI, turning screenplays plus multimodal references into full films directly inside creator workflows.
Grok Imagine is quietly topping the DesignArena video leaderboard for editing, while Google’s SparkVSR upscaler is slated to ship for free, nudging creators toward modular stacks rather than single mega‑platforms.
AI‑generated content is projected to surpass human‑written output in 2025, and Wikipedia has already banned AI‑generated article text to protect perceived quality.
ai tooling as the new supply‑chain kill zone
The LiteLLM Python package was briefly compromised in the 1.82.x line, turning a simple `pip install litellm` into a credential‑stealing operation.
With roughly 97 million downloads per month, that wrapper suddenly became a high‑fanout malware distribution channel for SSH keys and cloud credentials.
Attackers got in via compromised CI credentials and even produced a fake SOC 2 report, undercutting the usual enterprise “vendor due diligence” narrative.
Around the same time, Aqua Security’s Trivy GitHub Action was hijacked with stolen maintainer creds, and malicious commits were force‑pushed across almost all version tags, infecting over 1,000 cloud environments.
All this lands in an ecosystem where 93% of audited AI agent frameworks still rely on unscoped API keys, and popular agents like OpenClaw routinely expose API keys, SSH, and full shell on install.
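One practical mitigation these incidents argue for is pinning artifacts by digest before anything executes. A minimal sketch of that check, generic and not tied to any real package's digests:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large wheels never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifact(path: Path, pinned_digest: str) -> bool:
    """Accept an artifact only if it matches the digest recorded at pin time."""
    return sha256_of(path) == pinned_digest.lower()
```

pip's own hash-checking mode (`--require-hashes` with digests in a requirements file) applies the same check fleet-wide, and would have rejected a swapped release even with valid index credentials.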
compression, not just cuda, is driving the next scaling jump
Google’s TurboQuant shows you can slash KV‑cache memory by about 6x and still keep model accuracy intact. It does this with a 3‑bit representation that requires no retraining or fine‑tuning, effectively turning long‑context support into a deployment‑time choice rather than a training‑time one.
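TurboQuant's actual scheme isn't public in this sketch; the NumPy snippet below is a generic symmetric 3-bit per-channel quantizer, shown only to illustrate the mechanics of trading KV-cache bits for a small per-channel scale:

```python
import numpy as np

def quantize_3bit(x: np.ndarray, axis: int = -1):
    """Symmetric 3-bit quantization: integer codes in [-3, 3] plus one
    float scale per channel. Storing 3-bit codes instead of fp16 values
    is roughly a 5x memory cut before packing overheads."""
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 3.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero channels
    q = np.clip(np.round(x / scale), -3, 3).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate values; error per element is at most scale/2."""
    return q.astype(np.float32) * scale
```

The "no retraining" property corresponds to the fact that this runs purely at inference time on cached keys and values; the open question any such scheme must answer is whether the rounding error stays below what attention can tolerate.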
In parallel, Memory Sparse Attention pushes context windows toward the 100‑million‑token range with only modest performance loss. Throughput work is matching that trend, with Qwen 3.5 27B reaching about 1.1 million tokens per second on a large B200 cluster under vLLM.
Yet users already report Qwen models devolving into gibberish beyond roughly 50k tokens and are turning to explicit memory layers or tiny pointer files instead of naive huge‑context RAG to avoid wasted tokens.
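The "tiny pointer file" pattern is simple to sketch: a small index maps topics to on-disk notes, so the agent reads one targeted file instead of dumping everything into context. The file names and JSON layout below are illustrative, not any specific tool's format:

```python
import json
from pathlib import Path

def load_relevant(pointer_file: Path, topic: str) -> str:
    """Resolve a topic through a tiny pointer index and read only that note.

    The index stays small enough to live in the prompt permanently,
    while the bulky notes stay on disk until actually needed."""
    pointers = json.loads(pointer_file.read_text())
    note_path = pointers.get(topic)
    if note_path is None:
        return ""  # unknown topic: retrieve nothing rather than guess
    return (pointer_file.parent / note_path).read_text()
```

Compared to naive huge-context RAG, the token cost per query is bounded by the one note you load, not by everything the agent has ever seen.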
local stacks, chinese labs, and assistants as routers are fusing into one story
On commodity hardware, WebGPU demos now run sizable models at around 50 tokens per second directly in the browser, with no server in sight. At the hardware layer, Apple’s Mac Mini and Studio are increasingly used as local inference boxes, and Intel is preparing a high‑VRAM GPU explicitly pitched at local AI workloads.
Chinese labs are flooding that local and hybrid ecosystem: Qwen 3.5’s largest variant is widely cited as the best local coding model, while Kimi’s K2.5 runs on a MacBook Pro via SSD streaming despite being in the trillion‑parameter class.
On the UX side, Apple is turning Siri into an AI router that can dispatch queries to services like Gemini, Claude, Alexa, and Meta AI via an Extensions system and a dedicated AI section of the App Store.
Messaging platforms like Telegram and productivity tools like Basecamp are exposing agent‑oriented APIs, while OpenRouter offers a single surface to mix open and closed models, so the assistant front‑end increasingly becomes a traffic cop over many back‑ends.
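The "traffic cop" pattern reduces to a classifier plus a dispatch table. The back-end names below are placeholders, not real provider integrations:

```python
from typing import Callable

def classify(query: str) -> str:
    """Toy keyword-based intent classifier; a production router would
    combine rules with a small model and fall back to a general back-end."""
    q = query.lower()
    if any(w in q for w in ("code", "function", "bug", "stack trace")):
        return "coding"
    if any(w in q for w in ("timer", "lights", "thermostat")):
        return "home"
    return "general"

# Placeholder handlers standing in for calls to different model providers.
HANDLERS: dict[str, Callable[[str], str]] = {
    "coding":  lambda q: f"[coding-backend] {q}",
    "home":    lambda q: f"[home-backend] {q}",
    "general": lambda q: f"[general-backend] {q}",
}

def route(query: str) -> str:
    """Dispatch a query to the back-end chosen by the classifier."""
    return HANDLERS[classify(query)](query)
```

The interesting design consequence is that the router, not any single model, owns the user relationship, which is exactly why the front-end layer is becoming the contested ground.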
What This Means
The common thread is that capability headlines—AGI, cinematic video, trillion‑parameter models—are outrunning both hard benchmarks and the security, economic, and routing layers underneath. Power is sliding from single monolithic platforms toward benchmarks like ARC‑AGI‑3, workflow‑native tools, local or regional stacks, and multi‑model assistants that sit on top of an increasingly fragile supply chain.
On Watch
/Bifrost, a Go-based alternative to LiteLLM, is gaining attention as a faster, simpler router in the wake of the LiteLLM compromise, hinting at a shift away from high-fanout Python wrappers.
/A federal judge’s decision that AI hiring tools can be legally challenged, alongside Health NZ banning ChatGPT for clinical notes, signals that sector-specific walls around powerful models may harden quickly.
/YouTube’s experiments asking users whether videos feel like “AI slop” and Wikipedia’s ban on AI-written article text suggest content provenance and quality scores are about to become visible parts of the consumer UX.
Interesting
/Epoch confirmed that GPT-5.4 Pro solved a frontier math problem, highlighting its potential for advanced mathematical reasoning.
/The Nemotron-Cascade-2-30B-A3B model from NVIDIA, with only 3B active parameters, has won multiple competitions in 2025.
/Xiaomi's MiMo-V2-Pro ranks #3 globally on agent tasks, competing closely with Anthropic's models.
/Local LLMs with 14B to 80B parameters may soon match Opus 4.6's performance for coding tasks.
/An open-source memory layer for AI coding agents achieved an 80% F1 score on the LoCoMo benchmark, outperforming standard RAG baselines.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.