Small open models running on cheap or old hardware have quietly crossed the “good enough” line, especially when paired with smart RAG and agent design, so raw model size now matters less than systems engineering. At the same time, token usage is exploding faster than prices are falling, making AI bills and jailbreak/safety issues the real chokepoints just as Copilot’s default status in the assistant wars starts to crack.
The interesting story isn’t AGI; it’s that AI is turning into leaky, expensive infrastructure that everyone is already hooked on.
Key Events
/Heretic removed guardrails from Llama 3.3 in under 10 minutes, spawning over 3,500 decensored variants.
/Qwen 3.6 hit about 1600 tps with 64-way concurrency on dual RTX PRO 6000 GPUs using vLLM.
/Token usage grew roughly 17,000× in four years even as token prices fell, driving AI bills sharply higher.
/Nvidia introduced a Pixel Diffusion Decoder and ComfyUI node to replace traditional VAE/RAE decoders for high-res image generation.
/Microsoft Copilot faced file-exfiltration accusations, price backlash, and head-to-head quality comparisons now favoring Gemini and Claude.
Report
The loudest noise is still AGI takes, but the interesting move this month is dumber: tiny open models on aging GPUs quietly replacing frontier APIs for real workloads.
At the same time, token usage is up roughly 17,000× while token prices fall, so AI bills are exploding even as infra itself gets cheaper.
local is quietly eating frontier
On a dual RTX PRO 6000 rig, Qwen 3.6 clocks around 1600 tokens per second under vLLM at 64-way concurrency. Users report that this open model now carries a significant share of their coding load in editors like VSCodium, materially cutting their manual work.
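As a back-of-envelope reading of that number: if the roughly 1600 tps is aggregate throughput across the 64 concurrent requests listed in Key Events (an assumption; the reports do not say whether the figure is aggregate or per-request), each stream sees about 25 tokens per second. A minimal Python sketch, with a hypothetical helper name:

```python
def per_stream_tps(aggregate_tps: float, concurrency: int) -> float:
    """Average decode speed per request, assuming throughput is split evenly."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    return aggregate_tps / concurrency


# Figures from the report; treating 1600 tps as an aggregate is an assumption.
print(per_stream_tps(1600, 64))  # 25.0
```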
MiniCPM5‑1B, fully open and only a billion parameters, outperforms larger peers on several tasks and is light enough to run on mobile devices. Developers are also keeping older GPUs alive with llama.cpp optimizations, making aging cards surprisingly viable for serious local inference.
Combined with falling 3090 prices and a wafer-scale Cerebras system that folds an NVL72 rack onto one chip, the cost curve is tilting toward owning compute rather than renting it.
ai is not cheap, it's just addictive
Uber’s COO describes tokenmaxxing as increasingly hard to justify, even as the number of tokens processed has grown about 17,000× in four years.
Demand for “machine intelligence” is so elastic that lower token prices simply drive much higher usage, pushing AI bills up instead of down. CFOs are now explicitly discussing how to forecast and buffer these costs, which were originally sold as straightforward efficiency gains.
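The arithmetic behind that is simple: spend scales with usage growth divided by the price drop, so bills shrink only if prices fall by more than usage grows. A rough Python sketch; the 17,000× usage figure is from the report, while the price-drop factors are hypothetical placeholders, since no price figure is given here:

```python
def bill_multiplier(usage_growth: float, price_drop: float) -> float:
    """Ratio of new spend to old spend.

    usage_growth: new token volume / old token volume
    price_drop:   old price per token / new price per token
    """
    if usage_growth <= 0 or price_drop <= 0:
        raise ValueError("both factors must be positive")
    return usage_growth / price_drop


USAGE_GROWTH = 17_000  # from the report

# Hypothetical price drops, for illustration only.
for price_drop in (10, 100, 1_000):
    multiplier = bill_multiplier(USAGE_GROWTH, price_drop)
    print(f"price falls {price_drop:,}x -> spend changes by {multiplier:,.0f}x")
```

Even a hypothetical 100-fold price cut would leave spend about 170 times higher; the bill stays flat only when the price drop matches the usage growth.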
Internally, Microsoft and others report that some AI deployments are already more expensive than human labor, undercutting the early automation-arbitrage narrative.
When an email agent team moved from polling to event-driven wakeups and cut downstream tokens by 91%, it showed that the real cost lever is systems design, not per-token pricing.
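To make the polling-versus-events difference concrete, here is a toy simulation. It is not that team's system: the tick count, the arrival times, and the flat 500 tokens per model call are all made-up values, and `call_model` is a stand-in for a real LLM call.

```python
import queue

TOKENS_PER_CALL = 500  # hypothetical cost of one model invocation
TICKS = 60             # hypothetical number of polling intervals
ARRIVALS = {3, 7, 10, 11, 25, 30, 40, 41, 42, 55, 58, 59}  # ticks with new mail


def call_model(payload: dict) -> int:
    """Stand-in for an LLM call; returns the tokens it consumed."""
    return TOKENS_PER_CALL


def run_polling(arrivals: set, ticks: int) -> int:
    """Wake the agent on every tick, whether or not anything arrived."""
    spent = 0
    for t in range(ticks):
        spent += call_model({"tick": t, "new_mail": t in arrivals})
    return spent


def run_event_driven(arrivals: set, ticks: int) -> int:
    """Wake the agent only when an arrival event is queued."""
    events = queue.Queue()
    spent = 0
    for t in range(ticks):
        if t in arrivals:
            events.put(t)  # e.g. a webhook or push notification
        while not events.empty():
            spent += call_model({"tick": events.get()})
    return spent


polling = run_polling(ARRIVALS, TICKS)
event_driven = run_event_driven(ARRIVALS, TICKS)
print(polling, event_driven, f"{1 - event_driven / polling:.0%} fewer tokens")
```

With these made-up numbers the event-driven loop spends 6,000 tokens against 30,000 for polling; the real saving depends entirely on how often the agent would otherwise wake up with nothing to do.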
rag and agents are beating brute-force context
One team tried to kill their RAG stack after getting a 1M-context model, only to reinstate RAG two weeks later when complex queries began failing.
Field reports now converge on hybrid retrieval—BM25 plus vectors, reranking, and query rewriting—as more reliable for multi-hop questions than dumping everything into a giant context window.
Filtering low-score chunks before answering measurably cuts hallucinations, sometimes more than swapping to a supposedly stronger base model.
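A minimal sketch of two of those pieces, assuming reciprocal rank fusion (RRF) as the way to merge BM25 and vector rankings, and a fixed cutoff on reranker scores. The field reports do not name a specific fusion method or threshold, so the document ids, the scores, and the 0.3 cutoff below are illustrative only; query rewriting is left out.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def filter_low_score(reranked, min_score):
    """Keep only chunks whose reranker score reaches min_score."""
    return [(doc_id, score) for doc_id, score in reranked if score >= min_score]


# Hypothetical retrieval results, best match first.
bm25_hits = ["d3", "d1", "d7"]
vector_hits = ["d1", "d4", "d3"]
fused = rrf_fuse([bm25_hits, vector_hits])

# Hypothetical reranker scores for the fused candidates.
rerank_scores = {"d1": 0.91, "d3": 0.62, "d4": 0.18, "d7": 0.05}
reranked = [(doc_id, rerank_scores[doc_id]) for doc_id in fused]
context = filter_low_score(reranked, min_score=0.3)
print(fused, context)
```

Only the chunks that survive the cutoff are passed to the model, which is the step the reports credit with cutting hallucinations.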
Teams operating over 10 million documents describe fine-tuning as optional compared with getting retrieval freshness, indexing cadence, and schema boundaries right.
The same pattern shows up in agents, where an event-driven design slashes token usage and the hardest debugging task is reconstructing an agent’s beliefs rather than fixing its code.
the jailbreak/oss double helix
Heretic can strip guardrails from Llama 3.3 in under ten minutes, and users have already spawned more than 3,500 decensored variants. In parallel, Cryptex‑OSS ships a browser-based jailbreaking kit loaded with text transforms and attack seeds, turning red-teaming into a point-and-click workflow.
On the “legit” OSS side, frameworks like Mastra and spec-driven tools like Aigon are recreating proprietary agent platforms with open components, while Dlmserve brings diffusion-language-model serving to an RTX 5070-class GPU.
Profitable products like Cursor are being built directly on this OSS ecosystem, showing there is real revenue in open stacks, not just hobbyist experimentation.
But leaks of random user chats from DeepSeek and explicit PII/PHI risk concerns around OpenClaw demonstrate how the same openness widens the security and compliance attack surface.
platform wars: no obvious heir to copilot
Microsoft is pushing Copilot hard across 365, but users accuse it of quietly exfiltrating files and are pushing back on price hikes.
Inside Microsoft, many employees reportedly prefer Claude for effectiveness despite higher costs, while external benchmarks increasingly rate Gemini ahead of Copilot on quality.
Codex’s 5.3 release is strong enough that some developers have cancelled Claude, while Cursor often beats Codex on complex projects thanks to deeper agentic workflows.
At the same time, Antigravity 2.0’s sluggish web-design capabilities and frequent rate limits are driving power users away even on paid plans. xAI’s Grok adds expert mode and fast X search, with a 1.5T foundation model finished and a 0.5T open-sourced version promised, yet the community still thinks it trails leading labs on quality.
What This Means
Underneath the AGI discourse, the stack is reorganizing around small open models, retrieval-heavy systems, and OSS infrastructure, while token economics and jailbreakable safety make the old “cheap, centralized, aligned AI” storyline increasingly fragile. That 99% of executives expect AI-driven layoffs while UC Berkeley Law moves to ban AI for graded assignments is the social mirror of that shift: rapid industrialization colliding with brittle cost models and with accountability and governance gaps.
On Watch
/Princeton’s Conifer project is building a new local inference runtime for Apple Silicon, and if its promised performance gains land, it could reset expectations for laptop-class LLM serving.
/Nvidia’s Pixel Diffusion Decoder and its ComfyUI integration are early tests of post-VAE image decoders, and community adoption will signal where high-res generative image architectures head next.
/UC Berkeley Law’s plan to ban AI for graded assignments by 2026 is an early institutional pushback that could spread across education and professional credentialing.
Interesting
/NuExtract3 can auto-generate extraction templates from natural language for document processing.
/Agyn provides full credential isolation for AI agents in production environments.
/AI agents are making around 1,000 MCP requests daily to monitor AI API pricing and status.
/Hyundai's Atlas robot will be trained on football videos.
/The AI community is pushing for standardized evaluation metrics to make model training more transparent.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.