Open and local models like DeepSeek R2 and Qwen 3.6 are now flirting with GPT-4-level performance, while efficiency tricks like Token Superposition Training and NVFP4 are compressing the cost and time between generations. At the same time, AI-first coding and agent stacks are dumping huge amounts of brittle, insecure automation into codebases just as Mythos-class models start to automate serious cyberattacks.
The real choke point is no longer raw model IQ; it is the security, infra, and memory systems we're bolting around these models.
Key Events
/DeepSeek R2 was open-sourced and now matches GPT-4o on 9 of 12 benchmarks, including strong coding performance.
/Anthropic's Mythos Preview became the first model to solve the UK AI Security Institute's cyber ranges end-to-end and was used to create a public Apple M5 kernel exploit.
/A mass npm supply-chain attack compromised over 170 packages, including TanStack and Mistral AI, via GitHub Actions cache poisoning and the 'mini Shai-Hulud' malware.
/Senators Sanders and Ocasio-Cortez introduced a bill to pause AI data center construction, intersecting with more than 300 related local bills.
/Hermes Agent became the most popular AI agent project on GitHub, surpassing 140,000 stars in under three months.
Report
The strangest signal this month is that open and local models are quietly landing GPT-4-class scores while a niche cyber model like Mythos learns to tear through real systems, far from the usual GPT-5.x headline war.
Underneath that, efficiency hacks, agent stacks, and some very human plumbing problems around security and memory are starting to matter as much as raw model IQ.
open models at the frontier, with governance lagging
DeepSeek R2 is open-source and matches GPT-4o on 9 of 12 benchmarks, putting it effectively side-by-side with a leading closed model on many public leaderboards.
On HumanEval coding tasks it scores 93.2, which is firmly in GPT-4-class territory for programming.
Qwen 3.6 35B running locally has generated a complete playable game from scratch in under 17 minutes, without external fixes.
With Multi-Token Prediction on tuned hardware, the 27B variant has been pushed to high token-throughput and has beaten Gemma 4 on tool-calling reliability in automation tests.
Kimi K2.6 is a large mixture-of-experts model that activates about 32 billion parameters per token and now tops OpenRouter’s programming leaderboard by weekly usage, even as CAISI reports that open models overall are sliding behind American frontier systems on long-horizon reasoning tests.
efficiency hacks and token economics are bending the curve
Token Superposition Training changes the pretraining loop so each position can carry multiple tokens, with early reports of roughly two-to-three-times faster pretraining without changing architectures.
For local inference, Multi-Token Prediction in llama.cpp lets a Qwen 27B model jump from single-digit speeds to more than 16 tokens per second on the same hardware, with acceptance rates around ninety percent on Qwen.
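A rough way to see why a roughly ninety percent acceptance rate matters: if each drafted token is accepted independently with probability p and the model drafts k extra tokens per pass, the expected number of tokens emitted per verification pass is (1 - p^(k+1)) / (1 - p). The sketch below is back-of-the-envelope arithmetic under that independence assumption, not a description of how llama.cpp implements Multi-Token Prediction; the function name and the draft lengths are hypothetical, and drafting overhead is ignored.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    `accept_rate` (a simplification) and that the pass always yields at
    least one token from the main model.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every drafted token is accepted, plus the one guaranteed token.
        return float(draft_len + 1)
    # Sum of the geometric series 1 + p + p^2 + ... + p^draft_len.
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    # Hypothetical draft lengths; 0.9 is the acceptance rate quoted above.
    for k in (1, 2, 3, 4):
        print(k, round(expected_tokens_per_step(0.9, k), 3))
```

The real-world speedup will be lower than this ratio, since drafting and verification both cost time.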
NVIDIA’s NVFP4 format has already been used to train a 12-billion-parameter language model using that low-precision scheme and to support far larger systems like the 120-billion-parameter Nemotron 3 family.
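For readers who want a feel for what a 4-bit block-scaled format does to numbers, here is a minimal sketch. It assumes the publicly described shape of NVFP4: values rounded to a 4-bit E2M1 grid, with small blocks sharing one scale factor. The block size of 16 and the grid below reflect that public description as best understood here, while keeping the scale as an ordinary Python float (rather than a low-precision scale plus a tensor-level scale) is a simplification. All function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    """Round one block to the E2M1 grid using a shared scale, then dequantize."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    out = []
    for v in values:
        # Pick the nearest representable magnitude in scaled units.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(v) / scale - m))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def fake_quantize(values, block=16):
    """Apply quantize_block to consecutive blocks (block size is an assumption)."""
    out = []
    for start in range(0, len(values), block):
        out.extend(quantize_block(values[start:start + block]))
    return out
```

Because each block carries its own scale, one outlier only coarsens the values that share its block, which is the intuition behind block-scaled low-precision formats.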
On the spend side, the creator of OpenClaw reports burning about 1.3 million dollars on OpenAI tokens in roughly a month, showing how quickly heavy usage can turn into a compute bill problem.
Pinecone’s Nexus layer claims up to ninety percent token reductions, and a UK finding puts the capability doubling time for models like GPT-5.5 at roughly 4.5 months. Both turn cost and capacity into moving targets rather than fixed ceilings.
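To put the 4.5-month doubling figure in annual terms, a quantity that doubles every d months grows by a factor of 2^(t/d) after t months. The snippet below is plain arithmetic on the reported number and assumes the doubling rate holds steady, which the source does not claim; the function name is hypothetical.

```python
def growth_factor(months: float, doubling_months: float = 4.5) -> float:
    """Multiplicative growth after `months`, given a fixed doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Growth over one and two years under a constant 4.5-month doubling time.
    for horizon in (12, 24):
        print(horizon, round(growth_factor(horizon), 2))
```

Under that assumption, one year corresponds to a bit more than a sixfold increase.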
agents and ai-first coding: the new os is brittle
Airbnb reports that roughly sixty percent of its new code is written by AI tools. Google says models now generate about three quarters of its new code and Microsoft reports a share around thirty percent, while Mistral’s founder claims engineers there no longer write code themselves.
Agentic stacks are forming around that reality: Zerostack’s Unix-inspired Rust agent, xAI’s Grok Build with subagents and skills, Claude Code assembling context from multiple sources, Hermes Agent’s three-tier memory, and graph frameworks like LangGraph all assume teams will orchestrate swarms of tools and models rather than call a single API.
Early adopters report median productivity gains around seventy-one percent for companies using agentic AI, and GitHub is piloting Copilot as a standalone app while some firms literally mandate daily use with leaderboards, even as bots that listen to meetings and auto-open pull requests appear.
Security and quality signals are flashing at the same time, with scanners finding vulnerabilities in ninety percent of public GitHub repos built with certain tools, AI-first teams treating PRs as rubber stamps, and data-science workflows where AI-generated analyses are often wrong even as humans move into reviewer-only roles.
mythos and the first real ai cyber inflection
Anthropic’s Mythos Preview has already been used to find a curl bug and a FreeBSD vulnerability and to create what appears to be the first public macOS kernel memory-corruption exploit on Apple’s M-series systems. OpenAI’s Daybreak and Google’s confirmation of an AI-crafted zero-day exploit show the same pattern across both defensive and offensive tooling.
Elite researchers report that with Mythos they produced that Mac exploit in five days, bypassing an Apple Memory Integrity Enforcement project that reportedly took five years and billions of dollars.
In UK AI Security Institute tests, Mythos successfully completed a 32-step corporate network attack scenario. Across repeated runs it succeeded in six of ten attempts on that range, which the institute cites in calling it the first model to solve their cyber challenges end-to-end.
In separate evaluations it also pulled off 18 of 41 n-day exploits, a hit rate that puts it well beyond earlier models in automating real-world vulnerability chains.
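Success counts like six of ten and 18 of 41 come from small samples, so the underlying rates are fairly uncertain. One hedged way to read them is a Wilson score interval, sketched below. It treats each attempt as an independent trial with the same success probability, which is an assumption about the evaluations and not something the institute reports; the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Counts quoted above: 6 of 10 range runs, 18 of 41 n-day exploits.
    for k, n in ((6, 10), (18, 41)):
        lo, hi = wilson_interval(k, n)
        print(f"{k}/{n}: {lo:.2f} to {hi:.2f}")
```

The wide interval on ten runs is a reminder that the headline rate could move a lot with more attempts.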
rag, memory, and knowledge plumbing as the real bottleneck
While everyone argues about which base model is smarter, most real-world Retrieval-Augmented Generation systems are still confidently wrong much of the time. Stale repository snippets, heterogeneous and decaying documents, and bad chunking make answers diverge from ground truth even when the LLM is capable.
Developers report that plain grep can beat semantic search for many agentic workflows, and lightweight RAG bots over PyTorch and Hugging Face docs show quality that depends more on retrieval hygiene than on which frontier model is used.
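The grep-style retrieval developers describe can be as small as a regex scan that returns matching lines with their file name and line number, which an agent can then cite or open. The sketch below is illustrative only: it searches an in-memory mapping of file names to text, and the function name and sample documents are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def grep_docs(
    docs: Dict[str, str], pattern: str, ignore_case: bool = True
) -> List[Tuple[str, int, str]]:
    """Return (file name, 1-based line number, stripped line) for each match."""
    flags = re.IGNORECASE if ignore_case else 0
    rx = re.compile(pattern, flags)
    hits = []
    for name, text in docs.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((name, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical two-file corpus standing in for a docs folder.
    sample = {
        "compile.md": "Intro\nWrap the model with torch.compile before training.\n",
        "data.md": "Use a DataLoader with pinned memory.\n",
    }
    for hit in grep_docs(sample, r"torch\.compile"):
        print(hit)
```

Exact matches like this are easy to debug, since every hit points at a specific line, which is part of why retrieval hygiene can matter more than the choice of model.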
Toolmakers are responding with correctness-aware context hygiene frameworks, δ-mem for online memory, and systems like GBrain that store knowledge as markdown files instead of embeddings, while the Agent Memory Protocol tries to standardize how agents share and persist long-term state.
On the personal-knowledge side, Obsidian-plus-LLM workflows and local Qwen nodes used as private notebooks show that many power users care more about controllable, debuggable memory than about squeezing out another benchmark point.
Long-term memory remains a weak spot for current LLMs, with reports of stale facts and degradation over long sessions even as experiments like Emergence World show agents in simulated towns writing and breaking laws over days at a time.
What This Means
Model capability and cost curves are now outrunning the security, governance, and knowledge plumbing wrapped around them, so the interesting frontier is shifting from "how smart is the model" to "what kind of brittle software, infra, and institutions we are wiring it into." The consensus that the next big story is just "GPT-5 arrives" misses that OSS, agents, cyber capabilities, and memory systems are already reshaping the landscape underneath that headline.
On Watch
/The Sanders–AOC push to pause AI data center construction, combined with local backlash over a facility draining tens of millions of gallons of water, is turning infrastructure footprint and water usage into a prime policy lever on AI scaling.
/On-device multimodal assistants like NeuralCompanion, Supertonic, and OmniVoice show that fully local LLM+TTS+STT stacks with sub-200ms latency and hundreds of languages are now practical, which could quietly pull assistant workloads off the cloud.
/LangChain, LangGraph, and SmithDB-style tooling are racing to add observability, policy layers, and EU AI Act-friendly audit trails for agents and RAG, suggesting that "agent governance" could solidify into its own mini-stack.
Interesting
/DeepSeek V4 Flash's 210B model performed comparably to models four times its size in benchmarks, showcasing its efficiency.
/In three weeks, ml-intern exchanged 1 million messages, equating to 3.3 agent-years of ML research, demonstrating the model's extensive usage.
/The Gemini Pro model is rumored to be a GPT-5.5-level coding model priced at $12 per million output tokens, which would make it more cost-effective than its competitors.
/A user claims that GPT-5.5 outperforms Mythos in cybersecurity tasks, raising questions about competitive capabilities.
/FutureSim, an environment from Max Planck Institute researchers, lets models predict future events, with GPT-5.5 outperforming human aggregates.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.