Labs are quietly running Mythos-class models that crush SWE-Bench and dig up ancient security bugs, while public coding assistants like Claude Code regress and lock people out. Meanwhile, open-weight coders such as GLM-5.1 and Qwen 3.5, plus serious local stacks, are now good enough to power real agents, pushing the interesting questions into RAG 2.0, security agents, and the brittle MCP/graph/memory plumbing underneath.
The most writable stories sit exactly where benchmark hype meets flaky tooling and newly dangerous capabilities.
Key Events
/Claude Mythos Preview found thousands of zero-day vulnerabilities across major OSes and browsers, including a 27-year OpenBSD bug, and will only be available to billion-dollar companies, governments, and select partners.
/GLM-5.1 became the #1 open-source and #3 overall model on SWE-Bench Pro with a 58.4 score, and topped the Vals Index and GDPval-AA rankings.
/Meta Superintelligence Labs launched Muse Spark, a natively multimodal, tool-using, multi-agent model that ranks just behind Gemini 3.1 Pro and GPT-5.4 on the Artificial Analysis Intelligence Index.
/The Model Context Protocol (MCP) surpassed 97 million monthly SDK downloads and 177,000 registered tools, becoming the de facto standard for connecting agents to external systems.
/Alibaba’s HappyHorse 1.0 open-source text-to-video model reached #1 in the Artificial Analysis Video Arena for Text and Image to Video (No Audio), licensed under Apache 2.0.
Report
Public coding assistants are getting flakier right as labs quietly demo Mythos-class models that dig up decades-old zero-days and outscore earlier Claudes on SWE-Bench-style tests.
At the same time, open-weight coders like GLM-5.1 and Qwen 3.5 are suddenly strong enough locally that 'which stack powers my agents?' is a live architectural question, not a foregone conclusion.
the hidden frontier vs tired public coders
Anthropic’s Claude Mythos Preview is quietly operating in a different universe from public Claude: it finds thousands of zero-day vulns across major OSes and browsers, including a 27-year OpenBSD bug and a 16-year FFmpeg flaw, and beats Opus 4.6 on SWE-Bench Pro (77.8% vs 53.4%).
Yet Mythos is explicitly withheld from the public and limited to billion-dollar companies, governments, and select researchers, plus Glasswing partners.
Meanwhile, the tools your audience actually uses—Claude Code and Opus—are being called unusable for complex engineering, with reports of hours-long lockouts and a sharp drop in 'thinking' length from ~2,200 to ~600 characters after recent updates.
AMD’s senior director of AI publicly says Claude has regressed and can’t be trusted on complex engineering tasks, and some users suspect older models are being sandbagged ahead of new releases.
For experienced engineers already leaning on agents as copilots, this is a now story about a widening capability gap between what the labs run and what’s in your editor or API key.
security agents as showcase and attack surface
Security is where agent reality is already weird: Mythos-class models are being wired into Project Glasswing so partners can scan critical software for thousands of zero-days, including ancient bugs in OpenBSD, FFmpeg, and Linux.
Small models have independently rediscovered many of the same vulns, and VulGD is proposing a dynamic vulnerability graph DB to track this landscape, so vuln-hunting becomes a continuous graph process rather than ad-hoc scans.
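The "continuous graph process" framing can be made concrete with a toy model: components and vulnerabilities as nodes, "depends on" and "affects" as edges, and exposure as a reachability query. This is an invented illustration of the idea, not VulGD's actual schema or API.

```python
from collections import defaultdict, deque

class VulnGraph:
    """Toy vulnerability graph: 'depends' edges between components,
    'affects' edges from vuln IDs to components. Illustrative only."""

    def __init__(self):
        self.depends = defaultdict(set)   # component -> components it uses
        self.affects = defaultdict(set)   # vuln ID -> components it hits

    def add_dependency(self, component, uses):
        self.depends[component].add(uses)

    def add_vuln(self, vuln_id, component):
        self.affects[vuln_id].add(component)

    def exposed(self, vuln_id, services):
        # A service is exposed if any component reachable through its
        # dependency chain is affected by the vuln.
        hit = self.affects[vuln_id]
        out = set()
        for svc in services:
            seen, queue = set(), deque(self.depends[svc])
            while queue:
                comp = queue.popleft()
                if comp in seen:
                    continue
                seen.add(comp)
                queue.extend(self.depends.get(comp, ()))
            if seen & hit:
                out.add(svc)
        return out
```

Re-running the reachability query as new dependencies and vulns land is what turns vuln-hunting from ad-hoc scans into a standing process.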
At the same time, personal agents like OpenClaw already run with full local system access, integrate with sensitive services, and are deployed at scale despite unreliable memory and a substantial attack surface.
Research groups are extending Rowhammer to GPUs, making multi-tenant GPU memory a live security concern just as everyone pushes more inference into shared accelerators.
For security-minded agent builders, this is an urgent story: the best real demo of agent power today is vuln-hunting, and it arrives bundled with new, poorly understood threat models.
local coders vs cloud giants
GLM-5.1 is now the #1 open-weights model and #3 globally on SWE-Bench Pro at 58.4, about 95.6% of Claude Opus 2.6’s code-gen competence, and it’s leading GDPval-AA and the Vals Index—all while running locally with full data control.
Users report six-fold database throughput gains (21.5k QPS) from its generated optimizations, which is the kind of concrete win your audience cares about.
Qwen 3.5 27B compiles 100% of backend projects and is ~25× cheaper than competitors, while the 122B variant has become a local king among LLMs for many coders.
On the hardware side, people are running Gemma 4 at 40 tokens/s on an iPhone and 25 tokens/s for the 31B variant on an M-series MacBook, while a 397B-parameter model crawls along at 1.77 tokens/s via NVMe memory extension.
But these open/local stacks still show cracks—GLM-5.1 struggles with long-context coherence, Qwen models can wobble on style consistency and complex multi-subject prompts, and high-throughput multi-user systems still gravitate to vLLM clusters pushing 90–150 tps.
For engineers architecting coding agents and RAG backends, this is a 'this year' story about when local is genuinely enough and where you still reach for GPT/Claude-class APIs or vLLM farms.
rag 2.0: graphs, governance, and poisoned knowledge
Classic 'dump docs into a vector DB and call it RAG' is colliding with its limits: LLMs demonstrably hallucinate on knowledge-heavy tasks and can get worse when you stuff them with more context.
New systems like RefineRAG add word-level refinement to filter poisoned or low-quality snippets, while agentic RAG frameworks introduce planner–retriever–verifier loops instead of single-shot retrieval.
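The planner–retriever–verifier shape can be sketched in a few lines. Everything below is a stand-in (the `plan`, `retrieve`, and `verify` stubs are invented, not RefineRAG's or any framework's real API): a real system would use an LLM for planning, a vector index for retrieval, and a learned scorer for verification.

```python
# Minimal planner-retriever-verifier loop: instead of one-shot retrieval,
# the agent decomposes the question, retrieves per sub-query, and only
# keeps snippets that pass a verification gate. All components are stubs.

def plan(question: str) -> list[str]:
    # A real planner would use an LLM to decompose the question.
    return [question, f"background: {question}"]

def retrieve(query: str, corpus: dict[str, str]) -> list[str]:
    # Naive keyword match standing in for vector search.
    terms = query.lower().split()
    return [doc for doc in corpus.values()
            if any(t in doc.lower() for t in terms)]

def verify(snippet: str, query: str) -> bool:
    # A real verifier would score relevance and filter poisoned text;
    # here we just require some lexical overlap with the sub-query.
    return any(t in snippet.lower() for t in query.lower().split())

def agentic_rag(question: str, corpus: dict[str, str]) -> list[str]:
    kept = []
    for sub_query in plan(question):                      # planner
        for snippet in retrieve(sub_query, corpus):       # retriever
            if verify(snippet, sub_query) and snippet not in kept:  # verifier
                kept.append(snippet)
    return kept
```

The point of the loop is that the verifier sits between retrieval and generation, so poisoned or irrelevant snippets never reach the prompt.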
Security and compliance use cases are driving the shift hardest: MA-IDS combines LLMs with RAG for IoT intrusion detection, Skillware layers MiCA-aware regulatory RAG, and VulGD builds a dynamic vulnerability graph so agents can reason over structured exploit data.
Federated unlearning research is emerging to guarantee models and retrieval layers can actually forget deleted data, while practitioners complain about stale context, duplicate chunks, and institutional knowledge loss when AI is bolted into workflows too quickly.
For teams building serious RAG pipelines—security, finance, ops—this is a near-term story about RAG growing up into graph- and governance-aware retrieval, often backed by plain PostgreSQL or SQLite with hybrid FTS+vector search instead of heavyweight vector DBs.
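The "plain SQLite instead of a vector DB" pattern looks roughly like this: FTS5 handles the keyword leg, a brute-force cosine pass over stored embeddings handles the vector leg, and the two rankings are merged with reciprocal rank fusion. The toy 3-d embeddings stand in for a real embedding model; table names and the fusion constant are arbitrary choices, not a standard.

```python
import json
import math
import sqlite3

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def build_index(docs):
    # docs: list of (text, embedding) pairs; embeddings stored as JSON.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE VIRTUAL TABLE chunks USING fts5(body)")
    con.execute("CREATE TABLE vecs (rowid INTEGER PRIMARY KEY, emb TEXT)")
    for i, (text, emb) in enumerate(docs, start=1):
        con.execute("INSERT INTO chunks(rowid, body) VALUES (?, ?)", (i, text))
        con.execute("INSERT INTO vecs(rowid, emb) VALUES (?, ?)",
                    (i, json.dumps(emb)))
    return con

def hybrid_search(con, query, query_emb, k=3):
    # Keyword leg: FTS5 MATCH, best-first by bm25 (lower is better).
    fts = [r[0] for r in con.execute(
        "SELECT rowid FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks)",
        (query,))]
    # Vector leg: brute-force cosine over every stored embedding.
    rows = con.execute("SELECT rowid, emb FROM vecs").fetchall()
    vec = [rid for rid, _ in sorted(
        rows, key=lambda r: -cosine(query_emb, json.loads(r[1])))]
    # Reciprocal rank fusion of the two ranked lists.
    scores = {}
    for ranking in (fts, vec):
        for rank, rid in enumerate(ranking):
            scores[rid] = scores.get(rid, 0.0) + 1.0 / (60 + rank)
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [con.execute("SELECT body FROM chunks WHERE rowid = ?",
                        (rid,)).fetchone()[0] for rid in top]
```

At small corpus sizes the brute-force vector pass is fast enough that the whole stack stays one file on disk, which is exactly the appeal.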
agent plumbing: mcp, graphs, and automation fatigue
The Model Context Protocol has quietly become the default wiring layer for tool-using agents, with over 97M monthly SDK downloads and 177k registered tools, letting LLMs dynamically discover and call external systems from Outlook to browsers.
But a lot of those servers are weekend projects that fail on first use, so people are wrapping them with things like MCP Action Firewall for human-in-the-loop approval on risky calls and VerifiedState for cryptographically signed cross-tool memory.
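The human-in-the-loop pattern those wrappers describe reduces to a small gate in front of the tool dispatcher: reads pass through, writes wait for an approver. The names and risk policy below are invented for illustration and are not MCP Action Firewall's actual interface.

```python
# Toy human-in-the-loop gate for tool calls: anything matching a risky
# prefix is held until an approver callback says yes. 'call_tool' is the
# real dispatcher; 'approve' is whatever surfaces the prompt to a human.

RISKY_PREFIXES = ("delete", "send", "write", "exec")

def guarded_call(tool_name, args, call_tool, approve):
    if tool_name.startswith(RISKY_PREFIXES):
        if not approve(tool_name, args):
            return {"status": "blocked", "tool": tool_name}
    return {"status": "ok", "result": call_tool(tool_name, args)}
```

The design choice worth noting is that the gate wraps the dispatcher rather than individual servers, so flaky weekend-project MCP servers get the same policy without modification.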
On top of that, LangGraph and LangChain’s agentic backends promise durable runtimes and shared state across multi-step workflows, even as developers wrestle with tool-scope, auth, and prompt-bloat problems when graphs get big.
In parallel, Zapier opened its SDK for building agents while n8n and similar tools see users both building wild AI workflows and complaining about silent failures and overwhelming complexity, spawning a wave of simpler visual builders for local LLMs.
For infra-savvy readers, this is a 'next few months' story about agent runtimes consolidating around MCP + graph engines while a backlash pushes toward thinner, domain-specific automation layers and old-school CLIs for reliability.
What This Means
The hottest action in AI engineering is shifting from 'which model wins' to how to cage increasingly gated, security-sensitive capabilities inside brittle orchestration stacks and imperfect RAG/memory layers. The gap between benchmark heroics and day-to-day agent reliability keeps widening, and that tension is where the most interesting stories for builders now sit.
On Watch
/HappyHorse 1.0’s upcoming API release and ~100GB model size, plus ongoing skepticism about whether it’s 'real,' could make it a flashpoint for local video-generation workflows and benchmark trust.
/Anthropic’s Managed Agents and memory block sharing, combined with tools like memweave and Milla Jovovich’s LongMemEval-perfect memory system, hint at a fast-approaching wave of opinionated, memory-heavy agent platforms.
/Serverless backlash for AI workloads—Lambda dubbed the 'kiss of death,' cold-start and connection issues, but also new features like S3 NFS mounts and Claude Code plugins—may reshape where teams actually run agents and inference.
Interesting
/Claude Mythos is suspected to be a looped language model, which may provide advantages in tasks like graph search compared to standard models.
/The "spiky" specialization in agentic coding creates friction when switching between models, leading to lost context.
/There is a growing consensus that the future of AI model differentiation will focus on memory management and context handling rather than just raw performance.
/Llama 3.1 8B shows strong performance with single document queries but struggles with multi-document queries, highlighting its limitations.
/A structured test suite evaluated 225 prompt injection attacks across five modalities, highlighting the complexity of multimodal injection detection.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.