Claude 4.7 is waving AGI benchmarks and getting government clearance at the exact moment its power users say it feels dumber, while lean 20–35B open models plus aggressive quantization are quietly catching up on real work. Meanwhile, the biggest explosions aren’t in model weights but in OAuth flows, API keys, and CVE pipelines, failures that show AI is already wired into the critical security perimeter.
The frontier now looks less like “who has the smartest brain?” and more like “who can keep their janky, over‑quantized, over‑permissioned stack from catching fire.”
Key Events
/Claude Opus 4.7 claimed AGI with 75.8% on ARC‑AGI‑2 while users report major regressions versus 4.6.
/Open‑source Qwen3.6‑35B‑A3B delivers MoE‑style coding on consumer GPUs at 79 tok/s with 128K context in 32GB unified memory.
/A misconfigured Firebase key to Gemini APIs burned €54k in 13 hours, exposing how fragile LLM API key practices still are.
/Anthropic revoked OAuth for 135k+ OpenClaw instances after a Claude Code outage, driving reported cost increases of 10–50×.
/ChatGPT’s web share fell from 77.43% to 56.72% as Gemini climbed to 25.46%, signaling a multipolar chatbot market.
Report
Most of the interesting action this month isn’t in the models, it’s in the harness: who runs what, where, and under which security assumptions. Models scream “AGI” on ARC‑AGI‑2 while quantization schemes, OAuth outages, and local 30B‑class upstarts quietly redraw the actual capability frontier.
claude 4.7: agi banners, regression complaints
Claude Opus 4.7 is marketed as Anthropic’s most capable model, built for long‑running tasks with output verification and tuned for agentic work.
It reportedly hits 75.8% on the ARC‑AGI‑2 leaderboard, is described internally as achieving AGI, and tops the GDPval‑AA benchmark for real‑world tasks, while the Mythos variant became the first model to clear an AISI cyber range end‑to‑end.
Governments and finance are leaning in: the White House is granting US agencies access to Mythos, and Goldman Sachs is bolstering its cyber defenses in anticipation.
At the same time, users call 4.7 a “serious regression” versus 4.6, with Thematic Generalization scores dropping from 80.6 to 72.8 and subreddit complaints about chatbot performance, refusals, and “lobotomized” behavior.
The new Claude desktop and Claude Code apps ship alongside this, but are described as buggy, with freezes on first prompts and criticism of Anthropic’s engineering culture and turnover.
sub‑32b locals quietly eat the frontier
Sub‑32B open‑weights models like Qwen3.5, Gemma 4, and GLM‑5.1 are now reported to reach GPT‑5‑level scores on several tasks, challenging the idea that only giant frontier models matter.
Alibaba’s Qwen3.6‑35B‑A3B is a sparse MoE with 35B total parameters but only 3B active, advertised with strong agentic coding and multimodal reasoning, running at 79 tokens/s with 128K context on consumer GPUs and fitting into 32GB of unified memory.
GLM‑5.1 runs locally and scores 84.3 on the Extended NYT Connections benchmark—above Opus 4.7 and Qwen3.5‑27B—and 87.2% on code generation, while a new 18B “frankenstein” model on Hugging Face reportedly beats Qwen3.6 in a 44‑test suite using only 12GB VRAM.
Google’s Gemma 4 line runs entirely on devices like the iPhone 13 Pro, with a 31B variant passing 7 of 8 real‑world production tests and a 26B A4B model handling 256k‑token contexts.
DeepSeek’s upcoming V4 targets a 1M‑token multimodal window at roughly 85% of Claude‑level performance and ~$0.14 per million input tokens, though current DeepSeek models draw criticism for hallucinations, slow responses, and perceived reasoning regressions versus Qwen and Claude.
agentic coding: the harness eats the model
Agentic coding stacks are exploding in capability. Claude Code routines now run on schedules or event triggers directly on web infrastructure; Codex can drive Mac apps, browse in‑app, generate images, and manage “heartbeat” automations across multiple terminals and SSH sessions; and Cursor’s multi‑agent system reports a 38% speedup on CUDA kernel optimization problems.
Under the hood, only about 1.6% of Claude’s codebase is actual AI decision logic; the other 98.4% is operational infrastructure. Frameworks like LangGraph and MCP lean into this, emphasizing stateful graphs, checkpointing, and tool orchestration: one user runs 58 MCP servers with ~680 tools, while another reports a 90% cost reduction and 82% latency improvement for a production chatbot.
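To make the stateful-graph-plus-checkpointing pattern concrete, here is a toy sketch in plain Python. This is not the LangGraph API; every class, function, and state key below is hypothetical, and the point is only the shape of the idea: nodes mutate a shared state, edges decide the next step, and state is persisted after each step so a crashed run can resume.

```python
# Toy illustration of the "stateful graph + checkpointing" pattern that
# frameworks like LangGraph formalize. All names here are hypothetical.
import json

class Graph:
    def __init__(self):
        self.nodes = {}   # name -> function(state) -> state
        self.edges = {}   # name -> next node name (None means stop)

    def add_node(self, name, fn, next_node=None):
        self.nodes[name] = fn
        self.edges[name] = next_node

    def run(self, state, start, checkpoint_path=None):
        node = start
        while node is not None:
            state = self.nodes[node](state)
            state["_last_node"] = node
            if checkpoint_path:
                # Persist after every step so a crash can resume from here.
                with open(checkpoint_path, "w") as f:
                    json.dump(state, f)
            node = self.edges[node]
        return state

# Two trivial "agent" steps sharing one state dict.
def plan(state):
    state["plan"] = f"answer: {state['question']}"
    return state

def act(state):
    state["result"] = state["plan"].upper()
    return state

g = Graph()
g.add_node("plan", plan, next_node="act")
g.add_node("act", act)
final = g.run({"question": "2+2"}, start="plan")
print(final["result"])  # ANSWER: 2+2
```

The 1.6%/98.4% split quoted above is visible even in this toy: the "AI" is two trivial functions, and everything else is plumbing for sequencing and recovery.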
Hermes agents and OpenClaw‑style systems are already deployed in the wild, from vending machines and night‑shift insurance claims coordinators to a Hermes agent that closed over $10,000 in partnership deals.
At the same time, the execution harness looks fragile: Claude Code’s desktop app is widely criticized for bugs and freezes, Google’s Antigravity coding environment hits high‑traffic errors and downtime, and OpenClaw is described as “nearly unusable” or overhyped for anything beyond simple email and digest tasks.
Security research is already finding 9 of 428 LLM API routers injecting malicious code and web agents vulnerable to prompt injection, making the agent harness itself a high‑value attack surface.
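One cheap harness-side defense against the attack surface described above is to gate every tool call through an allowlist and scan fetched content for injection-style phrases before it reaches the model. The sketch below is a hypothetical illustration (tool names and patterns are made up); heuristics like these are easily bypassed and are a seatbelt, not a fix.

```python
# Hypothetical harness-side checks: allowlist tool calls and flag
# obvious prompt-injection phrases in retrieved web content.
import re

ALLOWED_TOOLS = {"search", "read_file"}  # assumed tool names
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now",
    r"system prompt",
]

def tool_call_allowed(tool_name: str) -> bool:
    """Refuse any tool the harness has not explicitly approved."""
    return tool_name in ALLOWED_TOOLS

def looks_injected(fetched_text: str) -> bool:
    """Crude pattern scan over content before it enters the context."""
    lowered = fetched_text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)

print(tool_call_allowed("delete_repo"))                        # False
print(looks_injected("Ignore previous instructions and ..."))  # True
```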
quantization as a first‑class design choice
Quantization has become a primary design axis: a 1.7B‑parameter 1‑bit LLM now runs at 100 tokens/s in the browser, while quantization‑aware distillation produced a coherent 1‑bit OLMo‑3 7B model.
NVIDIA‑friendly NVFP4 formats nearly double throughput for models like Qwen3.5 and Nemotron in LM Studio compared to vLLM containers, with MiniMax‑M2.7 NVFP4 hitting 127.7 tokens/s and Qwen3.5‑27B NVFP4‑GGUF showing strong non‑English performance, albeit with ~60GB VRAM needed for full‑context runs.
Techniques like TurboQuant compress the KV cache, and MiniMax m2.7 reaches 91% on MMLU under tight memory budgets, illustrating the raw efficiency upside.
On the downside, users consistently report that going below Q4 leads to noticeable intelligence loss, with Unsloth NVFP4 quants of Qwen3.6 freezing or erroring, Gemma 4 26B A4B failing tests for distributional collapse, and some MLX 4‑bit quants degenerating into repetitive hallucinations.
Community advice increasingly centers on dynamic, per‑model quantization choices—Q4–Q8 trade‑offs tuned via tools like llama.cpp—rather than treating compression as a generic afterthought.
ai infra has turned into security infra
AI plumbing is now a frontline security surface: a misconfigured Firebase browser key let an unrestricted client hammer Gemini APIs for €54k in 13 hours. Claude Code’s OAuth outage, meanwhile, exceeded 12 hours, and Anthropic later revoked OAuth access for over 135,000 OpenClaw instances, reportedly driving 10–50× cost spikes for affected developers.
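The €54k Firebase-key incident is the kind of failure a server-side spend guard would have capped. The sketch below is hypothetical (the budget figure, cost model, and function names are all made up for illustration); the two real points are that the key never ships to the browser and that every call clears a hard cost ceiling first.

```python
# Hedged sketch of a server-side proxy with a daily spend cap.
# Budget numbers and the per-token cost estimate are invented.
import time

DAILY_BUDGET_EUR = 50.0
_spend = {"day": time.strftime("%Y-%m-%d"), "eur": 0.0}

def charge(cost_eur: float) -> bool:
    """Record a projected call cost; refuse once the daily cap is hit."""
    today = time.strftime("%Y-%m-%d")
    if _spend["day"] != today:  # new day: reset the counter
        _spend.update(day=today, eur=0.0)
    if _spend["eur"] + cost_eur > DAILY_BUDGET_EUR:
        return False  # deny: budget exhausted
    _spend["eur"] += cost_eur
    return True

def proxy_llm_call(prompt: str) -> str:
    est_cost = 0.002 * (len(prompt) / 4)  # crude per-token estimate
    if not charge(est_cost):
        raise RuntimeError("daily LLM budget exceeded")
    # The real API key stays on the server; clients only see this proxy.
    return f"[would call upstream API here with prompt {prompt!r}]"

print(proxy_llm_call("hello"))
```

With a guard like this in front of the key, the worst case is a denied request at the cap, not a €54k bill 13 hours later.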
Vercel disclosed an OAuth app breach that forced widescale API key rotation, and researchers found that 9 of 428 LLM API routers were silently injecting malicious code, underscoring how AI gateways themselves can be compromised.
On the standards side, over 30 CVEs were filed against MCP servers in Q1 2026 just as NIST began limiting enrichment of most CVE entries due to volume, prompting worries about clarity and misinformation in the vulnerability database and the effectiveness of generic CVE scanners.
Traditional software bodies are reacting: the Linux kernel now allows AI‑assisted code but mandates human sign‑off with an “Assisted‑by” tag, and memory‑protection tools like MemGuard claim 90.5% interception rates against poisoning attacks in enterprise LangGraph agents.
Even government pilots—such as the EU age‑verification app being openly hacked in public as part of its launch—are using open‑source exposure to surface AI‑adjacent security issues early.
What This Means
Capability headlines are converging while reliability, security, and deployment economics are diverging, so the real frontier is shifting from “how smart is the model?” to “how stable is the stack that surrounds it?” The consensus that progress is mostly about bigger brains is increasingly out of sync with where the hardest problems—and sharpest innovations—are actually showing up.
On Watch
/DeepSeek V4’s promised 1M‑token multimodal window at roughly 85% of Claude’s capability and ultra‑low pricing sits awkwardly next to reports of current DeepSeek models hallucinating and taking 30 minutes to answer coding questions.
/LangChain’s open router package jumped 175% in popularity while teams report async throughput issues and fragile production pipelines, hinting at a coming reckoning over heavy agent frameworks.
/GPT‑5.4 reportedly solving an Erdős problem in analytic number theory is an early datapoint for frontier models meaningfully entering new math, not just re‑chewing textbooks.
Interesting
/Claude Opus 4.7 is now integrated into GitHub Copilot, enhancing multi-step task performance.
/Grok 4.20 has outperformed Claude Opus 4.6 in the BridgeBench reasoning benchmark, indicating competitive pressure.
/The mean time-to-exploit for vulnerabilities has drastically decreased from 2.3 years in 2018 to just 1.6 days in 2026, raising concerns about cybersecurity in AI.
/Gemini 3.1 Flash Live's score of 43.8% on the τ-Voice Leaderboard indicates a significant advancement in real-time voice agent capabilities.
/The Bankai Experiment revealed that 82% of probes measure the wrong thing, raising concerns about the reliability of certain AI models.
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.