TL;DR
Local LLM stacks like vLLM + Qwen are now fast enough on decent GPUs to compete with hosted APIs, but they're fragile and very sensitive to hardware and config. RAG is winning over giant context windows in real apps, while databases and AI-native stores race to bolt on vectors and embeddings.
Meanwhile, unsupervised agents are hitting prod data and APIs and driving big, hard-to-predict token bills on top of already-spiky cloud costs and security risks, especially on AWS.
Key Events
Report
Big shifts this cycle: local LLM stacks like vLLM + Qwen are now fast enough on prosumer GPUs to be a real infra choice, while RAG is quietly beating giant-context prompting in actual workloads.
At the same time, agents are hitting production data and APIs without guardrails, and the token bill is starting to look like a second cloud bill.
Qwen 3.6 under vLLM is pushing around 1,800 tokens/sec on a dual RTX PRO 6000 rig, which is firmly server-grade throughput for local inference. Other users on similar hardware report only about 60 tokens/sec, and some hit weight-key errors or silent failures when loading models, which shows how sensitive this stack is to exact setup.
Qwen 3.6 is also emerging as a favorite local MoE model for agentic use, with reports that it materially reduces coding workload when wired into editors like VSCodium.
On the lower-level side, llama.cpp is about to ship a split-mode tensor fix expected to give roughly a 35% performance boost while addressing VRAM exhaustion crashes, and tuning flags like --n-cpu-moe have already doubled throughput for some Qwen 3.6 35B setups.
There is also a clear hardware pattern: for local inference, people are favoring 256 GB of slower RAM over 128 GB of faster RAM to fit bigger models and KV caches, plus emerging runtimes like Princeton's Conifer focused on Apple Silicon.
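To make the memory pressure concrete, here is a rough sizing sketch using the standard key/value cache estimate for a transformer with grouped-query attention. The model dimensions in the usage note are hypothetical and are not taken from any Qwen release; the estimate also ignores weights, activations, and runtime overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a single sequence.

    The leading 2 covers one key tensor and one value tensor per layer.
    bytes_per_elem=2 assumes an fp16/bf16 cache; a quantized cache is smaller.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def to_gib(n_bytes):
    """Convert bytes to GiB for easier comparison with RAM and VRAM sizes."""
    return n_bytes / 1024**3
```

For a made-up model with 48 layers, 8 KV heads, and a head dimension of 128, a 131,072-token context works out to to_gib(kv_cache_bytes(48, 8, 128, 131072)) = 24.0 GiB for the cache alone, which is the kind of arithmetic that pushes people toward capacity over memory speed.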
A team that swapped their RAG pipeline for a 1M-token-context V4-Pro model ended up reinstating RAG two weeks later because the big-context model struggled with complex queries.
Multiple reports show that bad retrieval (irrelevant docs, stale indexes) is the main source of hallucinations, and that simply filtering out low-score chunks reduces hallucination rates more effectively than changing the base model.
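As a minimal sketch of the low-score filtering idea, assuming the retriever returns a similarity score where higher means more relevant: the Chunk type, the 0.5 threshold, and the cap of 5 chunks are illustrative placeholders, not values from the reports.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # retriever similarity; assumed higher is better


def filter_chunks(chunks, min_score=0.5, max_chunks=5):
    """Drop chunks below min_score, then keep the best max_chunks."""
    kept = [c for c in chunks if c.score >= min_score]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:max_chunks]
```

If nothing clears the threshold the function returns an empty list, which gives the caller a natural point to answer "not found" instead of prompting the model with weak context. The right threshold depends on the embedding model and would need tuning per corpus.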
RAG stacks are getting more structured: hybrid BM25 + vector retrieval and reranking/query-rewriting are becoming standard patterns for large corpora and multihop questions.
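One common way to combine a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF). The reports do not say which fusion method teams use, so treat this as one plausible sketch; the document IDs in the usage note are made up, and k=60 is the conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

With bm25 = ["d1", "d2", "d3"] and vector = ["d3", "d1", "d4"], the fused order is d1, d3, d2, d4: the two documents found by both retrievers come first. A reranker or query rewriter would sit after or before this step respectively.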
Tooling is catching up too, with NuExtract3 (a 4B vision-language model) doing high-quality document-to-Markdown extraction, SQLiteGraph adding HNSW vector search for embedded graph workloads, and RagBucket packaging retrievers and indexes into portable .rag artifacts.
Teams building enterprise RAG over 10 million-plus documents are reporting that retrieval quality and trust issues, not model size, are the real bottlenecks, and many are skipping fine-tuning entirely in favor of better retrieval and prompting.
PostgreSQL v14 is quietly handling an on-prem 500 billion-row time-series workload, with performance hinging more on basic architecture choices like sharding than on exotic tooling.
At the same time, the community is calling out that a green backup checkmark does not mean recoverability, which is why an open-source Database Resilience Platform just launched to actually test restores instead of just logging backup success.
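The difference between "backup succeeded" and "restore works" can be shown with a small restore drill. This sketch uses SQLite from the Python standard library purely as a stand-in; it is not how the Database Resilience Platform works, and the function name and row-count check are assumptions chosen for illustration.

```python
import sqlite3


def restore_drill(source_path, backup_path, table):
    """Back up a SQLite database, then reopen the copy and verify it.

    `table` must be a trusted identifier; it is interpolated into SQL.
    """
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)  # the step a green checkmark usually reports
        expected = src.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    finally:
        dst.close()
        src.close()

    # The actual drill: open the backup cold and confirm it is usable.
    restored = sqlite3.connect(backup_path)
    try:
        integrity = restored.execute("PRAGMA integrity_check").fetchone()[0]
        actual = restored.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    finally:
        restored.close()
    return integrity == "ok" and actual == expected
```

A production drill would restore onto separate hardware and run application-level queries, but even this much catches a backup file that exists yet cannot be read.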
On the AI-heavy side, Oracle's 26ai release lets you run LLMs and embeddings directly in the database with hybrid vector+keyword search and JSON Relational Duality views, while SynapCores is pushing an AI-native database that fuses vector, graph, SQL, AutoML, and LLM features.
SQLiteGraph is bringing HNSW vector search into embedded setups, and IA-SQL is wiring Postgres to auto-compile documents into wiki-style content using LLMs.
Underneath all this, open-source databases are now the industry default over proprietary systems. Tools like Supabase and PlanetScale ride Postgres-compatible stacks, though some teams still move back to plain Postgres when custom functions and complex SQL show up.
AI agents are increasingly wired directly into databases, email systems, and payment APIs, often running unsupervised and without proper audit logs, which makes post-hoc debugging more about reconstructing the agent's beliefs than reading code.
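One way to get a verifiable record of agent actions is an append-only log where each entry includes the hash of the previous one, so later edits are detectable. This is a minimal in-memory sketch; the AuditLog class, its field names, and the example agent and action names are all hypothetical, and a real deployment would also need timestamps and durable storage.

```python
import hashlib
import json


def _digest(body):
    """Stable SHA-256 over the entry body (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    GENESIS = "0" * 64  # placeholder hash for the first entry

    def __init__(self):
        self.entries = []

    def record(self, agent, action, params):
        """Append one action, chained to the previous entry's hash."""
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = {"agent": agent, "action": action, "params": params, "prev": prev}
        entry = dict(body, hash=_digest(body))
        self.entries.append(entry)
        return entry

    def verify(self):
        """Return True only if no entry was altered, removed, or reordered."""
        prev = self.GENESIS
        for entry in self.entries:
            body = {k: entry[k] for k in ("agent", "action", "params", "prev")}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True
```

Calling log.record("mailer", "send_email", {"to": "a@example.com"}) before the agent touches the email API means the record exists even if the call later fails, and log.verify() turns False if anyone rewrites an earlier entry.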
The accountability gap here is widening faster than the capability gap, with few verifiable records of what actions agents actually took despite growing reliance on them.
Meanwhile, token usage has exploded roughly 17,000× over the last few years even as per-token prices dropped, and CFOs are already struggling to forecast AI bills driven by this tokenmaxxing behavior.
Instrumentation from MCP-based stacks shows that a small subset of tools accounts for about half of agent spend, with web search consistently the priciest operation, and some email agents cut downstream token usage by 91% just by waking on events instead of polling.
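The "small subset of tools accounts for about half of spend" finding is the kind of thing a few lines of aggregation can surface. This sketch assumes each tool call is already logged as a (tool name, tokens used) pair; the tool names and numbers in the usage note are invented for illustration and do not come from the instrumentation described above.

```python
from collections import Counter


def spend_by_tool(events):
    """Sum token usage per tool from (tool_name, tokens) pairs."""
    totals = Counter()
    for tool, tokens in events:
        totals[tool] += tokens
    return totals


def tools_covering(totals, share=0.5):
    """Return the top tools, largest first, that together reach `share` of spend."""
    grand_total = sum(totals.values())
    covered, picked = 0, []
    for tool, tokens in totals.most_common():
        if covered >= share * grand_total:
            break
        picked.append(tool)
        covered += tokens
    return picked
```

Given events like [("web_search", 600), ("read_file", 100), ("web_search", 200), ("send_email", 100)], tools_covering(spend_by_tool(events)) returns ["web_search"], which points at the single tool worth rate-limiting or caching first.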
There is also a steady stream of reports from companies discovering that AI implementations are more expensive than the human workflows they were meant to replace once infra, data center demand, and agent debugging overhead are fully counted.
A scan of Terraform state at one org turned up 41 live AWS access keys checked into 900 state files, which is about as bad as credential sprawl gets.
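A scan like this can be approximated with a pattern match over state files. The sketch below looks for AWS access key IDs, which begin with AKIA (long-term keys) or ASIA (temporary credentials) followed by 16 uppercase letters or digits. The directory layout and function names are assumptions, and a match only shows that a key ID is present, not that the key is still live.

```python
import re
from pathlib import Path

# AKIA = long-term IAM access key ID, ASIA = temporary STS credential.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")


def find_key_ids(text):
    """Return the distinct access key IDs found in a blob of text."""
    return sorted(set(ACCESS_KEY_ID.findall(text)))


def scan_state_files(root):
    """Map each *.tfstate file under root to the key IDs it contains."""
    hits = {}
    for path in Path(root).rglob("*.tfstate"):
        found = find_key_ids(path.read_text(errors="ignore"))
        if found:
            hits[str(path)] = found
    return hits
```

Running find_key_ids on a state fragment containing AWS's documented example ID AKIAIOSFODNN7EXAMPLE returns that ID. Secret access keys have no fixed prefix, so finding them would need a different heuristic.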
Many AWS users are reporting surprise cost spikes and general confusion over billing and account boundaries, to the point where teams are building their own tools just to track daily spend and resource drift.
The ECS CLI was officially archived in November 2025 along with six container services that were shut down or reached end-of-life, underlining how brittle some of AWS's higher-level container abstractions have been over the last few years.
At the same time, AWS is pushing an open-source agent harness SDK intended to make it easy to build production-ready agents on top of its platform, even as network engineers pile into AWS certifications to pivot into cloud roles.
The common thread is more power and abstraction on offer, but also more places for cost overruns, dead services, and security footguns if infra is not tightly controlled.
What This Means
The LLM layer is starting to look like just another part of the backend stack, sitting next to Postgres and Kafka, with real choices to make about where it runs, how it retrieves data, and how much it silently costs. The teams that treat agents, models, and AI databases as ordinary infrastructure components with logs, limits, and migrations are the ones generating the clearest signals in this data.
On Watch
Interesting
We processed 10,000+ comments and posts to generate this report.
AI-generated content. Verify critical information independently.