
More Agents, Worse Results: What We Learned Running Multi-Agent Systems Every Day
By Conny Lazo
Builder of AI orchestras. Project Manager. Shipping things with agents.
I didn't read a paper and change how I build. I built something, watched it break in a specific way, figured out why, fixed it — and then read a paper that described what sounded like the same problem at a much larger scale.
We run multi-agent orchestration as part of our workflow. Text Grand Central, our translation engine, coordinates five specialized agents. We've built sub-agent pipelines for research, coding, and content production. It's a modest setup — a 32GB RAM machine, not a data center — but it's enough to hit the same walls the research describes.
So when a Google/MIT research paper dropped in December 2025 titled "Towards a Science of Scaling Agent Systems" (arXiv:2512.08296), and a Cursor engineering blog post started circulating with the same conclusions we'd independently reached, I paid attention. Because the headline conclusion is counterintuitive, and counterintuitive things that turn out to be true are worth understanding.
The headline: adding more agents can actively make your system worse.
Not "diminishing returns." Not "it gets expensive." Worse. As in, fewer correct answers, more errors, less throughput.
Here's what the research actually shows — and what I've seen with my own hands.
The Research Is More Nuanced Than the Headline
Before I go further: this is not an argument against multi-agent systems. It's an argument for building them correctly.
The Google/MIT paper ran 180 experiments across five architectures and three LLM families. The core finding has two parts that you need to hold together:
- Sequential tasks: Every multi-agent variant degraded performance by 39–70% compared to a capable single agent.
- Parallelizable tasks: Multi-agent architectures improved performance by +80.8%.
That's the whole story in two bullets. Multi-agent systems are powerful — for the right workloads. They're destructive for the wrong ones.
The paper also reports — across its specific benchmarks (PlanCraft, Workbench, financial analysis) — that independent agents amplify errors 17.2x compared to single agents; that capability saturation kicks in at roughly 45% single-agent accuracy, below which coordination overhead destroys value; and that tool-heavy tasks carry a 2–6x efficiency penalty in multi-agent setups. These are benchmark results, not universal constants — but the patterns match what practitioners are seeing in production.
These technical failure modes have business consequences. Gartner estimates that over 40% of agentic AI projects will be cancelled by the end of 2027 — citing escalating costs, unclear business value, and inadequate risk controls. The research suggests a specific reason: teams are scaling architectures that work at 3 agents but collapse at 30, and by the time they discover the problem, they've already invested heavily.
We Hit the Same Wall Before We Knew the Research Existed
The specific failure mode that convinced me was this: we had a set of related coding sub-agents working in parallel on an interconnected codebase. Each agent did its job. Each one produced reasonable output. And when we tried to merge, we had overlapping mega-PRs with conflicts that took longer to resolve than it would have taken one agent to do the whole thing sequentially.
The experience rhymes with what Cursor documented at a much larger scale: 20 flat agents produce the output of 2–3.
Peer coordination doesn't scale. The fundamental problem is that agents operating as a "flat team" — sharing context, aware of each other, trying to coordinate — introduce communication overhead that compounds with every agent you add. Distributed systems engineers have known this for decades. What surprised me was how directly the same principles apply to LLM agents.
The Five Rules We Now Build By
Cursor, Steve Yegge (a veteran Google/Amazon engineer whose Gas Town framework — his fourth orchestrator attempt after three failures — documents orchestrating 20–30 Claude Code instances), and the Google/MIT research all converge on principles we've been learning the hard way at a smaller scale:
1. Two Tiers, Not Teams
Don't build a flat team of agents. Build a hierarchy: one orchestrator that plans, and workers that execute. The orchestrator holds the big picture. Workers don't need it. We run three tiers in practice — Opus for planning, Sonnet for substantive work, Haiku for mechanical tasks — but the key insight is the same: hierarchy over peer coordination.
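The two-tier shape needs no framework; it fits in a few dozen lines. Here is a minimal Python sketch of the idea — all names (`Subtask`, `worker`, `orchestrator`) are mine, not from the paper or Cursor's post:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    task_id: str
    instructions: str  # minimum viable context; nothing else travels down

def worker(subtask: Subtask) -> dict:
    # A worker sees only its own Subtask; it has no channel to its peers.
    return {"task_id": subtask.task_id, "result": f"done: {subtask.instructions}"}

def orchestrator(goal: str, split) -> list:
    # The orchestrator holds the big picture: it plans, fans out, integrates.
    subtasks = split(goal)
    return [worker(st) for st in subtasks]  # could just as well run in parallel

results = orchestrator(
    "translate three chapters",
    lambda goal: [Subtask(f"ch{i}", f"{goal}, part {i}") for i in range(3)],
)
```

Each worker has exactly one interface, the orchestrator, so adding a worker adds one link rather than a new web of peer channels.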
2. Workers Stay Ignorant
If you've built distributed systems, this will sound familiar: stateless workers with minimum viable context. Each worker agent gets exactly enough to complete its specific task — not the full picture, not the history, not what other agents are doing. The principle isn't new; it works for the same reason it works in microservices: shared context creates coupling, and coupling kills parallelism. We give each sub-agent a precise task prompt with constraints — and nothing else.
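One concrete way to enforce worker ignorance is to build the worker's context by construction rather than by discipline: a function that takes the orchestrator's full state and returns only the slice a given worker needs. A hypothetical sketch (the field names are invented for illustration):

```python
# The orchestrator's state holds everything; a worker's context is cut
# down to exactly its task, by construction.
project_state = {
    "history": ["long transcript of prior runs"],   # workers never see this
    "peers": ["agent-1", "agent-2"],                # or this
    "files": {"a.py": "print('a')", "b.py": "print('b')"},
}

def worker_context(state: dict, target_file: str) -> dict:
    # Deliberately exclude history and peer awareness: shared context
    # creates coupling, and coupling kills parallelism.
    return {
        "task": f"refactor {target_file}",
        "file_contents": state["files"][target_file],
        "constraints": ["no new dependencies", "keep the public API stable"],
    }

ctx = worker_context(project_state, "a.py")
```

If the context builder never copies the history or peer list, no prompt-engineering lapse can leak them to a worker.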
3. No Shared State
Same principle, different layer — and another one distributed systems engineers will recognize: shared-nothing architecture. Workers don't coordinate with each other. They deliver results to an external system — a git branch, a PR queue, a memory file — and something else (the orchestrator, a merge process) handles integration. When agents share state directly, you get exactly the conflict problem we hit. When they write to isolated outputs and merge externally, the conflicts are tractable.
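Sketched in Python (hypothetical names): each worker returns a result keyed to its own isolated output slot, and integration happens once, in a separate merge step — the analogue of merging isolated git branches through a PR queue:

```python
def run_worker(worker_id: str, payload: str) -> tuple:
    # Returns (output_key, result); no shared structure is ever touched.
    return (f"out/{worker_id}", payload.upper())

def merge(deliveries: list) -> dict:
    # Integration happens in one place, after all workers finish.
    merged = {}
    for key, result in deliveries:
        if key in merged:
            raise ValueError(f"collision on {key}: outputs were not isolated")
        merged[key] = result
    return merged

deliveries = [run_worker(f"w{i}", f"chunk-{i}") for i in range(3)]
merged = merge(deliveries)
```

Because the output keys are disjoint by construction, the merge step can detect a violation immediately instead of discovering it as a tangle of conflicts later.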
4. Plan for Endings
This one sounds philosophical but it's deeply practical. Agents operating in long-running loops accumulate context window debt. Performance degrades. Errors compound. The solution is episodic operation: agents run, complete a defined unit of work, and terminate. State is committed externally before termination. The next run starts fresh.
We use Ralph scripts for this — autonomous Claude Code loops with built-in context resets. Steve Yegge's Gas Town framework solves the same problem differently: workflow state is captured as "molecules" (chained issues in a git-backed tracker called Beads), so if an agent crashes, compacts, or simply ends, the next session picks up the molecule where it left off. Yegge calls this GUPP — the Gas Town Universal Propulsion Principle. Ralph wipes context and restarts fresh; Gas Town preserves workflow state externally and resumes. Different mechanisms, same core insight: design for endings, not against them.
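The episodic pattern itself is simple to sketch. In this hypothetical Python version (not Ralph or Gas Town, just the shape they share), each "episode" does one unit of work, commits state to an external file, and terminates; the next episode starts fresh and reloads state from disk:

```python
import json, os, tempfile

STATE_FILE = os.path.join(tempfile.gettempdir(), "agent_state_demo.json")

def load_state() -> dict:
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"done": []}

def run_episode(unit: str) -> None:
    state = load_state()            # fresh start: no context-window debt
    state["done"].append(unit)      # one defined unit of work
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)         # commit externally *before* terminating

if os.path.exists(STATE_FILE):
    os.remove(STATE_FILE)           # clean slate for the demo
for unit in ["step-1", "step-2"]:   # each iteration stands in for a new session
    run_episode(unit)
```

Because state lives outside the agent, a crash mid-episode costs you at most one unit of work, never the whole run.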
5. Prompts Over Infrastructure
This is the one that changes how you spend your time.
A study of multi-agent LLM system failures (Cemri et al., arXiv:2503.13657) found that 79% trace back to specification and coordination issues. Only 16% are infrastructure failures. That means if your multi-agent system is broken, the problem is almost certainly in how you defined the task and the coordination protocol — not in your deployment stack, your error handling, or your retry logic.
We've experienced this. The orchestration engine with error recovery and concurrency management matters — but a bad task spec will defeat good infrastructure every time. Write the prompt like a contract. Define inputs, outputs, constraints, and exit conditions. Ambiguity compounds across agents.
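"Write the prompt like a contract" can be made mechanical. A sketch of one way to do it (the field names are mine): a task spec that refuses to render with any section missing, so ambiguity fails loudly before it reaches a worker.

```python
from dataclasses import dataclass, fields

@dataclass
class TaskSpec:
    inputs: str
    outputs: str
    constraints: str
    exit_conditions: str

    def render(self) -> str:
        # Refuse to produce a prompt with any empty section.
        for f in fields(self):
            if not getattr(self, f.name).strip():
                raise ValueError(f"ambiguous spec: '{f.name}' is empty")
        return (
            f"INPUTS: {self.inputs}\n"
            f"OUTPUTS: {self.outputs}\n"
            f"CONSTRAINTS: {self.constraints}\n"
            f"EXIT WHEN: {self.exit_conditions}"
        )

spec = TaskSpec(
    inputs="source file src/parser.py",
    outputs="a single PR touching only src/parser.py",
    constraints="no new dependencies; keep tests green",
    exit_conditions="all existing tests pass",
)
prompt = spec.render()
```

A single agent fills a vague spec with one set of guesses; a fleet fills it with incompatible ones, which is why the check pays for itself at scale.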
When to Use Multi-Agent Systems (And When Not To)
The research gives us a clear decision framework:
Use multi-agent for:
- Tasks that can be parallelized across independent subtasks
- Work where each unit is self-contained (translate this document, review this PR, analyze this dataset)
- Scale-out scenarios where you need volume, not coordination
Don't use multi-agent for:
- Tasks with sequential dependencies where each step depends on the previous
- Tool-heavy workflows (the 2–6x penalty is real and will kill your throughput)
- Any situation where your single-agent accuracy is below ~45% — fix the baseline first
The throughput numbers are stark: single agents handle 67 tasks per 1,000 tokens; centralized multi-agent architectures handle 21. If your task is sequential, you're paying roughly three times the tokens to get worse results.
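The decision framework above collapses into a small routing function. This encoding is my own; the 45% threshold and the tool-heavy penalty come from the paper's benchmarks, so treat them as rough guides rather than universal constants:

```python
def choose_architecture(parallelizable: bool,
                        sequential_deps: bool,
                        tool_heavy: bool,
                        single_agent_accuracy: float) -> str:
    # Below ~45% single-agent accuracy, coordination overhead destroys value.
    if single_agent_accuracy < 0.45:
        return "single-agent (fix the baseline first)"
    # Sequential dependencies and heavy tool use both punish coordination.
    if sequential_deps or tool_heavy:
        return "single-agent"
    if parallelizable:
        return "multi-agent (orchestrator + workers)"
    return "single-agent"
```

Note the ordering: the baseline check comes first, because no architecture choice rescues a task the underlying model can't do reliably.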
What I Actually Learned
Running these systems has taught me things the research confirms but doesn't quite capture:
Hierarchy exists to collapse coordination overhead. In a flat team, every agent potentially interacts with every other — that's combinatorial growth. In a two-tier system, each worker has exactly one interface: the orchestrator. That's the whole point.
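The collapse is easy to make concrete with simple combinatorics (this is arithmetic, not a result from the paper): a flat team of n agents has n(n−1)/2 potential peer channels, while a two-tier system has exactly n worker-to-orchestrator links.

```python
def flat_channels(n_agents: int) -> int:
    # Every agent may interact with every other: quadratic growth.
    return n_agents * (n_agents - 1) // 2

def two_tier_channels(n_workers: int) -> int:
    # Each worker talks only to the orchestrator: linear growth.
    return n_workers
```

At 20 agents that is 190 potential peer channels versus 20 links, which is roughly the scale at which Cursor saw flat teams fall apart.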
The paper validates what works, not just what fails. The +80.8% improvement on parallelizable tasks is real. We see it in content pipelines, in translation workflows, in research tasks. Multi-agent isn't the problem. Misapplied multi-agent is the problem.
Prompts are load-bearing. When something breaks in one of our agent workflows, it's almost always a spec problem. The agent did exactly what we asked. We asked for the wrong thing, or asked for the right thing ambiguously. Writing a good agent prompt takes longer than writing a good function signature — treat it with the same rigor.
The Honest Bottom Line
Gartner will probably be right that over 40% of agentic AI projects get cancelled by 2027. But I don't think those projects fail because multi-agent AI is overhyped. I think they fail because teams scale before they understand what they're scaling.
If your single agent can't do the task reliably, adding more agents multiplies the failure. If your task is sequential, coordination overhead will eat your gains. If your prompts are vague, workers will fill in the gaps in incompatible ways.
The research from Google and MIT, the lessons from Cursor, the Gas Town framework from Steve Yegge — they all point at the same thing: multi-agent systems are an architectural pattern, not a scaling shortcut. Apply the pattern correctly and you get real leverage. Apply it incorrectly and you get 17x error amplification.
We got there by building and breaking things. The research would have saved us time — but I'm not sure we'd have understood it without the scars.
Sources & Inspiration
- Towards a Science of Scaling Agent Systems — Google/MIT (arXiv:2512.08296, December 2025) — The 180-experiment study on multi-agent scaling dynamics
- Steve Yegge — Gas Town Framework — Multi-agent orchestration for Claude Code instances
- Cursor Engineering Blog — Scaling long-running autonomous coding (January 2026) — Cursor's documented journey from flat teams to hierarchical agents
- Gartner — Over 40% of Agentic AI Projects Will Be Cancelled by End of 2027 (June 2025)
- Cemri et al. — Why Do Multi-Agent LLM Systems Fail? (arXiv:2503.13657, March 2025) — Failure analysis: 79% spec/coordination, 16% infrastructure
- Anthropic Claude Documentation — Orchestration patterns