The Insurance Premium
Here’s an uncomfortable finding from recent multi-agent research: decentralized LLM teams are slower than solo agents.
Chen et al. (2026) ran teams of one to five language model agents on collaborative coding tasks. Centralized teams, where a coordinator pre-assigns work, achieved a 1.36× speedup over a solo agent. Decentralized teams, where agents self-organize, claim tasks, and negotiate, achieved 0.88×. Below one. More agents, less output.
The decentralized teams spent their time messaging each other. Communication overhead scaled as O(n²), because every agent could potentially talk to every other agent. The centralized teams kept communication linear: one coordinator, n workers, done. The decentralized teams also introduced consistency violations that the centralized teams avoided entirely: concurrent writes destroying each other’s work, agents overwriting completed tasks, and agents starting dependent work before its prerequisites finished.
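To make the scaling argument concrete, here is a minimal sketch that counts potential communication channels under the two topologies. The function names are illustrative, and the model assumes every pair of agents in a decentralized team may talk, while centralized workers talk only to a separate coordinator. It counts channels, not the message volumes measured in the study.

```python
def mesh_links(n: int) -> int:
    """Potential pairwise channels among n fully connected agents: n(n-1)/2."""
    return n * (n - 1) // 2


def star_links(n: int) -> int:
    """Channels when n workers each talk only to one coordinator: n."""
    return n


if __name__ == "__main__":
    # Team sizes match the one-to-five range described above.
    for n in range(1, 6):
        print(f"n={n}: mesh={mesh_links(n)} star={star_links(n)}")
```

At five agents the mesh already has twice as many potential channels as the star, and the ratio keeps growing with n.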
If you stopped reading here, the conclusion would be obvious: centralize everything. But there’s a catch.
The Straggler Problem
Centralized teams had a significantly larger “straggler gap”: the difference between the slowest agent’s completion time and the mean completion time. When a centralized coordinator assigns work upfront, the plan finishes only when its slowest agent does. No rebalancing, no dynamic reallocation. One slow agent delays everything.
Decentralized teams handled stragglers naturally. Faster agents picked up unfinished work. The dynamic allocation absorbed individual variance. Straggler gap: 1.42 seconds (decentralized) vs. 2.64 seconds (centralized).
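The mechanism can be sketched with a toy scheduler. This is a simplified model, not the experimental setup from the study: the round-robin pre-assignment, the greedy "whoever is free takes the next task" rule, and the per-agent slowdown factors are all assumptions chosen for illustration, and coordination cost is ignored entirely.

```python
import heapq


def static_finish_times(durations, slowdowns):
    """Pre-assign tasks round-robin; each agent works through its own list."""
    finish = [0.0] * len(slowdowns)
    for i, duration in enumerate(durations):
        agent = i % len(slowdowns)
        finish[agent] += duration * slowdowns[agent]
    return finish


def dynamic_finish_times(durations, slowdowns):
    """Each task goes to whichever agent becomes free first."""
    free_at = [(0.0, agent) for agent in range(len(slowdowns))]
    heapq.heapify(free_at)
    finish = [0.0] * len(slowdowns)
    for duration in durations:
        now, agent = heapq.heappop(free_at)
        now += duration * slowdowns[agent]
        finish[agent] = now
        heapq.heappush(free_at, (now, agent))
    return finish


def straggler_gap(finish):
    """Slowest agent's finish time minus the mean finish time."""
    return max(finish) - sum(finish) / len(finish)


if __name__ == "__main__":
    tasks = [1.0] * 12                 # twelve equal tasks
    slowdowns = [1.0, 1.0, 1.0, 4.0]   # the last agent is 4x slower
    for name, plan in (("static", static_finish_times),
                       ("dynamic", dynamic_finish_times)):
        finish = plan(tasks, slowdowns)
        print(name, "makespan:", max(finish), "gap:", straggler_gap(finish))
```

With one agent four times slower than the rest, the static plan waits on that agent's full share of the work, while the dynamic plan lets the fast agents absorb most of it.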
This is the insurance premium. Decentralized coordination costs more in communication overhead. It introduces failure modes that centralized systems don’t have. But when individual components are unreliable — when failures happen — the decentralized architecture absorbs them.
SPEAR and the Recovery Hierarchy
A separate study on multi-agent smart contract auditing (SPEAR, 2026) quantified the contribution of each coordination component under systematic failure injection:
- Self-healing (programmatic repair): 3.2× recovery speedup
- Autonomy (agents maintaining local state through partitions): 2.1×
- Coordination protocols (negotiation, task allocation): 1.8×
The coordination protocols — the thing you’d intuitively call “coordination” — contributed the least. The dominant factor was the ability to recover from failures deterministically. And 64% of those recoveries were programmatic pattern-matching, not generative reasoning. The failures that actually blocked multi-agent systems weren’t subtle semantic misunderstandings. They were missing imports, compilation errors, malformed artifacts. Mundane stuff that didn’t need intelligence to fix — it needed a protocol.
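A minimal sketch of what "programmatic pattern-matching" recovery could look like: a rule table that maps error-log patterns to deterministic repair actions and only falls back to a model when nothing matches. The patterns and action names here are hypothetical and are not taken from SPEAR.

```python
import re

# Hypothetical rule table: (pattern over the error log, repair action name).
REPAIR_RULES = [
    (re.compile(r"No module named '([\w.]+)'"), "add_missing_dependency"),
    (re.compile(r"SyntaxError|compilation failed", re.IGNORECASE),
     "rerun_formatter_and_compile"),
    (re.compile(r"JSONDecodeError|malformed", re.IGNORECASE),
     "regenerate_artifact_from_schema"),
]


def choose_repair(error_log: str):
    """Return (action, detail) for the first matching rule.

    Falls back to ("escalate_to_model", None) when no rule matches,
    which is the only path that needs generative reasoning.
    """
    for pattern, action in REPAIR_RULES:
        match = pattern.search(error_log)
        if match:
            detail = match.group(1) if match.groups() else None
            return action, detail
    return "escalate_to_model", None


if __name__ == "__main__":
    print(choose_repair("ModuleNotFoundError: No module named 'audit_utils'"))
    print(choose_repair("The agent misread the requirements"))
```

The design choice is that every rule is cheap, repeatable, and auditable, so the common mundane failures never consume a model call.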
Under failure-free conditions, SPEAR’s centralized scheduler matched the multi-agent system’s performance. The multi-agent advantage existed only when things went wrong.
The Decision Framework
This gives us a clean way to think about when coordination overhead is worth paying.
If your system’s components are reliable and your tasks are well-defined, centralize. The overhead of decentralized coordination is pure waste — communication cost with no corresponding benefit. A single coordinator assigning work will outperform a swarm of agents negotiating with each other.
If your components are unreliable, your tasks are ambiguous, or your environment is hostile — then the communication overhead is insurance, and insurance pays off exactly when you need it most. The O(n²) message cost buys you straggler absorption, dynamic reallocation, and failure recovery.
The mistake is treating this as an ideology rather than an engineering decision. “Distributed systems are better” is as wrong as “centralized systems are better.” The right architecture depends on your failure profile. How often do things break? How expensive is a straggler? How costly is misalignment compared to communication overhead?
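Those three questions can be folded into a back-of-the-envelope break-even check. This is a toy expected-cost model, not something from either study: it assumes a single failure rate per task, a fixed coordination overhead per task, and known costs for a failure with and without the decentralized safety net. All parameter names are illustrative.

```python
def break_even_failure_rate(overhead, unabsorbed_cost, absorbed_cost):
    """Failure rate above which the coordination overhead pays for itself.

    overhead:        extra cost per task of decentralized coordination
    unabsorbed_cost: cost of a failure when nothing rebalances or recovers
    absorbed_cost:   cost of the same failure when the team absorbs it
    """
    savings = unabsorbed_cost - absorbed_cost
    if savings <= 0:
        return float("inf")  # the insurance never pays out
    return overhead / savings


def worth_paying(failure_rate, overhead, unabsorbed_cost, absorbed_cost):
    """True when expected savings from absorbed failures exceed the overhead."""
    return failure_rate * (unabsorbed_cost - absorbed_cost) > overhead


if __name__ == "__main__":
    # Example numbers are made up purely to exercise the formula.
    print(break_even_failure_rate(2.0, 10.0, 2.0))
    print(worth_paying(0.5, 2.0, 10.0, 2.0))
    print(worth_paying(0.1, 2.0, 10.0, 2.0))
```

The point of the sketch is the shape of the decision, not the numbers: below the break-even failure rate the overhead is waste, and above it the overhead is cheap insurance.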
The Uncomfortable Implication
Most real systems are hybrid. A company has a centralized hierarchy for strategic decisions and decentralized teams for execution. A distributed database often funnels agreement through a leader-based consensus protocol while spreading storage across many nodes. The immune system has centralized coordination (thymic selection, cytokine signaling) and decentralized execution (individual immune cells making local decisions).
The reason isn’t that anyone designed it this way. It’s that purely centralized systems can’t handle local failures fast enough, and purely decentralized systems can’t maintain consistency cheaply enough. The hybrid emerges because the insurance premium has to be paid somewhere — the question is where the failure profile demands it and where it doesn’t.
When someone proposes adding coordination infrastructure, the first question shouldn’t be “will this improve throughput?” It should be “what failures does this protect against, and how expensive are those failures compared to the overhead?”
If the answer is “we don’t have those failures,” the coordination is waste.
If the answer is “we have those failures but we’ve been ignoring them,” the coordination is overdue.