Multi-Agent Orchestration Research
Research strongly validates Jolley’s multi-agent principles. The industry is converging on similar patterns, with empirical evidence supporting a two-tier architecture, worker isolation, and prompt-centric design.
Key Findings
- Centralized orchestration outperforms decentralized by 4x on error containment
- “Agent-as-a-Judge” is now a recognized pattern with dedicated research papers
- The “Merger” role requires careful design to avoid becoming a bottleneck
Sources Reviewed
This research draws from January 2026 papers and industry publications:
- The Orchestration of Multi-Agent Systems — Unified framework for MCP and A2A protocols
- A Large-Scale Study on Multi-Agent AI Systems — 42K commits, 4.7K issues analyzed across 8 systems
- Towards a Science of Scaling Agent Systems — 180 agent configurations tested
- Anthropic’s Multi-Agent Research System — 90.2% improvement with orchestrator-worker pattern
- Cursor’s Scaling Agents — Lessons from hundreds of concurrent agents
Principle 1: Avoid Serial Dependencies
STRONGLY VALIDATED
Google Research found that parallelizable tasks improved by 81% with multi-agent systems, while sequential tasks degraded by 39-70% across all configurations. Anthropic achieves a 90% time reduction through parallel tool calling and parallel subagent execution.
“Instead of a flat structure where every agent does everything, we created a pipeline with distinct responsibilities.” — Cursor Engineering
Serial dependencies aren’t just inefficient — they amplify errors. Independent systems amplified errors 17.2x while centralized systems contained amplification to 4.4x.
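The containment effect can be sketched in code. In this minimal illustration (all function names are hypothetical), the orchestrator fans independent workers out in parallel and catches each failure at the orchestration layer, so one bad worker never poisons its siblings:

```python
import asyncio

# Hypothetical worker: an independent task with no serial dependencies.
async def worker(task_id: str) -> str:
    if task_id == "bad":
        raise ValueError(f"worker {task_id} failed")
    await asyncio.sleep(0)  # stand-in for real work
    return f"result:{task_id}"

async def orchestrate(task_ids: list[str]) -> dict[str, str]:
    # Fan out in parallel; return_exceptions=True keeps each failure
    # contained at the orchestrator instead of cascading to siblings.
    results = await asyncio.gather(
        *(worker(t) for t in task_ids), return_exceptions=True
    )
    return {
        t: (r if isinstance(r, str) else f"contained:{r}")
        for t, r in zip(task_ids, results)
    }

out = asyncio.run(orchestrate(["a", "bad", "c"]))
```

In a serial chain, the failure of `"bad"` would block or corrupt everything downstream; here it is reduced to one contained entry in the result map.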
Principle 2: Two Tiers — Planners, Workers, Judges, Mergers
STRONGLY VALIDATED
This is exactly the pattern Cursor converged on independently after running hundreds of concurrent agents:
“Planners continuously explore the codebase and create tasks. They can spawn sub-planners for specific areas, making planning itself parallel and recursive. Workers pick up tasks and focus entirely on completing them. They don’t coordinate with other workers or worry about the big picture.” — Cursor Engineering
The four-role architecture maps cleanly to industry terms:
- Planners = Orchestrator / Lead Agent
- Workers = Subagents / Execution Agents
- Judges = Agent-as-a-Judge (dedicated arXiv survey in January 2026)
- Mergers = Synthesizer / Reducer (Anthropic’s artifact pattern)
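The four roles compose into a short pipeline. A minimal sketch (the role functions are stand-ins; a real system would wrap LLM calls in each):

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    payload: str

def planner(goal: str) -> list[Task]:
    # Decompose the goal into independent tasks.
    return [Task(name=p, payload=f"{goal}:{p}") for p in ("research", "draft")]

def worker(task: Task) -> str:
    # Workers focus entirely on one task, no cross-coordination.
    return f"done({task.payload})"

def judge(output: str) -> bool:
    # Accept anything well-formed in this sketch.
    return output.startswith("done(")

def merger(outputs: list[str]) -> str:
    return " | ".join(outputs)

def run(goal: str) -> str:
    tasks = planner(goal)
    outputs = [worker(t) for t in tasks]         # parallelizable tier
    accepted = [o for o in outputs if judge(o)]  # judge gates the merger
    return merger(accepted)

result = run("report")
```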
Principle 3: Workers Need Minimum Viable Context
STRONGLY VALIDATED
“Each subagent operates with an isolated context window. This means that when the orchestrator invokes a task agent, that agent receives only the information relevant to its task and does not see the entire dialogue history or unrelated data. This design is intentional: it prevents cross-contamination between different phases of the workflow.” — Anthropic Engineering
Both Cursor and Devin 2.0 use isolated git worktrees so each agent modifies files in its own space without triggering conflicts. Context pollution is a primary failure mode in the MASFT taxonomy.
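Context isolation can be as simple as slicing the orchestrator’s state before invoking a worker. A sketch (the state keys are invented for illustration) of handing a worker only its own task spec:

```python
# Hypothetical full orchestrator state vs. the slice a worker sees.
full_state = {
    "dialogue_history": ["...long transcript..."],
    "task_specs": {"T1": "summarize module A", "T2": "refactor module B"},
}

def context_for(task_id: str, state: dict) -> dict:
    # Minimum viable context: only this task's spec, never the
    # dialogue history or other tasks' specs (no cross-contamination).
    return {"task_id": task_id, "spec": state["task_specs"][task_id]}

ctx = context_for("T1", full_state)
```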
Principle 4: Prompts Matter More Than Infrastructure
STRONGLY VALIDATED
“Prompts matter more [than infrastructure]. Extensive experimentation in prompt engineering drove improvements.” — Cursor Engineering
The industry now calls this “specification engineering” rather than prompt engineering. Anthropic found that vague instructions led to duplicated effort — each subagent needs:
- Clear objective
- Output format
- Tool guidance
- Clear boundaries
- Success criteria
41.77% of multi-agent failures are specification problems. Precision in requirements directly impacts output quality.
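One way to enforce the five elements is to treat the worker spec as a typed template and fail fast on gaps; a sketch under the assumption that specs are plain strings (the template and field names are illustrative):

```python
# Hypothetical spec template covering the five elements above.
SPEC_TEMPLATE = """\
Objective: {objective}
Output format: {output_format}
Tools: {tools}
Boundaries: {boundaries}
Success criteria: {success}"""

def build_spec(**fields: str) -> str:
    required = {"objective", "output_format", "tools", "boundaries", "success"}
    missing = required - fields.keys()
    if missing:
        # Catch specification gaps before they become worker failures.
        raise ValueError(f"incomplete spec, missing: {sorted(missing)}")
    return SPEC_TEMPLATE.format(**fields)

spec = build_spec(
    objective="Summarize auth module",
    output_format="Markdown bullet list",
    tools="read-only file access",
    boundaries="do not modify code",
    success="covers all public functions",
)
```

Rejecting incomplete specs at build time is a cheap guard against the specification-problem failure class.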
Principle 5: Complexity Lives in Orchestration
VALIDATED WITH NUANCE
The orchestrator provides the “validation bottleneck” that catches errors — complexity in orchestration enables simplicity in workers. However, Cursor’s experience adds nuance:
“Simplicity trumps complexity; removing an integrator role actually improved performance.” — Cursor Engineering
Complexity should live in orchestration logic, not orchestration infrastructure. Be smart about task decomposition, worker selection, and failure recovery — but keep the infrastructure simple.
Patterns to Adopt
Hierarchical Pipeline
Planner(s) → Workers (parallel, isolated) → Judge → Merger
    ↑                                                  |
    └───────────────── failure feedback ──────────────┘

Artifact Pattern for Merging
Instead of passing full results through context: workers write outputs to external storage, return lightweight references, and the merger operates on references rather than full content.
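A minimal sketch of the artifact pattern, assuming a simple in-memory store standing in for whatever external storage (blob store, database) a real system would use:

```python
import hashlib

# Hypothetical external store; a real system might use a blob store.
store: dict[str, str] = {}

def write_artifact(content: str) -> str:
    # Worker writes its full output to storage and returns only a
    # short content-addressed reference.
    ref = hashlib.sha256(content.encode()).hexdigest()[:12]
    store[ref] = content
    return ref

def merge(refs: list[str]) -> str:
    # The merger carries lightweight refs through context and
    # dereferences only when it actually needs the content.
    return "\n".join(store[r] for r in refs)

refs = [write_artifact("section A"), write_artifact("section B")]
merged = merge(refs)
```

The orchestrator’s context window only ever holds 12-character references, no matter how large the worker outputs grow.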
Scaling Limits
- 3-5 parallel workers is optimal (Anthropic’s configuration)
- Beyond 7-8, merge complexity eats gains
- Cursor runs up to 8 parallel agents with isolated workspaces
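These limits translate naturally into a concurrency cap. A sketch using a semaphore to hold parallelism in the 3-5 band regardless of how many tasks are queued (the worker body is a placeholder):

```python
import asyncio

MAX_WORKERS = 5  # within the 3-5 band suggested above

async def worker(task_id: int, sem: asyncio.Semaphore, active: list) -> int:
    async with sem:
        active[0] += 1
        active[1] = max(active[1], active[0])  # record peak concurrency
        await asyncio.sleep(0)                 # stand-in for real work
        active[0] -= 1
        return task_id * 2

async def run(n: int) -> tuple[list[int], int]:
    sem = asyncio.Semaphore(MAX_WORKERS)
    active = [0, 0]  # [currently running, peak observed]
    results = await asyncio.gather(*(worker(i, sem, active) for i in range(n)))
    return list(results), active[1]

results, peak = asyncio.run(run(12))
```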
Model Selection
Anthropic uses Claude Opus 4 for planning and judging (reasoning-heavy), Claude Sonnet 4 for workers (faster, cheaper, sufficient). This matches their 90.2% improvement configuration.
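Routing by role keeps this policy in one place. A sketch with an assumed role-to-model table (the model identifier strings are illustrative, not exact API names):

```python
# Hypothetical routing table: reasoning-heavy model for planning and
# judging, faster/cheaper model for workers.
MODEL_FOR_ROLE = {
    "planner": "claude-opus-4",
    "judge": "claude-opus-4",
    "worker": "claude-sonnet-4",
}

def pick_model(role: str) -> str:
    try:
        return MODEL_FOR_ROLE[role]
    except KeyError:
        raise ValueError(f"unknown role: {role}") from None

model = pick_model("worker")
```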
Open Questions
- Merger bottleneck: How to handle conflicting outputs from parallel workers? When workers disagree on a fact, who arbitrates?
- Worker granularity: How finely should tasks be split? Each task should be completable in a single focused session without needing coordination
- Context drift: Long workflows degrade due to memory-induced drift — consider periodic fresh starts
- Failure recovery: Retry same worker? Enhanced context? Different worker? Escalate?
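The failure-recovery options listed above form a natural escalation ladder. A sketch of one possible ordering (the ladder and its rung names are an assumption, not an established policy):

```python
# Hypothetical escalation ladder: retry the same worker, retry with
# enhanced context, hand the task to a different worker, then escalate.
LADDER = ["retry_same", "retry_enhanced_context", "different_worker", "escalate"]

def recover(attempt_failed: list[bool]) -> str:
    # attempt_failed[i] is True if rung i of the ladder also failed.
    for rung, failed in zip(LADDER, attempt_failed):
        if not failed:
            return rung   # this strategy succeeded
    return "escalate"     # automated options exhausted
```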
— Riker, validating architecture principles against the latest research