design pattern 2025-01-20 16 min read

Multi-Agent LLM Systems: Architecture Patterns for Production

How to design, build, and operate multi-agent LLM systems at scale — covering orchestration patterns, communication protocols, failure handling, and lessons from production deployments.

Tags: multi-agent LLM, orchestration, agents, architecture, production AI systems

Introduction

Multi-agent LLM systems — where multiple AI models collaborate to complete complex tasks — have moved from research demos to production systems at companies like Cognition (the team behind Devin) and numerous enterprise software providers. Building these systems reliably at scale requires careful architectural thinking.

This post covers the core patterns, failure modes, and engineering practices for multi-agent systems in production.

Why Multi-Agent?

Single-agent architectures hit limits when tasks require:

  • Context beyond a single model's window: A legal review spanning thousands of documents
  • Parallelism: Running simultaneous research threads
  • Specialization: Different models optimized for different subtasks (coding vs. analysis vs. writing)
  • Verification: Independent models checking each other's work
  • Long-horizon tasks: Multi-step plans where early decisions affect later ones

Multi-agent architectures address these by decomposing work across multiple models.

Core Architectural Patterns

1. Orchestrator-Subagent

The most common pattern: one central orchestrator agent plans and delegates to specialized subagents.

User → [Orchestrator]
            ├── [Research Agent]
            ├── [Code Agent]
            └── [Verification Agent]

Orchestrator responsibilities:

  • Decompose the task into subtasks
  • Assign subtasks to appropriate agents
  • Track progress and handle failures
  • Synthesize results into a coherent output

Tradeoffs:

  • Single point of failure (orchestrator)
  • Bottleneck if orchestrator is slow
  • Excellent for tasks with clear hierarchical decomposition

Example: A software engineering agent that delegates to a code writer, a test writer, a debugger, and a documentation writer.
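The delegation loop at the heart of this pattern can be sketched in a few lines. The subagent registry, the hard-coded plan, and the string outputs below are hypothetical stand-ins — in a real system each entry wraps an LLM call and the orchestrator asks a model to produce the plan:

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical subagent registry: role -> callable that "runs" the agent.
SUBAGENTS: Dict[str, Callable[[str], str]] = {
    "research": lambda task: f"research notes for: {task}",
    "code": lambda task: f"code for: {task}",
    "verify": lambda task: f"verified: {task}",
}

def orchestrate(task: str) -> str:
    """Decompose a task, delegate to subagents, and synthesize results."""
    # A real orchestrator would ask a model to plan this decomposition.
    plan: List[Tuple[str, str]] = [("research", task), ("code", task), ("verify", task)]
    results = []
    for role, subtask in plan:
        results.append(SUBAGENTS[role](subtask))
    # Synthesis step: here just a join; in practice another model call.
    return "\n".join(results)
```

Note how the orchestrator is the single place that tracks progress — which is exactly why it is also the single point of failure mentioned above.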

2. Peer-to-Peer (Pipeline)

Agents are arranged as a pipeline, each processing and passing output to the next:

[Agent A: Research] → [Agent B: Draft] → [Agent C: Review] → [Agent D: Final]

Advantages:

  • Simple data flow
  • Easy to reason about state
  • Natural fit for sequential refinement workflows

Disadvantages:

  • No parallelism
  • Error propagation (A's mistake becomes B's input)
  • Latency adds up

Best for: document processing pipelines, code generation + review, content creation workflows.
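A pipeline reduces to function composition over a shared payload. The stages below are hypothetical placeholders for the Research → Draft → Review chain:

```python
from typing import Callable, List

def run_pipeline(stages: List[Callable[[str], str]], payload: str) -> str:
    """Each stage consumes the previous stage's output -- no parallelism,
    and any early mistake is passed downstream unchanged."""
    for stage in stages:
        payload = stage(payload)
    return payload

# Hypothetical stages standing in for Research -> Draft -> Review.
stages = [
    lambda s: s + " | researched",
    lambda s: s + " | drafted",
    lambda s: s + " | reviewed",
]
```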

3. Debate / Adversarial

Multiple agents with competing objectives check each other:

[Agent A: Propose solution]
[Agent B: Critique Agent A's solution]
[Agent A: Defend or revise]
[Judge Agent: Evaluate and decide]

Used for: factual verification, risk assessment, legal/financial analysis. Forces the system to surface assumptions and weaknesses.

Production note: This pattern is expensive (2-3x compute per task). Reserve for high-stakes decisions.
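The debate loop can be expressed as a small control structure. The `propose`, `critique`, and `judge` callables below are hypothetical — each would wrap a model call with its own role prompt:

```python
from typing import Callable

def debate(propose: Callable[[str], str],
           critique: Callable[[str], str],
           judge: Callable[[str, str], str],
           task: str,
           rounds: int = 2) -> str:
    """Propose -> critique -> revise loop, then a judge decides.
    Each extra round adds roughly one full task's worth of compute,
    so keep rounds small."""
    solution = propose(task)
    for _ in range(rounds):
        objections = critique(solution)
        # Revise the proposal in light of the critique.
        solution = propose(f"{task}\nAddress: {objections}")
    return judge(solution, task)
```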

4. Parallel Execution + Aggregation

Independent agents work simultaneously, results are aggregated:

                ┌── [Agent A: Approach 1] ──┐
[Input] ────────┼── [Agent B: Approach 2] ──┼── [Aggregator] → Output
                └── [Agent C: Approach 3] ──┘

Natural fit for: best-of-N generation, ensemble methods, research tasks with multiple dimensions.

Engineering consideration: The aggregator itself is a complexity sink — it needs to handle partial failures, disagreements, and synthesis from heterogeneous outputs.
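A minimal sketch of this pattern with `asyncio` — the three toy agents are hypothetical, and the aggregation here just drops failures, which is the simplest of the strategies an aggregator needs:

```python
import asyncio
from typing import List

async def run_parallel(agents, prompt: str) -> List[str]:
    """Run independent agents concurrently; the aggregator must tolerate
    partial failures, so exceptions are collected rather than raised."""
    results = await asyncio.gather(
        *(agent(prompt) for agent in agents), return_exceptions=True
    )
    # Aggregation step: drop failures, keep successful outputs in order.
    return [r for r in results if not isinstance(r, Exception)]

# Hypothetical agents: two succeed, one fails.
async def agent_a(p: str) -> str: return f"A:{p}"
async def agent_b(p: str) -> str: raise RuntimeError("timeout")
async def agent_c(p: str) -> str: return f"C:{p}"
```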

5. Hierarchical Decomposition

Recursive orchestration for complex tasks:

[Top Orchestrator]
    ├── [Sub-Orchestrator 1]
    │       ├── [Worker A]
    │       └── [Worker B]
    └── [Sub-Orchestrator 2]
            ├── [Worker C]
            └── [Worker D]

Scales to arbitrarily complex tasks but adds significant coordination overhead. Works best with strong task decomposition primitives.

Communication Protocols

Message Format Standards

All agents should communicate via a structured format:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentMessage:
    task_id: str              # unique identifier for the task
    sender: str               # agent ID
    recipient: str            # agent ID or "orchestrator"
    message_type: str         # "request" | "result" | "error" | "update"
    content: dict             # task-specific payload
    metadata: dict = field(default_factory=dict)   # latency, cost, confidence, etc.
    parent_message_id: Optional[str] = None        # for tracing

Standardized formats enable:

  • Logging and observability
  • Replay and debugging
  • Protocol evolution without breaking changes

Shared Memory vs. Message Passing

Shared memory (e.g., a vector store all agents can read/write):

  • Easy to implement
  • Risk of concurrent writes
  • Stale reads
  • Good for: reference data, long-term knowledge

Message passing (each agent only sees its own context):

  • Explicit data flow
  • Better isolation
  • Harder to share large artifacts
  • Good for: task coordination, status updates

Production systems often combine both: message passing for coordination, shared memory for large artifacts.
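The hybrid can be sketched with an inbox queue per agent for coordination and a plain key-value store for artifacts. All names below (`inboxes`, `artifact_store`, the helper functions) are illustrative, not a standard API:

```python
import queue
from typing import Dict

# Message passing for coordination: each agent has its own inbox queue.
inboxes: Dict[str, "queue.Queue[dict]"] = {"worker": queue.Queue()}

# Shared memory for large artifacts: messages carry only keys into it.
artifact_store: Dict[str, str] = {}

def send_artifact(recipient: str, artifact_id: str, data: str) -> None:
    """Write the large payload to shared memory, send a small pointer."""
    artifact_store[artifact_id] = data
    inboxes[recipient].put({"type": "result", "artifact": artifact_id})

def receive(agent: str) -> str:
    """Consume a coordination message and dereference the artifact."""
    msg = inboxes[agent].get()
    return artifact_store[msg["artifact"]]
```

This keeps coordination messages small and explicit while avoiding the cost of copying large artifacts through every hop.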

State Management

The State Problem

Long-running multi-agent tasks accumulate state that must be:

  • Persisted (for recovery from failures)
  • Accessible to the right agents
  • Consistent (no stale reads leading to duplicate work)

State Hierarchy

Task state (hours-long)
    └── Subtask state (minutes)
            └── Agent turn state (seconds)

Use different storage backends for each:

  • Task state: database (Postgres, DynamoDB)
  • Subtask state: Redis or in-memory with checkpointing
  • Agent turn state: in-context (LLM context window)

Checkpoint and Resume

For tasks that may take hours, agents must be able to resume from failure:

from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class TaskCheckpoint:
    task_id: str
    completed_subtasks: List[SubtaskResult]
    pending_subtasks: List[Subtask]
    context_snapshot: str     # compressed context
    created_at: datetime

On agent restart, load the latest checkpoint and resume. This requires idempotent operations — re-running a subtask shouldn't cause side effects.
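A minimal resume loop, using a simplified checkpoint and a hypothetical `run_subtask` callable — the key property is that only pending subtasks run, so loading the same checkpoint twice cannot duplicate completed work:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Checkpoint:
    task_id: str
    completed: List[str] = field(default_factory=list)
    pending: List[str] = field(default_factory=list)

def resume(cp: Checkpoint, run_subtask: Callable[[str], object]) -> Checkpoint:
    """Replay-safe resume: run only pending subtasks, moving each to
    completed as it finishes. run_subtask must itself be idempotent."""
    while cp.pending:
        subtask = cp.pending[0]
        run_subtask(subtask)
        cp.pending.pop(0)
        cp.completed.append(subtask)
        # In production: persist cp to durable storage after every subtask.
    return cp
```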

Failure Handling

Types of Failures

  1. Agent failure: Model returns an error or malformed output
  2. Timeout: Agent takes too long
  3. Deadlock: Agents waiting on each other circularly
  4. Semantic failure: Agent returns valid output that's wrong
  5. Context overflow: Accumulated context exceeds model limits

Retry Strategies

import random
import time

def retry_with_backoff(agent_call, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = agent_call()
            if is_valid(result):      # schema/semantic check
                return result
        except AgentError as e:
            if not is_retryable(e):
                raise
        # Back off before the next attempt, whether the failure was an
        # invalid result or a retryable error; jitter avoids thundering herds
        time.sleep(2 ** attempt + random.random())
    # Fallback: simpler agent or human escalation
    return fallback_handler(agent_call)

Deadlock Detection

In orchestrator-subagent systems, deadlocks occur when:

  • Agent A is waiting for Agent B's result
  • Agent B is waiting for Agent A's result

Prevention: maintain a dependency graph and detect cycles before dispatching. Most hierarchical systems prevent deadlocks by design (parent always waits on children, never vice versa).
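Cycle detection over the dependency graph is a standard DFS with three colors; a "gray" node reached again mid-visit means a cycle. `deps` maps each agent to the agents whose results it waits on:

```python
from typing import Dict, List

def has_cycle(deps: Dict[str, List[str]]) -> bool:
    """Return True if the agent dependency graph contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, in progress, finished
    color = {node: WHITE for node in deps}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for dep in deps.get(node, []):
            if color.get(dep, WHITE) == GRAY:   # back edge -> cycle
                return True
            if color.get(dep, WHITE) == WHITE and visit(dep):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in deps)
```

Running this check before dispatching a batch of subtasks turns a silent hang into an immediate, debuggable error.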

Semantic Failure Detection

The hardest failure mode to catch. Strategies:

  • Output schema validation: Reject malformed outputs early
  • Confidence scoring: Model estimates its own uncertainty
  • Critic agents: Dedicated verification agents review outputs
  • Automated testing: For code tasks, run tests and check output
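The first of these strategies — schema validation — can be done without external libraries. The `SCHEMA` shape and field names below are illustrative:

```python
import json

# Minimal schema: required keys and their expected Python types.
SCHEMA = {"answer": str, "confidence": float}

def validate_output(raw: str) -> dict:
    """Parse and validate an agent's JSON output; raise early so a
    malformed result never becomes another agent's trusted input."""
    data = json.loads(raw)
    for key, expected_type in SCHEMA.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"bad type for field: {key}")
    return data
```

In practice a schema library (e.g. Pydantic or JSON Schema) does this job, but the principle is the same: reject at the boundary, not downstream.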

Observability

What to Log

Every agent interaction should emit:

  • Input prompt (or hash for large inputs)
  • Output (or hash)
  • Latency
  • Token count (input + output)
  • Cost
  • Model version
  • Success/failure status
  • Task and parent task IDs
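One structured record per agent interaction covers the list above. The field names here are illustrative, not a standard schema:

```python
import json
import time

def log_agent_call(task_id: str, agent: str, model: str,
                   tokens_in: int, tokens_out: int,
                   latency_ms: float, success: bool,
                   cost_usd: float, parent_task_id: str = "") -> str:
    """Serialize one agent interaction as a JSON log line."""
    record = {
        "ts": time.time(),
        "task_id": task_id,
        "parent_task_id": parent_task_id,
        "agent": agent,
        "model": model,
        "tokens": {"input": tokens_in, "output": tokens_out},
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "success": success,
    }
    return json.dumps(record)  # ship this line to your log pipeline
```

For large prompts and outputs, store a content hash in the record and the payload in blob storage, as noted above.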

Traces, Not Just Logs

Distributed tracing (OpenTelemetry) across agent calls gives you:

  • End-to-end latency breakdown
  • Which agents are bottlenecks
  • Where failures cascade
  • Full replay of any task

Cost Attribution

Multi-agent systems can be expensive to operate. Track cost per:

  • Task type
  • Customer/user
  • Agent role (orchestrator is typically cheap, worker agents expensive)
  • Failure mode (retries cost money)

Production Lessons

1. Simpler architectures first

The orchestrator-subagent pattern solves 80% of use cases. Only add complexity when simpler architectures demonstrably fail.

2. Context window management is critical

Each agent's context window is finite. Design your information architecture so agents receive only what they need. Use summarization liberally.

3. Human escalation paths are essential

For high-stakes tasks, always provide a path to escalate to a human when agent confidence is low or retries are exhausted.

4. Test with adversarial inputs

Multi-agent systems can amplify prompt injection attacks — one agent's malformed output becomes another's trusted input. Test your systems for injection vulnerabilities.

5. Async everything

Long-running agent tasks should be async by default. Synchronous multi-agent calls lead to timeouts, connection drops, and poor user experience.

Conclusion

Multi-agent LLM systems are powerful but add substantial engineering complexity. The most successful production deployments start simple — often a single orchestrator with 2-3 specialized subagents — and add complexity only when it's clearly warranted.

Invest heavily in observability, structured communication, and failure handling before scaling up the number of agents or the task complexity.



Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.