design pattern 2025-01-20 16 min read

Multi-Agent LLM Systems: Architecture Patterns for Production

How to design, build, and operate multi-agent LLM systems at scale — covering orchestration patterns, communication protocols, failure handling, and lessons from production deployments.

Tags: multi-agent LLM, orchestration, agents, architecture, production AI systems

Introduction

Multi-agent LLM systems — where multiple AI models collaborate to complete complex tasks — have moved from research demos to production systems at companies like Cognition (the team behind Devin) and numerous enterprise software providers. Building these systems reliably at scale requires careful architectural thinking.

This post covers the core patterns, failure modes, and engineering practices for multi-agent systems in production.

Why Multi-Agent?

Single-agent architectures hit limits when tasks require:

  • Context beyond a single model's window: A legal review spanning thousands of documents
  • Parallelism: Running simultaneous research threads
  • Specialization: Different models optimized for different subtasks (coding vs. analysis vs. writing)
  • Verification: Independent models checking each other's work
  • Long-horizon tasks: Multi-step plans where early decisions affect later ones

Multi-agent architectures address these by decomposing work across multiple models.

Core Architectural Patterns

1. Orchestrator-Subagent

The most common pattern: one central orchestrator agent plans and delegates to specialized subagents.

User → [Orchestrator]
            ├── [Research Agent]
            ├── [Code Agent]
            └── [Verification Agent]

Orchestrator responsibilities:

  • Decompose the task into subtasks
  • Assign subtasks to appropriate agents
  • Track progress and handle failures
  • Synthesize results into a coherent output

Tradeoffs:

  • Single point of failure (orchestrator)
  • Bottleneck if orchestrator is slow
  • Excellent for tasks with clear hierarchical decomposition

Example: A software engineering agent that delegates to a code writer, a test writer, a debugger, and a documentation writer.
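The delegation loop at the heart of this pattern can be sketched in a few lines. The subagent registry, the hard-coded plan, and the string outputs below are hypothetical stand-ins — in a real system each entry wraps an LLM call and the orchestrator asks a model to produce the plan:

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical subagent registry: role -> callable that "runs" the agent.
SUBAGENTS: Dict[str, Callable[[str], str]] = {
    "research": lambda task: f"research notes for: {task}",
    "code": lambda task: f"code for: {task}",
    "verify": lambda task: f"verified: {task}",
}

def orchestrate(task: str) -> str:
    """Decompose a task, delegate to subagents, and synthesize results."""
    # A real orchestrator would ask a model to plan this decomposition.
    plan: List[Tuple[str, str]] = [("research", task), ("code", task), ("verify", task)]
    results = []
    for role, subtask in plan:
        results.append(SUBAGENTS[role](subtask))
    # Synthesis step: here just a join; in practice another model call.
    return "\n".join(results)
```

Note how the orchestrator is the single place that tracks progress — which is exactly why it is also the single point of failure mentioned above.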

2. Peer-to-Peer (Pipeline)

Agents are arranged as a pipeline, each processing and passing output to the next:

[Agent A: Research] → [Agent B: Draft] → [Agent C: Review] → [Agent D: Final]

Advantages:

  • Simple data flow
  • Easy to reason about state
  • Natural fit for sequential refinement workflows

Disadvantages:

  • No parallelism
  • Error propagation (A's mistake becomes B's input)
  • Latency adds up

Best for: document processing pipelines, code generation + review, content creation workflows.
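A pipeline reduces to function composition over a shared payload. The stages below are hypothetical placeholders for the Research → Draft → Review chain:

```python
from typing import Callable, List

def run_pipeline(stages: List[Callable[[str], str]], payload: str) -> str:
    """Each stage consumes the previous stage's output -- no parallelism,
    and any early mistake is passed downstream unchanged."""
    for stage in stages:
        payload = stage(payload)
    return payload

# Hypothetical stages standing in for Research -> Draft -> Review.
stages = [
    lambda s: s + " | researched",
    lambda s: s + " | drafted",
    lambda s: s + " | reviewed",
]
```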

3. Debate / Adversarial

Multiple agents with competing objectives check each other:

[Agent A: Propose solution]
[Agent B: Critique Agent A's solution]
[Agent A: Defend or revise]
[Judge Agent: Evaluate and decide]

Used for: factual verification, risk assessment, legal/financial analysis. Forces the system to surface assumptions and weaknesses.

Production note: This pattern is expensive (2-3x compute per task). Reserve for high-stakes decisions.
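The debate loop can be expressed as a small control structure. The `propose`, `critique`, and `judge` callables below are hypothetical — each would wrap a model call with its own role prompt:

```python
from typing import Callable

def debate(propose: Callable[[str], str],
           critique: Callable[[str], str],
           judge: Callable[[str, str], str],
           task: str,
           rounds: int = 2) -> str:
    """Propose -> critique -> revise loop, then a judge decides.
    Each extra round adds roughly one full task's worth of compute,
    so keep rounds small."""
    solution = propose(task)
    for _ in range(rounds):
        objections = critique(solution)
        # Revise the proposal in light of the critique.
        solution = propose(f"{task}\nAddress: {objections}")
    return judge(solution, task)
```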

4. Parallel Execution + Aggregation

Independent agents work simultaneously, results are aggregated:

                ┌── [Agent A: Approach 1] ──┐
[Input] ────────┼── [Agent B: Approach 2] ──┼── [Aggregator] → Output
                └── [Agent C: Approach 3] ──┘

Natural fit for: best-of-N generation, ensemble methods, research tasks with multiple dimensions.

Engineering consideration: The aggregator itself is a complexity sink — it needs to handle partial failures, disagreements, and synthesis from heterogeneous outputs.
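A minimal sketch of this pattern with `asyncio` — the three toy agents are hypothetical, and the aggregation here just drops failures, which is the simplest of the strategies an aggregator needs:

```python
import asyncio
from typing import List

async def run_parallel(agents, prompt: str) -> List[str]:
    """Run independent agents concurrently; the aggregator must tolerate
    partial failures, so exceptions are collected rather than raised."""
    results = await asyncio.gather(
        *(agent(prompt) for agent in agents), return_exceptions=True
    )
    # Aggregation step: drop failures, keep successful outputs in order.
    return [r for r in results if not isinstance(r, Exception)]

# Hypothetical agents: two succeed, one fails.
async def agent_a(p: str) -> str: return f"A:{p}"
async def agent_b(p: str) -> str: raise RuntimeError("timeout")
async def agent_c(p: str) -> str: return f"C:{p}"
```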

5. Hierarchical Decomposition

Recursive orchestration for complex tasks:

[Top Orchestrator]
    ├── [Sub-Orchestrator 1]
    │       ├── [Worker A]
    │       └── [Worker B]
    └── [Sub-Orchestrator 2]
            ├── [Worker C]
            └── [Worker D]

Scales to arbitrarily complex tasks but adds significant coordination overhead. Works best with strong task decomposition primitives.

Communication Protocols

Message Format Standards

All agents should communicate via a structured format:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentMessage:
    task_id: str              # unique identifier for the task
    sender: str               # agent ID
    recipient: str            # agent ID or "orchestrator"
    message_type: str         # "request" | "result" | "error" | "update"
    content: dict             # task-specific payload
    metadata: dict = field(default_factory=dict)   # latency, cost, confidence, etc.
    parent_message_id: Optional[str] = None        # for tracing

Standardized formats enable:

  • Logging and observability
  • Replay and debugging
  • Protocol evolution without breaking changes

Shared Memory vs. Message Passing

Shared memory (e.g., a vector store all agents can read/write):

  • Easy to implement
  • Risk of concurrent writes
  • Stale reads
  • Good for: reference data, long-term knowledge

Message passing (each agent only sees its own context):

  • Explicit data flow
  • Better isolation
  • Harder to share large artifacts
  • Good for: task coordination, status updates

Production systems often combine both: message passing for coordination, shared memory for large artifacts.
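The hybrid can be sketched with an inbox queue per agent for coordination and a plain key-value store for artifacts. All names below (`inboxes`, `artifact_store`, the helper functions) are illustrative, not a standard API:

```python
import queue
from typing import Dict

# Message passing for coordination: each agent has its own inbox queue.
inboxes: Dict[str, "queue.Queue[dict]"] = {"worker": queue.Queue()}

# Shared memory for large artifacts: messages carry only keys into it.
artifact_store: Dict[str, str] = {}

def send_artifact(recipient: str, artifact_id: str, data: str) -> None:
    """Write the large payload to shared memory, send a small pointer."""
    artifact_store[artifact_id] = data
    inboxes[recipient].put({"type": "result", "artifact": artifact_id})

def receive(agent: str) -> str:
    """Consume a coordination message and dereference the artifact."""
    msg = inboxes[agent].get()
    return artifact_store[msg["artifact"]]
```

This keeps coordination messages small and explicit while avoiding the cost of copying large artifacts through every hop.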

State Management

The State Problem

Long-running multi-agent tasks accumulate state that must be:

  • Persisted (for recovery from failures)
  • Accessible to the right agents
  • Consistent (no stale reads leading to duplicate work)

State Hierarchy

Task state (hours-long)
    └── Subtask state (minutes)
            └── Agent turn state (seconds)

Use different storage backends for each:

  • Task state: database (Postgres, DynamoDB)
  • Subtask state: Redis or in-memory with checkpointing
  • Agent turn state: in-context (LLM context window)

Checkpoint and Resume

For tasks that may take hours, agents must be able to resume from failure:

from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class TaskCheckpoint:
    task_id: str
    completed_subtasks: List[SubtaskResult]
    pending_subtasks: List[Subtask]
    context_snapshot: str     # compressed context
    created_at: datetime

On agent restart, load the latest checkpoint and resume. This requires idempotent operations — re-running a subtask shouldn't cause side effects.
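A minimal resume loop, using a simplified checkpoint and a hypothetical `run_subtask` callable — the key property is that only pending subtasks run, so loading the same checkpoint twice cannot duplicate completed work:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Checkpoint:
    task_id: str
    completed: List[str] = field(default_factory=list)
    pending: List[str] = field(default_factory=list)

def resume(cp: Checkpoint, run_subtask: Callable[[str], object]) -> Checkpoint:
    """Replay-safe resume: run only pending subtasks, moving each to
    completed as it finishes. run_subtask must itself be idempotent."""
    while cp.pending:
        subtask = cp.pending[0]
        run_subtask(subtask)
        cp.pending.pop(0)
        cp.completed.append(subtask)
        # In production: persist cp to durable storage after every subtask.
    return cp
```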

Failure Handling

Types of Failures

  1. Agent failure: Model returns an error or malformed output
  2. Timeout: Agent takes too long
  3. Deadlock: Agents waiting on each other circularly
  4. Semantic failure: Agent returns valid output that's wrong
  5. Context overflow: Accumulated context exceeds model limits

Retry Strategies

import random
import time

def retry_with_backoff(agent_call, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = agent_call()
            if is_valid(result):      # schema/semantic check
                return result
        except AgentError as e:
            if not is_retryable(e):
                raise
        # Back off before the next attempt, whether the failure was an
        # invalid result or a retryable error; jitter avoids thundering herds
        time.sleep(2 ** attempt + random.random())
    # Fallback: simpler agent or human escalation
    return fallback_handler(agent_call)

Deadlock Detection

In orchestrator-subagent systems, deadlocks occur when:

  • Agent A is waiting for Agent B's result
  • Agent B is waiting for Agent A's result

Prevention: maintain a dependency graph and detect cycles before dispatching. Most hierarchical systems prevent deadlocks by design (parent always waits on children, never vice versa).
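Cycle detection over the dependency graph is a standard DFS with three colors; a "gray" node reached again mid-visit means a cycle. `deps` maps each agent to the agents whose results it waits on:

```python
from typing import Dict, List

def has_cycle(deps: Dict[str, List[str]]) -> bool:
    """Return True if the agent dependency graph contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, in progress, finished
    color = {node: WHITE for node in deps}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for dep in deps.get(node, []):
            if color.get(dep, WHITE) == GRAY:   # back edge -> cycle
                return True
            if color.get(dep, WHITE) == WHITE and visit(dep):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in deps)
```

Running this check before dispatching a batch of subtasks turns a silent hang into an immediate, debuggable error.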

Semantic Failure Detection

The hardest failure mode to catch. Strategies:

  • Output schema validation: Reject malformed outputs early
  • Confidence scoring: Model estimates its own uncertainty
  • Critic agents: Dedicated verification agents review outputs
  • Automated testing: For code tasks, run tests and check output
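The first of these strategies — schema validation — can be done without external libraries. The `SCHEMA` shape and field names below are illustrative:

```python
import json

# Minimal schema: required keys and their expected Python types.
SCHEMA = {"answer": str, "confidence": float}

def validate_output(raw: str) -> dict:
    """Parse and validate an agent's JSON output; raise early so a
    malformed result never becomes another agent's trusted input."""
    data = json.loads(raw)
    for key, expected_type in SCHEMA.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"bad type for field: {key}")
    return data
```

In practice a schema library (e.g. Pydantic or JSON Schema) does this job, but the principle is the same: reject at the boundary, not downstream.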

Observability

What to Log

Every agent interaction should emit:

  • Input prompt (or hash for large inputs)
  • Output (or hash)
  • Latency
  • Token count (input + output)
  • Cost
  • Model version
  • Success/failure status
  • Task and parent task IDs
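One structured record per agent interaction covers the list above. The field names here are illustrative, not a standard schema:

```python
import json
import time

def log_agent_call(task_id: str, agent: str, model: str,
                   tokens_in: int, tokens_out: int,
                   latency_ms: float, success: bool,
                   cost_usd: float, parent_task_id: str = "") -> str:
    """Serialize one agent interaction as a JSON log line."""
    record = {
        "ts": time.time(),
        "task_id": task_id,
        "parent_task_id": parent_task_id,
        "agent": agent,
        "model": model,
        "tokens": {"input": tokens_in, "output": tokens_out},
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "success": success,
    }
    return json.dumps(record)  # ship this line to your log pipeline
```

For large prompts and outputs, store a content hash in the record and the payload in blob storage, as noted above.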

Traces, Not Just Logs

Distributed tracing (OpenTelemetry) across agent calls gives you:

  • End-to-end latency breakdown
  • Which agents are bottlenecks
  • Where failures cascade
  • Full replay of any task

Cost Attribution

Multi-agent systems can be expensive to operate. Track cost per:

  • Task type
  • Customer/user
  • Agent role (orchestrator is typically cheap, worker agents expensive)
  • Failure mode (retries cost money)

Production Lessons

1. Simpler architectures first

The orchestrator-subagent pattern solves 80% of use cases. Only add complexity when simpler architectures demonstrably fail.

2. Context window management is critical

Each agent's context window is finite. Design your information architecture so agents receive only what they need. Use summarization liberally.

3. Human escalation paths are essential

For high-stakes tasks, always provide a path to escalate to a human when agent confidence is low or retries are exhausted.

4. Test with adversarial inputs

Multi-agent systems can amplify prompt injection attacks — one agent's malformed output becomes another's trusted input. Test your systems for injection vulnerabilities.

5. Async everything

Long-running agent tasks should be async by default. Synchronous multi-agent calls lead to timeouts, connection drops, and poor user experience.

Conclusion

Multi-agent LLM systems are powerful but add substantial engineering complexity. The most successful production deployments start simple — often a single orchestrator with 2-3 specialized subagents — and add complexity only when it's clearly warranted.

Invest heavily in observability, structured communication, and failure handling before scaling up the number of agents or the task complexity.



Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.