Building AI Agents
IC4 · IC5 · IC6

Planner–Executor & Multi-Agent Orchestration

When one ReAct loop isn't enough: plan once then execute, fan work out from an orchestrator to workers, and hand off between specialist agents. Plus the honest cost math on multi-agent systems and the senior heuristic for when they actually help.

21 min read · 11 sections
Prerequisites: react-agent, function calling / tool use
Runnable: ai-eng-wiki/examples/agents/react_agent.py

1. Quick anchor

The ReAct loop re-decides the next action after every observation. That is the right shape for genuinely open-ended tasks, but it has two costs: a model call per step, and a single context window that grows monotonically as the task runs. Two patterns relax those constraints.

Plan–execute front-loads the thinking: one model call produces a plan (an ordered list of steps), then a cheap executor runs the steps, calling the model again only when a step needs judgment. Fewer model calls, lower latency, and the plan is an artifact you can inspect, cache, and resume.

Multi-agent orchestration splits the context: an orchestrator decomposes the task and delegates sub-tasks to worker agents, each with its own clean context window and its own tools, then synthesizes their results. This is how you parallelize, how you keep a 200-step task from drowning in its own history, and how specialist agents (a reviewer, a researcher, a coder) compose.

Both are still loops. You are not learning new magic — you are learning where to put the loop boundaries.

2. Why interviewers probe this

This is the IC5 dividing line. IC3/IC4 can build a loop; the senior signal is choosing the topology and defending it on cost, latency, reliability, and context. Interviewers push on:

  • Do you reach for multi-agent reflexively (a red flag) or only when the task is parallelizable and context-separable (the senior answer)?
  • Can you quantify the cost? Multi-agent systems can burn ~15× the tokens of a single chat (Anthropic's multi-agent research write-up). If you don't know that order of magnitude, you haven't run one.
  • Can you draw a real topology — orchestrator, workers, handoffs, shared vs isolated context — for a concrete system (a coding agent, a research agent) and justify every edge?

3. Concept build-up

Beginner explainer: new here? Start here.

The words first.

  • Plan–execute — decide all steps at once (one model call), then run them cheaply without re-deciding.
  • Orchestrator — the agent that breaks a big task into smaller pieces (sub-tasks) and sends them to workers.
  • Worker — an agent that takes one sub-task, runs its own loop, and returns a clean answer to the orchestrator.
  • Handoff — one agent passes control and the conversation to another agent; unlike an ordinary function call, control does not automatically come back.
  • Context window — the text a model can see in a single call; splitting into agents gives each agent a fresh, empty one.
  • Token multiplier — multi-agent systems use roughly 15× the tokens of a plain chat interaction (Anthropic's reported figure).

Step by step.

  1. Choose your shape: ReAct (one loop, adapts each step), plan-execute (one plan, cheap execution), or multi-agent (many loops, isolated context).
  2. Ask: are the sub-tasks independent (can they run in parallel), or does each depend on the result of the last?
  3. If independent and parallelizable, sketch the orchestrator–worker topology (who decides, who works, how results merge).
  4. If sequential but predictable, a static plan saves tokens; if sequential and uncertain, stay with ReAct.
  5. Count the tokens: multi-agent runs at roughly 15× the tokens of a plain chat; that is only worth paying if you gain parallelism or cut context bloat.
  6. Design the handoff (if any): which agent calls which, and what context moves with it?

Remember this: one agent with the right tools beats many agents, unless you're parallelizing independent work or splitting to keep context clean.
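The decision steps above can be written down as a tiny function. This is only a sketch of the heuristic: the function name, the two boolean inputs, and the returned labels are illustrative assumptions, and a real decision would also weigh cost and context size.

```python
def choose_topology(independent: bool, predictable: bool) -> str:
    """Map the two questions from the steps above to a starting topology.

    independent: sub-tasks do not need each other's results (can run in parallel).
    predictable: the sequence of steps is known before any tool runs.
    """
    if independent:
        # Parallelizable work: fan out to workers with isolated context.
        return "orchestrator-workers"
    if predictable:
        # Sequential but known in advance: plan once, execute cheaply.
        return "plan-execute"
    # Sequential and uncertain: re-decide after every observation.
    return "react"
```

Treat the result as a starting point; the token budget in step 5 can still veto the multi-agent branch.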

3.1 Plan–execute

The plan is just structured output. Use the model's structured-output mode so the plan is guaranteed parseable, not vibes-in-prose:

```python
import json

from anthropic import Anthropic

client = Anthropic()

# JSON Schema for a plan: an ordered list of {tool, arg} steps.
PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "tool": {"type": "string", "enum": ["search", "fetch", "summarize"]},
                    "arg": {"type": "string"},
                },
                "required": ["tool", "arg"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["steps"],
    "additionalProperties": False,
}

def make_plan(task: str) -> list[dict]:
    # One model call; structured output constrains the reply to PLAN_SCHEMA.
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Make a step-by-step tool plan for: {task}"}],
        output_config={"format": {"type": "json_schema", "schema": PLAN_SCHEMA}},
    )
    # The reply text is a JSON document matching the schema; return just the steps.
    return json.loads(resp.content[0].text)["steps"]
```
Plan–execute: one call to decide, cheap execution after

The schema defines what a plan looks like: an array of steps, each with a tool name and an argument. The make_plan() function asks the model for one plan and, because of the structured-output mode, gets back JSON that conforms to the schema. The execute() function shown after the example then loops through the steps and runs each one, with no model call and no re-deciding.

Work through an example: task is "research AI agents and write a summary." The model returns:

{"steps": [
  {"tool": "search", "arg": "AI agent architecture"},
  {"tool": "fetch", "arg": "https://anthropic.com/.../agents"},
  {"tool": "summarize", "arg": "...content..."}
]}

Then execute() runs: search("AI agent architecture") → got 3 URLs; fetch("https://...") → got 5000 chars of text; summarize(text) → got a 200-word summary. One model call to plan, three tool calls to execute — much cheaper than asking the model after each search "should I fetch something?" That's the win.

```python
def execute(steps: list[dict]) -> list[str]:
    out = []
    for step in steps:
        # cheap, deterministic — no model call per step
        out.append(run_tool(step["tool"], step["arg"]))
    return out
```
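The lesson never defines run_tool, so here is one possible shape for it: a registry that maps the tool names allowed by the plan schema to plain functions. The three tool bodies are stubs invented for illustration, and the TOOLS name is an assumption, not part of any library.

```python
from typing import Callable

# Stub tools standing in for real search/fetch/summarize implementations (hypothetical).
def search(query: str) -> str:
    return f"results for {query!r}"

def fetch(url: str) -> str:
    return f"contents of {url}"

def summarize(text: str) -> str:
    return text[:40]

# Registry keyed by the same names the plan schema allows in its "tool" enum.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": search,
    "fetch": fetch,
    "summarize": summarize,
}

def run_tool(name: str, arg: str) -> str:
    # Fail loudly on an unknown tool instead of silently skipping the step.
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](arg)

def execute(steps: list[dict]) -> list[str]:
    # Same loop as above: one tool call per step, no model call.
    return [run_tool(step["tool"], step["arg"]) for step in steps]
```

Keeping the registry keys identical to the schema's enum means a plan that passes validation cannot name a tool the executor lacks.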

 
The win: one model call for the plan instead of one per step. The risk: if step 3 depends on what step 2 returned, a static plan can't adapt. Two fixes, in increasing power:
 
  1. Re-plan on failure. Wrap execution in a check; if a step fails or returns something unexpected, call make_plan again with the new context. This is close in spirit to the evaluator–optimizer workflow pattern.
  2. Hierarchical plan. The plan's steps are themselves small ReAct agents. You get adaptivity within a step and structure across steps.
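The first fix, re-planning on failure, might look like the sketch below. It assumes a failed step surfaces as an exception and that the planner accepts free-text context; the function name, the max_replans parameter, and the context format are all illustrative choices, not an established API.

```python
from typing import Callable

Plan = list[dict]

def execute_with_replan(
    task: str,
    make_plan: Callable[[str], Plan],
    run_tool: Callable[[str, str], str],
    max_replans: int = 2,
) -> list[str]:
    """Run a static plan; on a failed step, ask the planner again with the failure as context."""
    context = task
    for _attempt in range(max_replans + 1):
        outputs: list[str] = []
        try:
            for step in make_plan(context):
                outputs.append(run_tool(step["tool"], step["arg"]))
            return outputs
        except Exception as err:
            # Feed what happened back to the planner so the next plan can route around it.
            context = f"{task}\nPrevious attempt failed: {err}\nCompleted so far: {outputs}"
    raise RuntimeError(f"gave up after {max_replans} re-plans")
```

Note that this version restarts execution from the first step of the new plan. Reusing completed outputs is possible, but it requires the planner to know which steps it may skip.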
 
Notice the convergence: add enough re-planning to plan-execute and it becomes ReAct; remove enough adaptivity from ReAct and it becomes a workflow. These are points on a spectrum from fully predefined to fully dynamic, and the senior move is to sit as far toward "predefined" as the task allows.
 
3.2 Orchestrator–workers
 
When sub-tasks are independent, run them in parallel with isolated context. The orchestrator's job is decompose → dispatch → synthesize:
 
```python
import concurrent.futures as cf
 
def orchestrate(task: str) -> str:
    subtasks = decompose(task)                       # 1 model call → ["research X", "research Y", ...]
    with cf.ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(worker_agent, subtasks))   # each worker: its own ReAct loop + context
    return synthesize(task, results)                 # 1 model call to merge
```
Orchestrator–workers: decompose once, work in parallel, synthesize once

The orchestrator calls decompose(task) once to split the big job into small, independent pieces. Then it hands the subtasks to a pool of up to 4 threads, each running a full worker_agent() on a different subtask. Every worker has its own fresh context window and its own ReAct loop (it can call tools, get results, re-decide). Once all workers finish, the orchestrator calls synthesize() to merge the results into one answer.

Work through an example: task is "analyze three different AI frameworks." The orchestrator gets three subtasks: ["analyze PyTorch agent libs", "analyze LangChain", "analyze OpenAI SDK"]. Worker 1 runs ReAct on PyTorch (reads docs, tests code, writes notes). Worker 2 runs ReAct on LangChain. Worker 3 runs ReAct on OpenAI SDK. They run in parallel: if each takes 30 seconds, the worker phase takes ~30 seconds, not 90. Then synthesize merges their notes: "PyTorch is best for X, LangChain for Y, OpenAI SDK for Z." The orchestrator never saw worker 1's 50 intermediate tool calls, only the clean summary. That context isolation is the whole payoff.
 
Each worker_agent is a full agent (the loop from the last lesson) with a fresh context window. That isolation is the entire point: the orchestrator never sees the workers' intermediate tool spam, only their distilled results, so the orchestrator's context stays small even when the total work is huge. Claude Code's "Explore" sub-agents follow the same pattern, and they run on a cheaper model (Haiku) because exploration is bounded, which keeps the main loop's expensive model focused.
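To make the isolation concrete, here is a runnable version of the orchestrator with stubbed-out pieces. The bodies of decompose, worker_agent, and synthesize are placeholders invented for this sketch (a real system would make model calls in each); what matters is that each worker's history never leaves the worker.

```python
import concurrent.futures as cf

def decompose(task: str) -> list[str]:
    # Stub for the single model call that splits the task (hypothetical output).
    return [f"{task}: part {i}" for i in range(1, 4)]

def worker_agent(subtask: str) -> str:
    # Each worker builds its own message history from scratch: nothing is shared.
    history = [f"task: {subtask}"]
    for step in range(3):
        history.append(f"tool call {step} output ...")   # stays inside the worker
    # Only a distilled result crosses back to the orchestrator.
    return f"summary of {subtask} ({len(history) - 1} tool calls)"

def synthesize(task: str, results: list[str]) -> str:
    # Stub for the single model call that reconciles worker results.
    return f"{task} -> " + "; ".join(results)

def orchestrate(task: str) -> str:
    subtasks = decompose(task)
    with cf.ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(worker_agent, subtasks))  # map preserves input order
    return synthesize(task, results)
```

Threads are a reasonable fit here because a real worker spends most of its time waiting on network calls to the model and tools.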
 
3.3 Handoffs
 
A handoff is an agent transferring control to another agent; to the model, the handoff is exposed as a tool it can call. The OpenAI Agents SDK makes this a first-class primitive: a triage agent routes to a billing agent or a support agent, and control (with the conversation) moves over. Handoffs model routing between specialists where only one agent is active at a time. Contrast this with orchestrator–workers, where many agents run concurrently and report back.
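A handoff can be sketched in plain Python without any SDK. Everything below is hypothetical: the Conversation class, the keyword-based triage, and the two specialist functions stand in for real agents. It does not use the OpenAI Agents SDK API. The sketch shows two habits: carrying the user's intent explicitly, and logging every transfer.

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    user_intent: str                      # carried explicitly across every handoff
    messages: list[str] = field(default_factory=list)
    transfers: list[str] = field(default_factory=list)   # audit trail of handoffs

def billing_agent(conv: Conversation) -> str:
    return f"[billing] handling: {conv.user_intent}"

def support_agent(conv: Conversation) -> str:
    return f"[support] handling: {conv.user_intent}"

SPECIALISTS = {"billing": billing_agent, "support": support_agent}

def triage(conv: Conversation) -> str:
    # A real triage agent would ask a model to pick the route; a keyword check stands in here.
    route = "billing" if "refund" in conv.user_intent.lower() else "support"
    conv.transfers.append(f"triage -> {route}")   # log the transfer
    # Control moves to the specialist; triage does no further work.
    return SPECIALISTS[route](conv)
```

Only one specialist runs per request, which is the difference from the concurrent fan-out in orchestrator–workers.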
 
| Pattern | Concurrency | Context | Use when |
| --- | --- | --- | --- |
| **Plan–execute** | sequential | shared | Steps are predictable; you want few model calls. |
| **Orchestrator–workers** | parallel | isolated per worker | Sub-tasks are independent and you want to parallelize / keep context small. |
| **Handoff / routing** | one active at a time | transferred | Specialist agents; the right one depends on the input. |
| **ReAct (single)** | sequential | shared | Open-ended, each step depends on the last, context fits. |
 
[Interactive illustration: the agent loop. An agent reasons, calls a tool, observes the result, and loops until it can answer (the ReAct cycle).]

4. The honest cost math

Multi-agent is not free, and saying so is a senior signal. Three taxes:

  1. Token multiplier. Every worker re-reads its instructions and tools; the orchestrator reads every worker's output. Anthropic reported their multi-agent research system used ~15× the tokens of a normal chat. Budget for it.
  2. Coordination failure. More agents = more places for a handoff to drop context, a worker to misunderstand its sub-task, or results to conflict. Reliability is the product of each agent's reliability — five 95%-reliable agents in series is ~77%.
  3. Latency floor. Parallel workers help, but the orchestrator's decompose and synthesize calls are serial bookends you can't remove.
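The reliability and latency taxes are simple arithmetic, sketched below. The reliability function assumes agents run in series and fail independently, which is what the "product" rule implies. The 5-second decompose and synthesize times in the latency example are made-up numbers.

```python
import math

def chain_reliability(per_agent: list[float]) -> float:
    # Agents in series, failures assumed independent: success rates multiply.
    return math.prod(per_agent)

def wall_clock(worker_seconds: list[float], decompose_s: float, synthesize_s: float) -> float:
    # Workers overlap, so the parallel phase costs as much as the slowest worker;
    # the decompose and synthesize calls run before and after it and cannot overlap.
    return decompose_s + max(worker_seconds) + synthesize_s

print(round(chain_reliability([0.95] * 5), 3))                   # 0.774
print(wall_clock([30, 30, 30], decompose_s=5, synthesize_s=5))   # 40
```

Adding more workers never reduces the second number below the slowest worker plus the two serial calls, which is the latency floor.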

So the heuristic: reach for multi-agent only when the task is (a) genuinely parallelizable or (b) context-separable into specialists — and the value clears the ~10–15× token bar. A single agent with good tools beats a multi-agent system for most tasks. "I'd start with one agent and split only where I measure a context or parallelism win" is the answer that lands.

The token multiplier: why 15×?

The 15× figure is an empirical observation from Anthropic's write-up, not a constant you can derive, but the mechanism behind it is easy to see. Every worker re-reads its own instructions and tool definitions (maybe 2,000 tokens each), and the orchestrator reads every worker's output (maybe 5 workers × 1,000 tokens each = 5,000). A single agent pays for the tool definitions once and produces one output; a 5-worker system pays for five copies of the tool definitions, five worker loops, and the orchestrator's reads on top. With those illustrative numbers, the fixed overhead alone is 5 × 2,000 + 5,000 = 15,000 tokens before any worker has done useful work. The real multiplier depends on how many turns each worker runs, so treat 15× as an order of magnitude.

So: if your task is "find one fact," multi-agent loses, because you pay all of that overhead for work a single call could do. If your task is "parallelize 5 independent analyses," the worker phase can run up to 5× faster in wall-clock time (30 seconds vs 150), somewhat less once the serial decompose and synthesize calls are counted, and the token cost may be worth the speed. If your task is "keep context under control on a 500-step task," the isolation win (no "lost in the middle" problem) can be worth the tokens. Only cross that 15× bar if you measure a real win.
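A back-of-envelope token model makes the "find one fact" case concrete. It reuses the 2,000-token tool definitions and 1,000-token worker outputs mentioned above; the 500 tokens of actual work is an invented figure. The model is deliberately crude (one call per agent, no growing history), so it illustrates the mechanism and is not meant to reproduce the 15× figure.

```python
def single_agent_tokens(tool_def_tokens: int, work_tokens: int) -> int:
    # One agent: read the tool definitions once, then do the work.
    return tool_def_tokens + work_tokens

def multi_agent_tokens(
    workers: int,
    tool_def_tokens: int,
    work_tokens_per_worker: int,
    output_tokens: int,
) -> int:
    # Every worker pays for its own copy of the tool definitions plus its work...
    worker_total = workers * (tool_def_tokens + work_tokens_per_worker)
    # ...and the orchestrator pays to read each worker's output.
    orchestrator_reads = workers * output_tokens
    return worker_total + orchestrator_reads

single = single_agent_tokens(tool_def_tokens=2000, work_tokens=500)
multi = multi_agent_tokens(workers=5, tool_def_tokens=2000,
                           work_tokens_per_worker=500, output_tokens=1000)
print(single, multi, multi / single)   # 2500 17500 7.0
```

Even in this simplified model, fanning a small lookup out to five workers costs several times what one agent would spend, with no parallelism or isolation benefit to show for it.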

5. Production tradeoffs

  • Context isolation is the real prize. Even without parallelism, splitting a long task into sub-agents keeps each context window small and on-topic — which improves quality (less "lost in the middle") and cost (less to re-read each turn).
  • Pick the worker model per task. Bounded, well-specified sub-tasks (search, extract, lint) run fine on Haiku/Sonnet; reserve Opus for the orchestrator's judgment calls. This is the single biggest cost lever.
  • Make handoffs explicit and logged. A handoff that silently drops the user's original intent is the classic multi-agent bug; pass intent forward explicitly and trace every transfer.
  • Synthesize, don't concatenate. The orchestrator's final call should reconcile worker outputs (resolve conflicts, dedupe), not staple them together.
  • Durability. A 30-minute multi-agent run that dies at minute 28 is a disaster if you restart from zero. Checkpoint the plan and completed steps so you can resume — durable execution is its own topic in Workflows.
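For the durability point, a minimal checkpoint-and-resume loop might look like this. It assumes the plan is the same list of steps on every run and that results are JSON-serializable strings. The function name and the checkpoint file layout are illustrative, and a production system would more likely use a durable-execution framework.

```python
import json
from pathlib import Path
from typing import Callable

def run_with_checkpoint(
    steps: list[dict],
    run_tool: Callable[[str, str], str],
    path: Path,
) -> list[str]:
    """Execute plan steps, persisting results after each one so a crash can resume."""
    # Load results from a previous run if a checkpoint exists.
    done: list[str] = json.loads(path.read_text()) if path.exists() else []
    for step in steps[len(done):]:          # skip steps that already completed
        done.append(run_tool(step["tool"], step["arg"]))
        path.write_text(json.dumps(done))   # checkpoint after every successful step
    return done
```

Calling the function again with the same arguments after a crash picks up at the first unfinished step. The write is not atomic in this sketch; writing to a temporary file and renaming it would protect against a crash mid-write.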

6. How it's asked

[IC4] "Plan-execute vs ReAct — what does planning up front buy, and what does it cost?" Buys: far fewer model calls (one plan vs one-per-step), lower latency, and an inspectable/cacheable/resumable plan artifact. Costs: the plan is static, so if a later step depends on an earlier step's result, the plan can be wrong — you pay for re-planning or hierarchical steps to recover adaptivity. Pick plan-execute for predictable pipelines, ReAct for open-ended tasks.
[IC5] "When does multi-agent beat a single agent, and when is it a token bonfire?" Beats when sub-tasks are independent (parallelize for latency) or the task is context-separable into specialists (isolation improves quality and keeps context small) — e.g. "research these 8 topics in parallel," or "one agent writes code, another reviews it." It's a bonfire when the task is inherently sequential and shares context: you pay the ~15× token multiplier and the coordination-failure tax for nothing. Default to one agent; split only where you can name the parallelism or isolation win.
[IC6] "Design the agent topology for an AI coding assistant editing a large repo." Main loop: a ReAct coding agent with edit/read/bash/test tools and the user's goal. Fan out read-only exploration to parallel sub-agents (find call sites, read related files) on a cheaper model — they return distilled summaries so the main context isn't flooded with file dumps. Keep writes serial and in the main agent (parallel edits conflict). Gate hard-to-reverse actions (push, deploy) behind confirmation. Add an evaluator step: run tests after each change; on failure, feed the diff + failure back as an observation (ReAct recovery). Checkpoint after each successful edit for durability. This is essentially the Claude Code / Codex harness — covered in depth in Harness Engineering.

7. Pitfalls & flashcards

  • Reaching for multi-agent by default. It's the exception, not the rule. Justify it with parallelism or context isolation.
  • A static plan with data-dependent steps — silently wrong. Add re-planning or make steps mini-agents.
  • Handoffs that drop the original intent. Pass it forward explicitly; log every transfer.
  • All workers on the flagship model. Match model to sub-task difficulty — the orchestrator gets Opus, the lint worker gets Haiku.
  • No checkpointing on long runs — one failure wastes the whole run.

Flashcard. Plan-execute = decide once, execute cheap (predictable tasks). Orchestrator-workers = parallel + isolated context (independent sub-tasks). Handoff = route to a specialist (one active at a time). Multi-agent costs ~10–15× tokens — use it only when parallelizable or context-separable.

8. Further reading

Primary sources