When one ReAct loop isn't enough — plan once then execute, fan work out to an orchestrator and workers, and hand off between specialist agents. Plus the honest cost math on multi-agent systems and the senior heuristic for when they actually help.
`ai-eng-wiki/examples/agents/react_agent.py`
The ReAct loop re-decides the next action after every observation. That is the right shape for genuinely open-ended tasks, but it has two costs: a model call per step, and a single context window that grows monotonically as the task runs. Two patterns relax those constraints.
Plan–execute front-loads the thinking: one model call produces a plan (an ordered list of steps), then a cheap executor runs the steps, calling the model again only when a step needs judgment. Fewer model calls, lower latency, and the plan is an artifact you can inspect, cache, and resume.
Multi-agent orchestration splits the context: an orchestrator decomposes the task and delegates sub-tasks to worker agents, each with its own clean context window and its own tools, then synthesizes their results. This is how you parallelize, how you keep a 200-step task from drowning in its own history, and how specialist agents (a reviewer, a researcher, a coder) compose.
Both are still loops. You are not learning new magic — you are learning where to put the loop boundaries.
This is the IC5 dividing line. IC3/IC4 can build a loop; the senior signal is choosing the topology and defending it on cost, latency, reliability, and context. Interviewers push on:
The words first.
Step by step.
Remember this: one agent with the right tools beats many agents, unless you're parallelizing independent work or splitting to keep context clean.
### 3.1 Plan–execute
The plan is just structured output. Use the model's structured-output mode so the plan is guaranteed parseable, not vibes-in-prose:
```python
import json

from anthropic import Anthropic

client = Anthropic()

# JSON Schema for a plan: an ordered list of {tool, arg} steps.
PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "tool": {"type": "string", "enum": ["search", "fetch", "summarize"]},
                    "arg": {"type": "string"},
                },
                "required": ["tool", "arg"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["steps"],
    "additionalProperties": False,
}


def make_plan(task: str) -> list[dict]:
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Make a step-by-step tool plan for: {task}"}],
        output_config={"format": {"type": "json_schema", "schema": PLAN_SCHEMA}},
    )
    return json.loads(resp.content[0].text)["steps"]


def execute(steps: list[dict]) -> list[str]:
    out = []
    for step in steps:
        # cheap, deterministic: no model call per step
        # (run_tool is assumed to be defined elsewhere)
        out.append(run_tool(step["tool"], step["arg"]))
    return out
```
The schema defines what a plan looks like: an array of steps, each with a tool name and an argument. `make_plan()` asks the model for one such plan as schema-valid JSON. `execute()` then loops through the steps and runs each one: no model call, no re-deciding, just run the tool.
Work through an example: the task is "research AI agents and write a summary." The model returns:
```json
{"steps": [
  {"tool": "search", "arg": "AI agent architecture"},
  {"tool": "fetch", "arg": "https://anthropic.com/.../agents"},
  {"tool": "summarize", "arg": "...content..."}
]}
```
Then `execute()` runs: `search("AI agent architecture")` → got 3 URLs; `fetch("https://...")` → got 5000 chars of text; `summarize(text)` → got a 200-word summary. One model call to plan, three tool calls to execute, which is much cheaper than asking the model after each search "should I fetch something?" That's the win.
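`execute()` relies on a `run_tool` helper that the lesson does not show. One way it might look, as a minimal sketch: a plain dictionary registry keyed by the tool names the plan schema allows. The `TOOLS` table and the three stub tool bodies are hypothetical placeholders, not part of the original example.

```python
from typing import Callable

# Hypothetical stub tools; real ones would call a search API, an HTTP client, etc.
def search(query: str) -> str:
    return f"results for {query!r}"

def fetch(url: str) -> str:
    return f"contents of {url}"

def summarize(text: str) -> str:
    return text[:40]

# Registry keyed by the same names the plan schema's enum allows.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": search,
    "fetch": fetch,
    "summarize": summarize,
}

def run_tool(name: str, arg: str) -> str:
    """Dispatch one plan step to its tool; unknown names fail loudly."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](arg)

def execute(steps: list[dict]) -> list[str]:
    # Same loop as in the lesson: no model call per step.
    return [run_tool(step["tool"], step["arg"]) for step in steps]
```

Failing loudly on an unknown tool name is a design choice in this sketch: the schema's `enum` should already rule it out, so hitting that branch means the plan and the registry have drifted apart.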
The win: one model call for the plan instead of one per step. The risk: if step 3 depends on what step 2 *returned*, a static plan can't adapt. Two fixes, in increasing power:
1. **Re-plan on failure** — wrap execution in a check; if a step fails or returns something unexpected, call `make_plan` again with the new context. This is the **evaluator–optimizer** workflow pattern.
2. **Hierarchical plan** — the plan's steps are themselves small ReAct agents. You get adaptivity *within* a step and structure *across* steps.
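The first fix can be sketched as a loop around execution. This is a sketch under assumptions: `plan_fn` stands in for `make_plan`, `run_fn` stands in for `run_tool`, and both the `max_replans` cap and the rule that "failure" means a raised exception are illustrative choices the lesson does not specify.

```python
from typing import Callable

Step = dict[str, str]

def execute_with_replan(
    task: str,
    plan_fn: Callable[[str], list[Step]],   # stand-in for make_plan
    run_fn: Callable[[str, str], str],      # stand-in for run_tool
    max_replans: int = 2,                   # hypothetical cap to bound cost
) -> list[str]:
    context = task
    for _ in range(max_replans + 1):
        results: list[str] = []
        try:
            for step in plan_fn(context):
                results.append(run_fn(step["tool"], step["arg"]))
            return results  # every step succeeded
        except Exception as err:
            # Feed the failure back so the next plan can route around it.
            context = (
                f"{task}\nPrevious attempt failed after "
                f"{len(results)} steps: {err}"
            )
    raise RuntimeError(f"gave up after {max_replans} re-plans")
```

On the happy path this costs exactly one planning call, the same as static plan-execute; each failure adds one more planning call, and the cap keeps the loop from drifting into a model call per step.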
Notice the convergence: add enough re-planning to plan-execute and it becomes ReAct; remove enough adaptivity from ReAct and it becomes a workflow. These are points on a spectrum from *fully predefined* to *fully dynamic*, and the senior move is to sit as far toward "predefined" as the task allows.
### 3.2 Orchestrator–workers
When sub-tasks are **independent**, run them in parallel with isolated context. The orchestrator's job is decompose → dispatch → synthesize:
```python
import concurrent.futures as cf


def orchestrate(task: str) -> str:
    subtasks = decompose(task)  # 1 model call → ["research X", "research Y", ...]
    with cf.ThreadPoolExecutor(max_workers=4) as pool:
        # each worker: its own ReAct loop + context
        results = list(pool.map(worker_agent, subtasks))
    return synthesize(task, results)  # 1 model call to merge
```
The orchestrator calls `decompose(task)` once to split the big job into small, independent pieces. It then runs up to 4 threads, each executing a full `worker_agent()` on a different subtask; each worker has its own fresh context window and its own ReAct loop (it can call tools, get results, re-decide). Once all workers finish, the orchestrator calls `synthesize()` to merge the results into one answer.
Work through an example: the task is "analyze three different AI frameworks." The orchestrator gets three subtasks: ["analyze PyTorch agent libs", "analyze LangChain", "analyze OpenAI SDK"]. Worker 1 runs ReAct on PyTorch (reads docs, tests code, writes notes). Worker 2 runs ReAct on LangChain. Worker 3 runs ReAct on the OpenAI SDK. They run in parallel: if each takes 30 seconds, the whole thing takes ~30 seconds, not 90. Then `synthesize` merges their notes: "PyTorch is best for X, LangChain for Y, OpenAI SDK for Z." The orchestrator never saw worker 1's 50 intermediate tool calls, only the clean summary. That context isolation is the whole payoff.
Each `worker_agent` is a full agent (the loop from the last lesson) with a **fresh context window**. That isolation is the entire point: the orchestrator never sees the workers' intermediate tool spam, only their distilled results, so the orchestrator's context stays small even when the total work is huge. This is exactly how Claude Code's "Explore" sub-agents work — and they run on a *cheaper model* (Haiku) because exploration is bounded, keeping the main loop's expensive model focused.
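To see the shape end to end without any model calls, here is a sketch with stub implementations. `decompose`, `worker_agent`, and `synthesize` are placeholders (real versions would each call a model), and the `sleep` only simulates worker latency.

```python
import concurrent.futures as cf
import time

def decompose(task: str) -> list[str]:
    # Placeholder: a real version is one model call.
    return [f"{task}: part {i}" for i in range(1, 4)]

def worker_agent(subtask: str) -> str:
    # Placeholder for a full ReAct loop with its own context.
    scratch = [f"tool call {n} for {subtask}" for n in range(50)]  # stays local
    time.sleep(0.2)  # simulated work
    return f"summary of {subtask} ({len(scratch)} tool calls)"

def synthesize(task: str, results: list[str]) -> str:
    # Placeholder: a real version is one model call over the summaries only.
    return f"{task} -> " + "; ".join(results)

def orchestrate(task: str) -> str:
    subtasks = decompose(task)
    with cf.ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map returns results in the same order as subtasks
        results = list(pool.map(worker_agent, subtasks))
    return synthesize(task, results)

start = time.perf_counter()
answer = orchestrate("compare frameworks")
elapsed = time.perf_counter() - start
```

The `scratch` list plays the role of each worker's intermediate tool traffic: it never leaves `worker_agent`, so `synthesize` only ever sees three short summaries. With three workers sleeping concurrently, `elapsed` should be close to one worker's latency rather than the sum of all three.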
### 3.3 Handoffs
A **handoff** is an agent calling another agent *as a tool* and transferring control. The OpenAI Agents SDK makes this a first-class primitive: a triage agent routes to a billing agent or a support agent; control (and the conversation) moves over. Handoffs model **routing between specialists** where only one agent is active at a time — contrast with orchestrator–workers, where many run concurrently and report back.
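A library-free sketch of the idea, not the OpenAI Agents SDK API: each agent is a function that either answers or names the agent to hand off to, and a small runner keeps exactly one agent active at a time. The agent names, the keyword routing rule, and the `max_hops` guard are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Turn:
    reply: Optional[str] = None       # final answer, if this agent handled it
    handoff_to: Optional[str] = None  # otherwise, the agent to transfer control to

def triage(message: str) -> Turn:
    # Stand-in for a model deciding which specialist should take over.
    text = message.lower()
    if "invoice" in text or "refund" in text:
        return Turn(handoff_to="billing")
    return Turn(handoff_to="support")

def billing(message: str) -> Turn:
    return Turn(reply=f"[billing] looking into: {message}")

def support(message: str) -> Turn:
    return Turn(reply=f"[support] looking into: {message}")

AGENTS: dict[str, Callable[[str], Turn]] = {
    "triage": triage,
    "billing": billing,
    "support": support,
}

def run(message: str, start: str = "triage", max_hops: int = 3) -> str:
    active = start
    for _ in range(max_hops):
        turn = AGENTS[active](message)  # exactly one agent is active per hop
        if turn.reply is not None:
            return turn.reply
        if turn.handoff_to is None:
            raise RuntimeError(f"{active} neither replied nor handed off")
        active = turn.handoff_to        # control (and the message) moves over
    raise RuntimeError("too many handoffs")
```

Calling `run("I need a refund")` routes through `triage` to `billing`; the hop limit guards against two agents handing a conversation back and forth indefinitely.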
| Pattern | Concurrency | Context | Use when |
| --- | --- | --- | --- |
| **Plan–execute** | sequential | shared | Steps are predictable; you want few model calls. |
| **Orchestrator–workers** | parallel | isolated per worker | Sub-tasks are independent and you want to parallelize / keep context small. |
| **Handoff / routing** | one active at a time | transferred | Specialist agents; the right one depends on the input. |
| **ReAct (single)** | sequential | shared | Open-ended, each step depends on the last, context fits. |
Figure: the ReAct cycle as a pipeline. An agent reasons, calls a tool, observes the result, and loops until it can answer.
Multi-agent is not free, and saying so is a senior signal. Three taxes:
So the heuristic: reach for multi-agent only when the task is (a) genuinely parallelizable or (b) context-separable into specialists — and the value clears the ~10–15× token bar. A single agent with good tools beats a multi-agent system for most tasks. "I'd start with one agent and split only where I measure a context or parallelism win" is the answer that lands.
Where the multiplier comes from: every worker re-reads its own tool definitions (maybe 2000 tokens each), and the orchestrator reads every worker's output (maybe 5 workers × 1000 tokens each = 5000). A single agent reads the tools once and produces one output. A 5-worker system pays for 5 copies of the tool definitions plus 5 outputs read back by the orchestrator. Rough math, normalizing the single-agent run to a 100-token baseline: five workers with about 5 tool reads each, plus one read of each worker's output, plus orchestrator overhead, comes to roughly 100 × 15 = 1500 tokens. The exact factor depends on the workload, but a multiplier around 15× is realistic.
So: if your task is "find one fact," multi-agent loses (100 tokens vs 1500). If your task is "parallelize 5 independent analyses," you might gain up to 5× in wall-clock time (30 seconds vs 150), and the token cost can be worth the speed. If your task is "keep context under control on a 500-step task," the isolation win (no "lost in the middle" problem) can be worth the tokens. Only cross that 15× bar if you measure a real win.
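The trade-off can be written down as two ratios: the token multiplier you pay and the wall-clock speedup you get. The sketch below only does that arithmetic. The `Run` record and `compare` helper are hypothetical names, and the inputs reuse the illustrative figures from the text (100 vs 1500 tokens, 150 seconds sequential vs 30 seconds parallel), not measurements.

```python
from dataclasses import dataclass

@dataclass
class Run:
    tokens: int      # total tokens billed for the run
    seconds: float   # wall-clock time

def compare(single: Run, multi: Run) -> dict[str, float]:
    """Token multiplier paid and wall-clock speedup gained by going multi-agent."""
    return {
        "token_multiplier": multi.tokens / single.tokens,
        "speedup": single.seconds / multi.seconds,
    }

# Illustrative figures only; substitute numbers measured on your own workload.
report = compare(Run(tokens=100, seconds=150), Run(tokens=1500, seconds=30))
# report == {"token_multiplier": 15.0, "speedup": 5.0}
```

Putting both numbers side by side makes the decision explicit: here the system pays 15× the tokens for a 5× speedup, which is only worth it when latency matters more than cost.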
Flashcard. Plan-execute = decide once, execute cheap (predictable tasks). Orchestrator-workers = parallel + isolated context (independent sub-tasks). Handoff = route to a specialist (one active at a time). Multi-agent costs ~10–15× tokens — use it only when parallelizable or context-separable.