Tools are the agent's hands — and the agent-computer interface (ACI) is as much a design surface as a human UI. Master function-calling mechanics, the bash-vs-dedicated-tool decision, and MCP, the protocol that turned tools into a plug-in ecosystem. Then assemble the tool surface of a Codex/Claude-Code-style coding agent.
ai-eng-wiki/examples/agents/mcp_server.pyA model with no tools can only emit text. Tools are how an agent reads the world and changes it — search, query a database, run code, send an email, edit a file. The mechanism is function calling (a.k.a. tool use): you describe tools as JSON-Schema'd functions, the model emits a structured request to call one, your code executes it, and you feed the result back. The model never runs anything; it only asks.
That last point is the whole security and design story. Because you execute, you decide what's allowed, what's gated, what's logged, and what's even possible. The set of tools you expose, and how you shape them, is the agent-computer interface (ACI) — and it deserves the same care as a human UI. Anthropic's guidance is blunt: teams spend more time iterating on tools than on prompts (Writing tools for agents).
MCP (Model Context Protocol) is what happened when the industry standardized that interface. Instead of every app hand-writing tools for every integration, MCP defines a wire protocol so any agent can speak to any tool server — the "USB-C of AI tools." By 2026 it's an ecosystem standard (Anthropic donated it to the Linux Foundation's Agentic AI Foundation in late 2025), with first-party support across the major model providers and thousands of community servers.
Tool design separates people who've used agents from people who've shipped them. The signal:
The words first.
get_weather or "search the orders table." Key point: the model itself cannot run it; only your program can.{"city": "Paris"}.Step by step.
Remember this: the model only asks to use tools by emitting structured requests; your code does the running and returns results, and MCP is the shared plug that makes those tools reusable across apps.
A tool is {name, description, input_schema}. The model returns a tool_use block with the chosen tool and validated arguments; you return a tool_result with the output (see the agent loop). Four mechanics worth knowing cold:
tool_choice controls whether the model may/must call a tool: auto (decide), any (must call something), tool (must call this one), none (forbidden). Use tool/any to force structured behavior.disable_parallel_tool_use if a downstream tool depends on another's output.json.loads / JSON.parse, never substring checks.Let's walk a real agent loop. You describe a tool "get_weather": {"city": string, "unit": "C" | "F"} and the user asks "Is Paris warm?" (1) The model reads the schema and request, then outputs a tool_use block: {"tool": "get_weather", "input": {"city": "Paris", "unit": "C"}}. (2) Your harness parses that JSON (critical: json.loads, not string-search), extracts city="Paris" and unit="C", and actually calls the weather API. (3) Say it returns {"temp": 22, "condition": "clear"}. You send back a tool_result with that data. (4) The model reads it and writes the final answer: "Paris is 22°C and clear — yes, warm by most standards." That entire cycle — from user message through schema, tool call (request), execution (your code), result, and final response — is one complete turn. The model never touched the weather API; your harness did.
The model decides whether to call a tool entirely from its description. Recent Opus models reach for tools more conservatively, so a description that states the trigger condition — "Call this when the user asks about current prices or recent events" — measurably outperforms one that only says what the tool does. Treat tool descriptions like you'd treat docstrings the model reads at runtime, because that's exactly what they are.
A single bash tool gives the model enormous reach: it can do almost anything by composing shell commands. But it hands your harness an opaque string — the same shape for ls as for rm -rf /. A dedicated tool (edit_file, run_tests, send_email) gives the harness a typed, intercept-able hook. The rule of thumb that lands in interviews:
Start with bash for breadth. Promote an action to a dedicated tool when you need to gate, render, audit, or parallelize it.
The four promotion triggers, each with a concrete reason:
| Trigger | Why bash can't do it |
|---|---|
| Security boundary | A send_email tool can be confirmed before sending; bash -c "curl -X POST..." can't be gated. Reversibility is the test — hard-to-reverse actions become dedicated, gated tools. |
| Staleness check | A dedicated edit tool can reject a write if the file changed since the model last read it. Bash can't enforce that invariant. |
| Rendering | Some actions need custom UI (a confirmation modal, a diff view). Claude Code makes "ask the user" a tool so it can render a modal and block the loop. |
| Scheduling | Read-only tools (grep, glob) can be marked parallel-safe; bash can't tell a safe grep from an unsafe git push, so it must serialize. |
This bash-vs-tools answer, with the reversibility framing, is one of the highest-signal things you can say in an agent interview.
Ad-hoc function calling has a scaling problem: every agent re-implements tools for every system (your GitHub tool, my GitHub tool, their GitHub tool). MCP fixes it by standardizing the interface so tools become reusable servers:
That's the leverage: a Notion MCP server, a Postgres MCP server, a GitHub MCP server — built once, usable by Claude, by the OpenAI Agents SDK, by your own harness. By 2026 it's the default integration layer, with governance under the Linux Foundation and adoption across the major labs.
A minimal MCP server is tiny. The full runnable version is in ai-eng-wiki/examples/agents/mcp_server.py:
# pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP
mcp = FastMCP("docs")
@mcp.tool()
def search_docs(query: str) -> str:
"""Search the internal docs. Call this whenever the answer depends on
company-specific information you don't already know."""
# ... your real retrieval here ...
return f"Top result for '{query}': ..."
@mcp.resource("doc://{slug}")
def get_doc(slug: str) -> str:
"""Return the full text of a doc by slug."""
return load_doc(slug)
if __name__ == "__main__":
mcp.run() # speaks MCP over stdio; any MCP client can now use theseThe decorator is the schema: FastMCP turns the function signature + docstring into the JSON Schema and description the model sees. Note the docstring is prescriptive about when to call — same principle as §3.2.
Here's what happens when you run that mcp.run(). (1) FastMCP inspects search_docs: it sees a parameter query of type str, and grabs the docstring. (2) It builds a JSON schema: {"name": "search_docs", "description": "Search the internal docs. Call this whenever...", "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}}}. (3) The server listens on stdio and waits for a client. (4) When an MCP client (like Claude Code) connects, it sends a resources/list request; the server replies with what tools, resources, and prompts it offers. (5) The client (your app) now knows about search_docs with full description and schema — same as if you'd hand-written the JSON. (6) User says "Find my Q2 goals," the model calls search_docs with query="Q2 goals", your function runs the retrieval, returns the result, and the loop continues. One server, any client — that's the MCP leverage.
Codex and Claude Code are, under the hood, ReAct agents with a carefully designed tool surface — exactly the ACI principles above, applied to a repo. The canonical set:
| Tool | Why it's dedicated (not bash) |
Walk through the table: (1) read_file records the version — when the agent reads /app.py, it remembers the line count, maybe a hash. Later, if edit tries to change line 50 but the file now has 45 lines, the edit fails with "file changed" — staleness caught. (2) edit does string-replace, not arbitrary text insertion — "find this exact string, replace with that" — and checks staleness before writing. That prevents the agent from clobbering a concurrent change (e.g., you edited the file while the agent was thinking). It also renders a diff so you see exactly what changed before approving. (3) grep and glob are read-only and parallel-safe: ten parallel grep calls don't interfere, so the harness can fire them all at once (fanout), not serially. (4) bash is kept for breadth — tests, builds, arbitrary commands — but sandboxed: it runs in a container or restricted environment, so rm -rf / fails fast. (5) run_tests is a verification hook: after every edit, the harness auto-runs tests and feeds failures back as observations (the eval loop). This turns "looks done" into "actually works." Result: a tool surface that's hard to misuse, measurably safer, and faster.
| --- | --- |
| read_file | Records what version the agent saw, enabling staleness checks on write. |
| edit (string-replace) | Rejects the edit if the file changed since read — prevents clobbering concurrent changes. Renders as a reviewable diff. |
| grep / glob | Marked parallel-safe and read-only, so many can run at once. |
| bash | Kept for breadth (build, run arbitrary commands) — but sandboxed, with network and write scope limited. |
| run_tests | A verification hook: the harness can auto-run it after edits and feed failures back (the evaluator loop). |
The senior design notes that separate a good answer from a great one:
bash runs in a container or restricted environment with bounded filesystem and network access — the model writing rm -rf should be containable. Untrusted-code execution is a trust-boundary question, not a prompt question.git push, deploy, delete get confirmation; everything else can be auto-approved for flow.Ask yourself: if the model calls this tool and I didn't want it to, can I undo it? (1) Reversible: edit a file — undo is free (git diff, revert). grep — no side effect. run_tests — no harm. Auto-approve. (2) Hard to reverse: git push — it's in the remote, others may pull it, harder to undo. deploy — prod is live. delete a file — data loss if not backed up. Require confirmation — pop a modal, block the loop, wait for you. That's the clean signal: reversibility decides the gate. The harness controls the flow, so you get to decide. A model that can push to main without asking is a red flag; a model that confirms before destructive acts is trustworthy.
tool_result with is_error: true and an informative message — the model recovers far better from "file not found: did you mean X?" than from a stack trace.tool_use request (name + validated args) and stops. Your harness executes the function and returns a tool_result; the model continues. The model never executes anything — which is exactly why you can gate and audit every action.read_file (records the version seen), edit (string-replace with a staleness check + diff rendering), grep/glob (read-only, parallel-safe), run_tests (verification hook), and a sandboxed bash for breadth with bounded fs/network. Keep writes serial in the main agent; fan out read-only exploration to cheaper sub-agents. Gate push/deploy behind confirmation. Run tests after each edit and feed failures back as observations (eval-driven loop). That's the Codex/Claude-Code shape.is_error: true.bash for everything irreversible — no gating, no audit. Promote dangerous actions to dedicated, confirmable tools.Flashcard. Tools = the ACI; the harness executes, the model only requests. Bash for breadth, dedicated tools to gate/enforce/render/parallelize (reversibility is the test). MCP standardizes tools into reusable servers (tools + resources + prompts over stdio/HTTP). A coding agent is a ReAct loop + a careful, sandboxed, verification-driven tool surface.