IC4IC5IC6

Tool Use, Function Calling & MCP

Tools are the agent's hands — and the agent-computer interface (ACI) is as much a design surface as a human UI. Master function-calling mechanics, the bash-vs-dedicated-tool decision, and MCP, the protocol that turned tools into a plug-in ecosystem. Then assemble the tool surface of a Codex/Claude-Code-style coding agent.

22 min read · 12 sections

Prerequisites: react-agent, JSON Schema basics

Runnable: ai-eng-wiki/examples/agents/mcp_server.py

1. Quick anchor

A model with no tools can only emit text. Tools are how an agent reads the world and changes it — search, query a database, run code, send an email, edit a file. The mechanism is function calling (a.k.a. tool use): you describe tools as JSON-Schema'd functions, the model emits a structured request to call one, your code executes it, and you feed the result back. The model never runs anything; it only asks.

That last point is the whole security and design story. Because you execute, you decide what's allowed, what's gated, what's logged, and what's even possible. The set of tools you expose, and how you shape them, is the agent-computer interface (ACI) — and it deserves the same care as a human UI. Anthropic's guidance is blunt: teams spend more time iterating on tools than on prompts (Writing tools for agents).

MCP (Model Context Protocol) is what happened when the industry standardized that interface. Instead of every app hand-writing tools for every integration, MCP defines a wire protocol so any agent can speak to any tool server — the "USB-C of AI tools." By 2026 it's an ecosystem standard (Anthropic donated it to the Linux Foundation's Agentic AI Foundation in late 2025), with first-party support across the major model providers and thousands of community servers.

2. Why interviewers probe this

Tool design separates people who've used agents from people who've shipped them. The signal:

IC3/IC4 — Do you understand that the harness executes tools, the model only requests? Can you write a clean tool schema with a description that makes the model call it correctly?
IC5 — The ACI design questions: bash vs dedicated tools, how to gate side effects, how to make a tool hard to misuse (poka-yoke). And: what does MCP buy over ad-hoc function calling?
IC6+ — Can you design the full tool surface of a real agent (a coding assistant) with sandboxing, staleness checks, and reversibility baked in?

3. Concept build-up

Beginner explainerNew here? The words first

The words first.

Tool (a.k.a. function) — a capability the model can ask to use, like get_weather or "search the orders table." Key point: the model itself cannot run it; only your program can.
Tool schema — a written description you hand the model: the tool's name, what it does, and the shape of its inputs (a JSON Schema listing each parameter and its type).
Tool call — the model's structured output that says "run this tool with these arguments," given as JSON, not prose. It is a request, not an action.
Arguments — the input values the model fills into the schema, e.g. {"city": "Paris"}.
Tool result — the output your code sends back to the model after actually running the tool.
The agent loop — repeating call to run to result to continue, until the model stops asking for tools and writes a final answer.
MCP (Model Context Protocol) — an open standard for exposing tools (and data) so any app or model can discover and call them the same way.
MCP server / client — a server publishes tools over the protocol; the client (your app) connects, lists what is available, and calls them.

Step by step.

You send the model the user's message plus a list of tool schemas.
The model either answers directly or emits a tool call naming one tool and its arguments.
Your code catches that request and actually runs the function: hits the API, queries the DB, etc.
You send the output back as a tool result.
The model reads the result and continues: maybe another tool call, maybe the final answer.
MCP standardizes steps 1-4 so one tool server plugs into any client without custom glue.

Remember this: the model only asks to use tools by emitting structured requests; your code does the running and returns results, and MCP is the shared plug that makes those tools reusable across apps.

3.1 Function-calling mechanics

A tool is {name, description, input_schema}. The model returns a tool_use block with the chosen tool and validated arguments; you return a tool_result with the output (see the agent loop). Four mechanics worth knowing cold:

tool_choice controls whether the model may/must call a tool: auto (decide), any (must call something), tool (must call this one), none (forbidden). Use tool/any to force structured behavior.
Parallel tool use. The model can request several tools in one turn (look up the customer and fetch their orders simultaneously). Return all results in one user turn. Set disable_parallel_tool_use if a downstream tool depends on another's output.
Strict schemas. With strict structured tool use, arguments are guaranteed to validate against your schema — no more defensive parsing of malformed JSON.
Always parse, never string-match. Tool-call inputs are JSON; escaping can vary by model. Parse with json.loads / JSON.parse, never substring checks.

✎ Function calling: a complete turn

Let's walk a real agent loop. You describe a tool "get_weather": {"city": string, "unit": "C" | "F"} and the user asks "Is Paris warm?" (1) The model reads the schema and request, then outputs a tool_use block: {"tool": "get_weather", "input": {"city": "Paris", "unit": "C"}}. (2) Your harness parses that JSON (critical: json.loads, not string-search), extracts city="Paris" and unit="C", and actually calls the weather API. (3) Say it returns {"temp": 22, "condition": "clear"}. You send back a tool_result with that data. (4) The model reads it and writes the final answer: "Paris is 22°C and clear — yes, warm by most standards." That entire cycle — from user message through schema, tool call (request), execution (your code), result, and final response — is one complete turn. The model never touched the weather API; your harness did.

3.2 The description is the API

The model decides whether to call a tool entirely from its description. Recent Opus models reach for tools more conservatively, so a description that states the trigger condition — "Call this when the user asks about current prices or recent events" — measurably outperforms one that only says what the tool does. Treat tool descriptions like you'd treat docstrings the model reads at runtime, because that's exactly what they are.

3.3 Bash vs dedicated tools — the central ACI decision

A single bash tool gives the model enormous reach: it can do almost anything by composing shell commands. But it hands your harness an opaque string — the same shape for ls as for rm -rf /. A dedicated tool (edit_file, run_tests, send_email) gives the harness a typed, intercept-able hook. The rule of thumb that lands in interviews:

Start with bash for breadth. Promote an action to a dedicated tool when you need to gate, render, audit, or parallelize it.

The four promotion triggers, each with a concrete reason:

Trigger	Why bash can't do it
Security boundary	A `send_email` tool can be confirmed before sending; `bash -c "curl -X POST..."` can't be gated. Reversibility is the test — hard-to-reverse actions become dedicated, gated tools.
Staleness check	A dedicated `edit` tool can reject a write if the file changed since the model last read it. Bash can't enforce that invariant.
Rendering	Some actions need custom UI (a confirmation modal, a diff view). Claude Code makes "ask the user" a tool so it can render a modal and block the loop.
Scheduling	Read-only tools (`grep`, `glob`) can be marked parallel-safe; bash can't tell a safe `grep` from an unsafe `git push`, so it must serialize.

This bash-vs-tools answer, with the reversibility framing, is one of the highest-signal things you can say in an agent interview.

3.4 MCP — tools as a protocol

Ad-hoc function calling has a scaling problem: every agent re-implements tools for every system (your GitHub tool, my GitHub tool, their GitHub tool). MCP fixes it by standardizing the interface so tools become reusable servers:

An MCP server exposes tools (callable functions), resources (readable data, like files or rows), and prompts (reusable templates).
An MCP client (inside the agent host) connects over a transport (stdio for local, Streamable HTTP for remote) and discovers what the server offers.
Any MCP-speaking agent can use any MCP server. Write a server once; every agent gets the integration.

That's the leverage: a Notion MCP server, a Postgres MCP server, a GitHub MCP server — built once, usable by Claude, by the OpenAI Agents SDK, by your own harness. By 2026 it's the default integration layer, with governance under the Linux Foundation and adoption across the major labs.

A minimal MCP server is tiny. The full runnable version is in ai-eng-wiki/examples/agents/mcp_server.py:

# pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP
 
mcp = FastMCP("docs")
 
@mcp.tool()
def search_docs(query: str) -> str:
    """Search the internal docs. Call this whenever the answer depends on
    company-specific information you don't already know."""
    # ... your real retrieval here ...
    return f"Top result for '{query}': ..."
 
@mcp.resource("doc://{slug}")
def get_doc(slug: str) -> str:
    """Return the full text of a doc by slug."""
    return load_doc(slug)
 
if __name__ == "__main__":
    mcp.run()   # speaks MCP over stdio; any MCP client can now use these

The decorator is the schema: FastMCP turns the function signature + docstring into the JSON Schema and description the model sees. Note the docstring is prescriptive about when to call — same principle as §3.2.

✎ MCP server: from decorator to discovery

Here's what happens when you run that mcp.run(). (1) FastMCP inspects search_docs: it sees a parameter query of type str, and grabs the docstring. (2) It builds a JSON schema: {"name": "search_docs", "description": "Search the internal docs. Call this whenever...", "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}}}. (3) The server listens on stdio and waits for a client. (4) When an MCP client (like Claude Code) connects, it sends a resources/list request; the server replies with what tools, resources, and prompts it offers. (5) The client (your app) now knows about search_docs with full description and schema — same as if you'd hand-written the JSON. (6) User says "Find my Q2 goals," the model calls search_docs with query="Q2 goals", your function runs the retrieval, returns the result, and the loop continues. One server, any client — that's the MCP leverage.

4. Building a coding agent's tool surface

Codex and Claude Code are, under the hood, ReAct agents with a carefully designed tool surface — exactly the ACI principles above, applied to a repo. The canonical set:

| Tool | Why it's dedicated (not bash) |

✎ The coding agent's tool surface: each tool's job

Walk through the table: (1) read_file records the version — when the agent reads /app.py, it remembers the line count, maybe a hash. Later, if edit tries to change line 50 but the file now has 45 lines, the edit fails with "file changed" — staleness caught. (2) edit does string-replace, not arbitrary text insertion — "find this exact string, replace with that" — and checks staleness before writing. That prevents the agent from clobbering a concurrent change (e.g., you edited the file while the agent was thinking). It also renders a diff so you see exactly what changed before approving. (3) grep and glob are read-only and parallel-safe: ten parallel grep calls don't interfere, so the harness can fire them all at once (fanout), not serially. (4) bash is kept for breadth — tests, builds, arbitrary commands — but sandboxed: it runs in a container or restricted environment, so rm -rf / fails fast. (5) run_tests is a verification hook: after every edit, the harness auto-runs tests and feeds failures back as observations (the eval loop). This turns "looks done" into "actually works." Result: a tool surface that's hard to misuse, measurably safer, and faster.

| --- | --- | | read_file | Records what version the agent saw, enabling staleness checks on write. | | edit (string-replace) | Rejects the edit if the file changed since read — prevents clobbering concurrent changes. Renders as a reviewable diff. | | grep / glob | Marked parallel-safe and read-only, so many can run at once. | | bash | Kept for breadth (build, run arbitrary commands) — but sandboxed, with network and write scope limited. | | run_tests | A verification hook: the harness can auto-run it after edits and feed failures back (the evaluator loop). |

The senior design notes that separate a good answer from a great one:

Sandboxing. bash runs in a container or restricted environment with bounded filesystem and network access — the model writing rm -rf should be containable. Untrusted-code execution is a trust-boundary question, not a prompt question.
Multi-file edits stay serial in the main agent (parallel edits conflict), while read-only exploration fans out to sub-agents (last lesson).
Verification is a tool, and the loop uses it. Edit → run tests → on failure, feed the diff + failure back as an observation. This eval-driven loop is the difference between a coding agent that looks done and one that is done — the heart of Harness Engineering.
Gate the irreversible. git push, deploy, delete get confirmation; everything else can be auto-approved for flow.

✎ Reversibility: the test for tool gating

Ask yourself: if the model calls this tool and I didn't want it to, can I undo it? (1) Reversible: edit a file — undo is free (git diff, revert). grep — no side effect. run_tests — no harm. Auto-approve. (2) Hard to reverse: git push — it's in the remote, others may pull it, harder to undo. deploy — prod is live. delete a file — data loss if not backed up. Require confirmation — pop a modal, block the loop, wait for you. That's the clean signal: reversibility decides the gate. The harness controls the flow, so you get to decide. A model that can push to main without asking is a red flag; a model that confirms before destructive acts is trustworthy.

5. Production tradeoffs

Tool sprawl confuses the model. Too many tools and the model picks wrong. Keep the active set focused; for large tool libraries, use tool search (the model discovers relevant tools on demand) so you don't stuff 200 schemas into every request — and it preserves the prompt cache because schemas are appended, not swapped.
Tool errors are part of the contract. Return failures as tool_result with is_error: true and an informative message — the model recovers far better from "file not found: did you mean X?" than from a stack trace.
MCP auth lives outside the tool definition. A server declares what it does; credentials are injected by the host (vaults/proxies), never baked into the schema or the prompt — secrets in prompts persist in history.
Token-shape your outputs. A tool that returns 50KB of JSON blows the context budget. Filter/paginate/summarize before the result hits the model — or use programmatic tool calling so intermediate results never enter the context at all.

6. How it's asked

[IC3] "How does function calling work end to end? Who runs the tool?" You describe tools as JSON-schema functions and pass them with the request. The model emits a structured tool_use request (name + validated args) and stops. Your harness executes the function and returns a tool_result; the model continues. The model never executes anything — which is exactly why you can gate and audit every action.

[IC5] "Bash tool vs a dozen dedicated tools — how do you decide?" Start with bash for breadth — it can do anything. Promote an action to a dedicated tool when you need to gate it (irreversible side effects → confirmation), enforce an invariant (staleness check on edit), render it (diff/modal UI), or parallelize it (mark read-only tools parallel-safe). Reversibility is the cleanest test: hard-to-undo actions become dedicated, gated tools; cheap reversible ones can stay in bash.

[IC5] "What does MCP solve that ad-hoc function calling didn't?" Standardization and reuse. Ad-hoc tools are re-implemented per agent per integration. MCP defines a wire protocol (tools/resources/prompts over stdio or HTTP) so a tool server is written once and usable by any MCP-speaking agent. It turned integrations from N×M bespoke code into a plug-in ecosystem, with shared auth/discovery patterns — now an industry standard under the Linux Foundation.

[IC6] "Design the tool surface for a coding agent editing a real repo safely." read_file (records the version seen), edit (string-replace with a staleness check + diff rendering), grep/glob (read-only, parallel-safe), run_tests (verification hook), and a sandboxed bash for breadth with bounded fs/network. Keep writes serial in the main agent; fan out read-only exploration to cheaper sub-agents. Gate push/deploy behind confirmation. Run tests after each edit and feed failures back as observations (eval-driven loop). That's the Codex/Claude-Code shape.

7. Pitfalls & flashcards

Vague tool descriptions → the model calls the wrong tool or not at all. Be prescriptive about when to call.
String-matching tool inputs instead of parsing JSON — breaks on escaping differences.
Raising on tool failure instead of returning is_error: true.
Putting secrets in tool schemas or prompts — they persist in history; inject via the host's credential layer.
One mega-bash for everything irreversible — no gating, no audit. Promote dangerous actions to dedicated, confirmable tools.
Dumping huge tool outputs into context — filter/paginate first.

Flashcard. Tools = the ACI; the harness executes, the model only requests. Bash for breadth, dedicated tools to gate/enforce/render/parallelize (reversibility is the test). MCP standardizes tools into reusable servers (tools + resources + prompts over stdio/HTTP). A coding agent is a ReAct loop + a careful, sandboxed, verification-driven tool surface.

8. Further reading

Model Context Protocol — spec, SDKs, server gallery (modelcontextprotocol.io).
Writing tools for agents — Anthropic's ACI design guidance (anthropic.com/engineering).
Claude tool-use & agent-design docs (platform.claude.com).
Next pillar: RAG & Retrieval — the most common tool an agent reaches for: search over your own knowledge.

Primary sources

← More in Building AI Agents