Agent Harnesses: The Infrastructure Your AI Needs to Actually Work
What a harness is, why it matters, and how Claude Code uses one to run autonomous coding sessions
An LLM by itself is a text transformer. You give it tokens; it gives you tokens back. To make it do things — browse the web, write code, call APIs, update a database — you need infrastructure around the model. That infrastructure is the agent harness.
The Problem With Raw LLM Calls
A raw API call is stateless and single-shot. Each call is independent; the model has no memory of previous turns unless you manage that explicitly. It cannot take actions in the world; it can only return text. And it has no mechanism to retry, verify, or correct its own output.
For a simple chatbot, that is fine. An agent that has to read a file, write a fix, run tests, check the output, and iterate needs more.
What Is an Agent Harness?
A harness is the loop and scaffolding that turns a language model into an agent. It handles:
The agent loop — perceive → decide → act → observe → repeat
Tool execution — actually running the functions the model requests
Context management — deciding what history to keep as the loop grows
Guardrails — preventing the model from taking actions it shouldn't
Interruption — letting humans intervene at key decision points
The Agent Loop
Every agent harness is a variation of this loop:
async function agentLoop(task: string): Promise<string> {
  const messages: Message[] = [{ role: "user", content: task }];
  while (true) {
    const response = await llm.call({ messages, tools });
    if (response.stopReason === "end_turn") {
      return response.text; // Model is done
    }
    if (response.stopReason !== "tool_use") {
      // Any other stop reason (e.g. a token limit) would otherwise loop forever
      throw new Error(`Unexpected stop reason: ${response.stopReason}`);
    }
    // Execute every tool the model requested
    const toolResults = await Promise.all(
      response.toolCalls.map(call => executeTool(call))
    );
    // Feed results back as the next turn
    messages.push(
      { role: "assistant", content: response.content },
      { role: "user", content: toolResults }
    );
  }
}

The loop continues for as long as the model requests tool calls and ends when it produces a final answer. The harness owns the loop and the tool execution; the model only sees text in and text out.
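One practical refinement to the loop is a step budget, so that a model that keeps requesting tools cannot run forever. The following is a minimal, synchronous sketch under stated assumptions: the types, `runBoundedLoop`, `scriptedModel`, and the `maxSteps` parameter are hypothetical names made up for illustration, and the "model" is a scripted stand-in rather than a real LLM call.

```typescript
// Hypothetical types for illustration; not any real SDK's API.
type ToolCall = { name: string; input: string };
type ModelTurn =
  | { stopReason: "end_turn"; text: string }
  | { stopReason: "tool_use"; toolCalls: ToolCall[] };
type Model = (transcript: string[]) => ModelTurn;
type ToolRunner = (call: ToolCall) => string;

// Drives the loop until the model finishes or the step budget runs out.
function runBoundedLoop(
  task: string,
  model: Model,
  runTool: ToolRunner,
  maxSteps = 10,
): { answer: string; steps: number } {
  const transcript: string[] = [`user: ${task}`];
  for (let step = 1; step <= maxSteps; step++) {
    const turn = model(transcript);
    if (turn.stopReason === "end_turn") {
      return { answer: turn.text, steps: step };
    }
    // Run each requested tool and append its result for the next turn.
    for (const call of turn.toolCalls) {
      transcript.push(`tool_call: ${call.name}(${call.input})`);
      transcript.push(`tool_result: ${runTool(call)}`);
    }
  }
  throw new Error(`Step budget of ${maxSteps} exhausted`);
}

// A scripted stand-in for the model: asks for one file read, then answers.
const scriptedModel: Model = (transcript) =>
  transcript.some((line) => line.startsWith("tool_result:"))
    ? { stopReason: "end_turn", text: "The file says: hello" }
    : { stopReason: "tool_use", toolCalls: [{ name: "read_file", input: "notes.txt" }] };

const outcome = runBoundedLoop("Summarise notes.txt", scriptedModel, () => "hello");
```

Here `outcome.steps` is 2: one turn to request the tool and one to answer. Throwing when the budget is exhausted is one possible policy; a harness could equally return a partial result or hand control back to the user.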
Tools: The Agent's Hands
A tool is a typed, callable function the model can invoke by name. The harness describes each tool to the model in JSON Schema, executes calls on the model's behalf, and returns the result as the next turn.
const tools = [
  {
    name: "read_file",
    description: "Read the contents of a file from the filesystem",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string", description: "Absolute or relative file path" },
      },
      required: ["path"],
    },
  },
  {
    name: "run_bash",
    description: "Execute a bash command and return stdout + stderr",
    input_schema: {
      type: "object",
      properties: {
        command: { type: "string" },
        timeout_ms: { type: "number", default: 30000 },
      },
      required: ["command"],
    },
  },
];

Tool design matters more than most people realise. Tools should be narrow and composable. A tool that does five things is harder for the model to use correctly than five tools that each do one thing.
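On the harness side, "executes calls on the model's behalf" usually comes down to a lookup by tool name plus some input checking. The sketch below is one way that might look; `Tool`, `registry`, `dispatch`, `list_files`, and the in-memory `fakeFiles` map are all hypothetical, chosen so the example runs without touching a real filesystem.

```typescript
// Hypothetical shapes for illustration; real SDKs define their own.
type ToolInput = Record<string, unknown>;
interface Tool {
  name: string;
  required: string[];
  run: (input: ToolInput) => string;
}

// An in-memory stand-in for the filesystem so the sketch stays self-contained.
const fakeFiles: Record<string, string> = { "README.md": "# Demo" };

// Two narrow tools, each doing exactly one thing.
const registry: Tool[] = [
  {
    name: "read_file",
    required: ["path"],
    run: (input) => fakeFiles[String(input.path)] ?? "error: file not found",
  },
  {
    name: "list_files",
    required: [],
    run: () => Object.keys(fakeFiles).join("\n"),
  },
];

// Looks up the tool by name, checks required fields, then runs it.
// Failures come back as text so the model can read them and try again.
function dispatch(name: string, input: ToolInput): string {
  const tool = registry.find((t) => t.name === name);
  if (!tool) return `error: unknown tool "${name}"`;
  const missing = tool.required.filter((key) => !(key in input));
  if (missing.length > 0) {
    return `error: missing required field(s): ${missing.join(", ")}`;
  }
  return tool.run(input);
}
```

Returning errors as ordinary results, rather than throwing, is a design choice in this sketch: it keeps the loop alive and lets the model correct a malformed call on its next turn.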
Context Management
Context is the harness's hardest problem. The agent loop naturally accumulates messages (task, tool calls, results, reflections), and contexts get long fast. Long contexts are slow and expensive, and past a certain length they hurt model performance.
Common strategies:
Sliding window — drop the oldest messages once you hit a token threshold
Summarisation — periodically ask the model to summarise the conversation so far and replace the raw history
Structured memory — store important facts in a separate data structure, inject only what is relevant each turn
Tool result compression — return summaries of large outputs rather than the raw bytes
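The first of these strategies, the sliding window, is simple enough to sketch in a few lines. This version makes two assumptions that are not from the text above: it always keeps the first message (the original task), and it estimates tokens at roughly four characters each. `Message`, `estimateTokens`, and `slidingWindow` are hypothetical names.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Rough heuristic (an assumption): about four characters per token.
function estimateTokens(message: Message): number {
  return Math.ceil(message.content.length / 4);
}

// Keeps the first message (the task) plus the newest messages that fit the budget.
function slidingWindow(messages: Message[], maxTokens: number): Message[] {
  const task = messages[0];
  if (task === undefined) return [];
  const kept: Message[] = [];
  let used = estimateTokens(task);
  // Walk from newest to oldest so the most recent messages are kept first.
  for (const message of messages.slice(1).reverse()) {
    const cost = estimateTokens(message);
    if (used + cost > maxTokens) break;
    kept.unshift(message);
    used += cost;
  }
  return [task, ...kept];
}
```

A real harness would likely use the provider's tokenizer instead of a character heuristic, and would need to drop a tool call and its result together so the remaining history stays well-formed.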
The Claude Agent SDK, which packages the harness behind Claude Code, includes a compaction step: when the context approaches the model's limit, it automatically summarises the history and resumes from the summary. The harness handles the mechanics transparently.
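To make the idea of compaction concrete, here is a minimal sketch of the general technique. It is not the SDK's implementation: `compact`, its `threshold` and `keepRecent` parameters, and the injected `summarise` function are hypothetical, and in a real harness the summary would come from a model call rather than a stub.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}
type Summarise = (history: Message[]) => string;

// Once the history exceeds `threshold` estimated tokens, replaces everything
// except the last `keepRecent` messages with a single summary message.
function compact(
  history: Message[],
  threshold: number,
  keepRecent: number,
  summarise: Summarise,
): Message[] {
  // Same rough assumption as before: about four characters per token.
  const total = history.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
  if (total <= threshold || history.length <= keepRecent) return history;

  const cut = history.length - keepRecent;
  const older = history.slice(0, cut);
  const recent = history.slice(cut);
  const summary: Message = {
    role: "user",
    content: `Summary of earlier work: ${summarise(older)}`,
  };
  return [summary, ...recent];
}

// Stub summariser for the example; a real one would ask the model.
const countingSummariser: Summarise = (older) => `${older.length} earlier messages`;
```

Keeping the most recent messages verbatim is a common choice because the model usually needs the exact text of the last few tool results, while older turns can tolerate lossy summaries.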
Guardrails and Safety
An agent that can freely run bash commands, write files, and call APIs can cause significant damage if it goes off-script. Guardrails are the harness's enforcement layer.
They come in two flavours:
Pre-execution — inspect the tool call before running it. Block or ask for confirmation if the action matches a risk pattern (e.g. rm -rf, DROP TABLE, git push --force).
Post-execution — inspect the tool result before returning it to the model. Redact secrets, rate-limit, or halt on unexpected output.
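Both checks can start out as plain pattern matching. The sketch below shows one hypothetical shape for a pre-execution `isDestructive` check and a post-execution `redactSecrets` pass. The `ToolCall` shape, the pattern list, and the choice to operate on plain strings are assumptions made for illustration, and the patterns are nowhere near complete enough for production use.

```typescript
// Hypothetical shape; assumes a bash-style tool whose input carries a command string.
interface ToolCall {
  name: string;
  input: { command?: string };
}

// Illustrative risk patterns only; a real deny-list needs to be far more thorough.
const riskPatterns: RegExp[] = [
  /\brm\s+-(rf|fr)\b/i,
  /\bdrop\s+table\b/i,
  /\bgit\s+push\b.*--force/i,
];

// Pre-execution: flags calls whose command matches any risk pattern.
function isDestructive(call: ToolCall): boolean {
  const command = call.input.command ?? "";
  return riskPatterns.some((pattern) => pattern.test(command));
}

// Post-execution: masks KEY=value pairs whose key looks secret-bearing.
function redactSecrets(output: string): string {
  return output.replace(
    /\b(\w*(?:TOKEN|SECRET|PASSWORD|API_KEY)\w*)=(\S+)/gi,
    "$1=[REDACTED]",
  );
}
```

Pattern matching like this is easy to evade (for example, by splitting a command across variables), so it works best as a first filter in front of stricter controls such as sandboxing or allow-lists.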
async function executeTool(call: ToolCall): Promise<ToolResult> {
  // Pre-execution guardrail
  if (isDestructive(call)) {
    const approved = await requestHumanApproval(call);
    if (!approved) return { error: "User denied this action." };
  }
  const result = await runTool(call);
  // Post-execution guardrail
  return redactSecrets(result);
}

Claude Code as a Case Study
Claude Code is a production agent harness for software engineering tasks. It runs in your terminal, and Anthropic packages the same harness for developers as the Claude Agent SDK. How it is built illustrates every concept above:
Agent loop — every task runs until Claude signals it is done or asks for input. The harness drives the turns.
Tools — Read, Edit, Write, Bash, WebFetch, Agent (spawn subagents). Each is narrow; Claude composes them to do complex work.
Context management — /compact triggers a summarisation pass; the harness handles the swap transparently.
Guardrails — destructive Bash commands (rm -rf, force push, drop table) surface a permission prompt. The user approves or denies before execution.
Interruption — the user can reject any tool call mid-loop. The harness feeds the rejection back to the model and the loop continues.
The model itself does not change between a raw API call and a Claude Code session. The harness is what makes the difference.
Building a harness from scratch is tractable for simple agents. As scope grows (more tools, longer loops, multi-agent coordination), reach for the Claude Agent SDK or a framework like LangGraph that handles the scaffolding. The model is the smallest part of the system.