Anatomy of a Harness: Lessons from Claude Code's Source
In March 2026, Claude Code's source code became publicly visible. For the first time, we could study the internals of the most capable AI harness in the world. Here is what we found, and what it teaches practitioners about building their own systems.
What Happened
In late March 2026, the full TypeScript source code for Claude Code (Anthropic's agentic coding tool) surfaced publicly via community mirrors on GitHub. The codebase is roughly 800,000 lines in its main module alone, with over 50 directories covering tools, hooks, skills, memory, context assembly, state management, plugins, and the core agent loop.
For anyone who has read our Harness Engineering article, this is an extraordinary opportunity. That article argued that the code wrapped around an AI model is just as important as the model itself, and cited the MetaHarness research showing 6x performance gaps from harness variations alone. Now we can see exactly how the best harness in the world is built. Not in theory. In source code.
What follows is a deep architectural analysis of Claude Code's harness, mapped to the concepts and frameworks that AAS practitioners use every day. If you are building a Personal Jarvis, designing games for agents, or thinking about self-improving systems, this is the engineering behind the curtain.
The Big Picture
Claude Code is not a chatbot with file access. It is a state machine that assembles context, dispatches tools, manages permissions, tracks budgets, and recovers from failures, all wrapped around a single model call in a loop. The model (Claude) provides the intelligence. The harness provides everything else.
The architecture breaks into ten major subsystems:
- The Agent Loop (the heartbeat)
- Context Assembly (what the model sees)
- Tools (what the model can do)
- Hooks (event-driven extensibility)
- Skills (data-driven commands)
- Memory (persistent knowledge)
- Tasks (background work)
- Commands (user interface)
- State (session tracking)
- Plugins (extensible capabilities)
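Each one teaches something different about harness engineering. The sections below walk through the subsystems with the clearest lessons, plus three mechanisms that cut across them: permissions, message normalization, and budget tracking.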
1. The Agent Loop Is a State Machine
The most important file in the entire codebase is query.ts. It contains the main agent loop, and it is not recursive. It is a pure state machine.
Each iteration of the loop follows the same pattern:
- Assemble the current state (messages, tools, context, budget)
- Normalize messages for the API (strip internal metadata, reorder attachments, merge thinking blocks)
- Call the model
- Stream the response (thinking blocks, text, tool calls)
- Execute requested tools
- Check continue conditions (budget remaining? stop hooks triggered? end_turn?)
- Loop or exit
The state is split cleanly into two categories: immutable parameters (system prompt, model config, available tools) and mutable state (messages, turn count, budget tracking, auto-compact state). At the start of each iteration, the mutable state is destructured. At the end, it is reconstructed. This prevents bugs from accidental cross-iteration mutation.
Recovery is explicit, not hidden. When the model hits its output token limit, the loop retries up to three times with an increased budget. When the context gets too long, it triggers automatic compaction (summarizing earlier conversation to free space). When a tool fails, it retries. Each recovery path is a visible branch in the state machine, not a try/catch buried somewhere.
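To make the shape concrete, here is a minimal sketch of a loop in this style. It is not the code in query.ts: every name (`LoopParams`, `LoopState`, `runLoop`, the reply kinds) is hypothetical, the model call is a synchronous stub where a real one would be async and streamed, and doubling the output budget on retry is our assumption about what "increased budget" could mean.

```typescript
// Hypothetical names throughout; a sketch of the pattern, not Claude Code's query.ts.
type Message = { role: "user" | "assistant" | "tool"; content: string };
type ModelReply =
  | { kind: "end_turn"; text: string }
  | { kind: "tool_call"; tool: string; input: string }
  | { kind: "output_limit" };

// Immutable for the whole run.
interface LoopParams {
  readonly systemPrompt: string;
  readonly maxTurns: number;
  readonly callModel: (messages: readonly Message[], maxOutputTokens: number) => ModelReply;
  readonly runTool: (tool: string, input: string) => string;
}

// Rebuilt at the end of every iteration instead of being mutated in place.
interface LoopState {
  readonly messages: readonly Message[];
  readonly turn: number;
  readonly outputLimitRetries: number;
  readonly maxOutputTokens: number;
}

const MAX_OUTPUT_LIMIT_RETRIES = 3;

function runLoop(params: LoopParams, userPrompt: string): { messages: readonly Message[]; reason: string } {
  let state: LoopState = {
    messages: [{ role: "user", content: userPrompt }],
    turn: 0,
    outputLimitRetries: 0,
    maxOutputTokens: 1024,
  };

  while (true) {
    // Destructure the mutable state at the top of the iteration.
    const { messages, turn, outputLimitRetries, maxOutputTokens } = state;
    if (turn >= params.maxTurns) return { messages, reason: "turn_budget_exhausted" };

    const reply = params.callModel(messages, maxOutputTokens);

    switch (reply.kind) {
      case "end_turn":
        return { messages: [...messages, { role: "assistant", content: reply.text }], reason: "end_turn" };
      case "output_limit":
        // Visible recovery branch: retry with a larger output budget, at most three times.
        if (outputLimitRetries >= MAX_OUTPUT_LIMIT_RETRIES) return { messages, reason: "output_limit" };
        state = { messages, turn, outputLimitRetries: outputLimitRetries + 1, maxOutputTokens: maxOutputTokens * 2 };
        break;
      case "tool_call": {
        const result = params.runTool(reply.tool, reply.input);
        // Reconstruct the state for the next iteration.
        state = {
          messages: [...messages, { role: "assistant", content: `call ${reply.tool}` }, { role: "tool", content: result }],
          turn: turn + 1,
          outputLimitRetries: 0,
          maxOutputTokens,
        };
        break;
      }
    }
  }
}
```

Because the model and tool runner are passed in as parameters, the loop can be exercised with stubs, and every exit carries a reason that says which branch ended the run.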
Why This Matters for Practitioners
If you are building any kind of agent workflow (for a client, for your own operation, for a product), the lesson is: treat the agent loop as engineering, not magic. The model is one function call inside a larger system. Everything around that call (what context goes in, what happens with tool results, how you handle failures, when you stop) is your responsibility to design.
The MetaHarness paper attributed a 6x performance gap to harness variation alone. Now we can see what that variation means in practice: changing how you assemble context, which tools you offer, when you retry versus stop, and how you manage the token budget.
2. Context Assembly Is Layered and Lazy
Claude Code does not dump everything into the system prompt. It assembles context in layers, each with different lifecycle and caching behavior.
Layer 1: System prompt. The base instructions that define what the model is and how it should behave. This is static within a session. It includes the tool descriptions, behavioral guidelines, and formatting rules.
Layer 2: System context. Runtime state like git branch, recent commits, working directory, and platform info. This is memoized (computed once, cached, and reused). It resets between sessions but stays stable within one.
Layer 3: User context. CLAUDE.md files discovered from the project tree, current date, and user preferences. Also memoized. This is the layer that makes Claude Code project-aware.
Layer 4: Memory attachments. Relevance-filtered files from the ~/.claude/projects/<slug>/memory/ directory. These are prefetched in parallel while the model is streaming its response, so by the time the model needs to call a tool, memory is already loaded. This is a performance optimization that most harnesses miss.
Layer 5: Skill content. Loaded on demand, only when a skill is invoked. The skill index (names and descriptions) loads upfront. The full skill content (the actual instructions) loads only when the model decides to use that skill.
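The caching and lazy-loading behavior of these layers can be sketched in a few lines. This is an illustration under our own assumptions, not the actual implementation: `memoize`, `ContextSources`, and `assembleContext` are hypothetical names, and the parallel memory prefetch of Layer 4 is left out to keep the example synchronous.

```typescript
// Hypothetical helper names; the layer split follows the description above.
function memoize<T>(compute: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) cached = { value: compute() };
    return cached.value;
  };
}

interface SkillEntry {
  name: string;
  description: string;
  loadContent: () => string; // full instructions, read only when the skill is invoked
}

interface ContextSources {
  systemPrompt: string;        // Layer 1: static within a session
  systemContext: () => string; // Layer 2: memoized runtime state
  userContext: () => string;   // Layer 3: memoized CLAUDE.md content
  skills: SkillEntry[];        // Layer 5: index upfront, body on demand
}

function assembleContext(src: ContextSources, invokedSkill?: string): string[] {
  const index = src.skills.map((s) => `${s.name}: ${s.description}`).join("\n");
  const layers = [src.systemPrompt, src.systemContext(), src.userContext(), `Skills:\n${index}`];
  const skill = src.skills.find((s) => s.name === invokedSkill);
  if (skill) layers.push(skill.loadContent()); // the only point where skill content is read
  return layers;
}
```

Wrapping the Layer 2 and Layer 3 producers in `memoize` means the expensive work (shelling out to git, walking the tree) happens once per session, while skill bodies cost nothing until a skill is named.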
The Economics Are Deliberate
This architecture directly reflects the economics described in our Context Engineering article: "load the minimum sufficient context for the task at hand." Claude Code does not load every CLAUDE.md, every memory file, and every skill on every turn. It loads the base, caches what's stable, prefetches what's likely, and lazy-loads everything else.
The 200-line, 25KB limit on the MEMORY.md index is a hard constraint. If your memory index exceeds this, it gets truncated with a warning. This is not a bug. It is a design choice: the memory index must fit in context without crowding out the actual work.
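A bounded index of this kind is straightforward to enforce in your own harness. The sketch below assumes that "25KB" means 25 × 1024 bytes of UTF-8 and that truncation drops lines from the end; both are our assumptions, and `truncateIndex` is a hypothetical name.

```typescript
const MAX_INDEX_LINES = 200;
const MAX_INDEX_BYTES = 25 * 1024; // assumption: "25KB" read as 25 * 1024 bytes

function truncateIndex(text: string): { text: string; truncated: boolean } {
  const lines = text.split("\n");
  let kept = lines.slice(0, MAX_INDEX_LINES);
  const encoder = new TextEncoder();
  // Drop trailing lines until the UTF-8 size fits under the byte cap.
  while (kept.length > 0 && encoder.encode(kept.join("\n")).length > MAX_INDEX_BYTES) {
    kept = kept.slice(0, -1);
  }
  return { text: kept.join("\n"), truncated: kept.length < lines.length };
}
```

The `truncated` flag is what lets the caller surface a warning instead of silently losing entries.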
For MVJ practitioners: Your folder structure is literally the context architecture. When Claude Code starts a session in your project directory, it walks the tree looking for CLAUDE.md files and loads them as context. Every CLAUDE.md you write is an instruction to the harness. Every skill file is a lazy-loaded command. Every memory file is a piece of persistent knowledge that survives across sessions. Structure these files with the same care you would structure a database schema, because they are serving the same function.
3. Tools Are Loosely Coupled Through Dependency Injection
Claude Code ships with over 30 tools: file I/O (Read, Write, Edit, Glob, Grep), execution (Bash), agents (Agent tool for subagents), skills (SkillTool), task management (TaskCreate, TaskUpdate), web access (WebSearch, WebFetch), and more.
Every tool follows the same interface:
- Name and aliases (how the model calls it)
- Input schema (Zod-validated, converted to JSON Schema for the API)
- Execute function (receives input and a ToolUseContext, returns a ToolResult)
- Optional prompt and progress functions (for dynamic descriptions and status updates)
The critical design choice: tools receive all their dependencies through ToolUseContext, a shared context object that carries the current state, permission settings, file cache, MCP clients, abort signals, and message store. Tools never import each other. They never import the main loop. They never access global state.
This is dependency injection, and it has three consequences:
- Tools are testable in isolation. You can construct a mock ToolUseContext and test any tool without running the full agent loop.
- Tools are composable. The Agent tool launches subagents that have their own tool sets and contexts. Because tools don't reach into global state, subagents cannot corrupt the parent's state.
- Tools are feature-gatable. A feature('FLAG') check at load time determines whether a tool is registered. Unused tools are stripped by the bundler. Different users get different tool sets from the same codebase.
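The testability claim is easiest to see in a small example. The sketch below uses drastically simplified, hypothetical shapes for `ToolUseContext` and `ToolResult`, a hand-written `parse` function where the source describes Zod schemas, and an invented `LineCount` tool.

```typescript
// Hypothetical shapes; the context and result types described above carry far more.
interface ToolUseContext {
  readFile: (path: string) => string;
  isAborted: () => boolean;
}
interface ToolResult {
  ok: boolean;
  output: string;
}

interface Tool<Input> {
  name: string;
  parse: (raw: unknown) => Input; // stand-in for a Zod schema
  execute: (input: Input, ctx: ToolUseContext) => ToolResult;
}

// An invented tool: everything it needs arrives through ctx, nothing is imported.
const lineCountTool: Tool<{ path: string }> = {
  name: "LineCount",
  parse(raw) {
    if (typeof raw !== "object" || raw === null) throw new Error("input must be an object");
    const path = (raw as { path?: unknown }).path;
    if (typeof path !== "string") throw new Error("path must be a string");
    return { path };
  },
  execute(input, ctx) {
    if (ctx.isAborted()) return { ok: false, output: "aborted" };
    const lines = ctx.readFile(input.path).split("\n").length;
    return { ok: true, output: String(lines) };
  },
};

// A mock context is all a unit test needs; no agent loop is involved.
const mockCtx: ToolUseContext = {
  readFile: (path) => (path === "notes.md" ? "a\nb\nc" : ""),
  isAborted: () => false,
};
const result = lineCountTool.execute(lineCountTool.parse({ path: "notes.md" }), mockCtx);
// result.output is "3"
```

The same property is what makes subagents safe: give a subagent its own context object and its tools have no path back to the parent's state.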
What This Teaches About Game Design
The Game Design article describes four components of a well-designed game: objectives, rules, guardrails, and scoring. In Claude Code's tool system, you can see each one:
- Objectives are encoded in the tool descriptions (what each tool is for, when to use it)
- Rules are encoded in the input schemas (what parameters are valid, what combinations are allowed)
- Guardrails are encoded in the permission system (canUseTool() gates every execution with user-defined allow/deny rules)
- Scoring is encoded in the budget tracker (token costs, task budgets, cost limits)
The model plays the game. The tools define the playing field. The permission system enforces the boundaries. This is game design implemented in code.
4. The Permission System Is Intent Engineering in Code
Before any tool executes, it passes through canUseTool(). This function checks the tool call against three rule sets:
- Always allow rules: Actions the user has pre-approved (e.g., "always allow Read on any file in this project")
- Always deny rules: Actions the user has forbidden (e.g., "never allow Bash commands with rm -rf")
- Always ask rules: Actions that require explicit approval each time
Hooks can intercept this process and auto-approve or auto-deny via structured JSON responses. This means organizations can encode their intent into hook configurations: "when an agent tries to push to main, always ask." "When an agent reads a file in the project directory, always allow." "When an agent tries to install a package, check against the approved list."
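A gate of this kind can be sketched as a single pure function. The version below is our own simplification: rules match on tool name plus an optional substring of the input, and the precedence (deny, then hook, then ask, then allow, defaulting to ask) is an assumption, since the analysis above does not establish the real ordering.

```typescript
type Decision = "allow" | "deny" | "ask";

// Hypothetical rule shape: an omitted substring matches any input for that tool.
interface Rule {
  tool: string;
  inputContains?: string;
}
interface PermissionRules {
  allow: Rule[];
  deny: Rule[];
  ask: Rule[];
}
type PermissionHook = (tool: string, input: string) => Decision | undefined;

function matches(rule: Rule, tool: string, input: string): boolean {
  return rule.tool === tool && (rule.inputContains === undefined || input.includes(rule.inputContains));
}

function canUseTool(rules: PermissionRules, tool: string, input: string, hook?: PermissionHook): Decision {
  // Assumed precedence: explicit deny rules win over everything else.
  if (rules.deny.some((r) => matches(r, tool, input))) return "deny";
  // A hook may settle the question; undefined means it has no opinion.
  const hooked = hook?.(tool, input);
  if (hooked !== undefined) return hooked;
  if (rules.ask.some((r) => matches(r, tool, input))) return "ask";
  if (rules.allow.some((r) => matches(r, tool, input))) return "allow";
  return "ask"; // nothing matched: fall back to asking the user
}
```

Defaulting to "ask" when nothing matches keeps the failure mode conservative: an unconfigured action interrupts the user rather than running silently.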
This is exactly the Intent Engineering pattern: organizational values translated into decision boundaries that agents respect autonomously. The Klarna example from that article (AI optimizing for the wrong goal because nobody encoded the right goal) is prevented here by making intent explicit in the permission layer.
For Practitioners Building Client Systems
When you are deploying agents for a client, the permission layer is not an afterthought. It is where you encode the client's risk tolerance, compliance requirements, and operational boundaries. "The agent can draft emails but cannot send them." "The agent can read financial data but cannot modify it." "The agent can suggest code changes but a human must approve the commit."
These are not technical constraints. They are business decisions expressed as code. And they compound: a well-configured permission layer means the client can give the agent more autonomy over time, because the boundaries are explicit and auditable.
5. Skills Are Specs, Not Code
This is one of the most important insights from the source code, and it directly validates The Spec Is the Product.
Skills in Claude Code are markdown files with YAML frontmatter. They are not TypeScript. They are not compiled. They are plain text documents that describe a workflow, and the model follows them.
A skill file contains:
- Name and description (for discovery and matching)
- When to use (triggers and relevance criteria)
- Allowed tools (which tools the skill can access)
- Model override (optionally run on a different model)
- The actual instructions (markdown describing the workflow step by step)
The harness discovers skills from three locations: bundled skills shipped with the CLI, project skills in .claude/skills/, and user skills in ~/.claude/skills/. It loads only the metadata (name, description) upfront. The full content loads only when the model decides to invoke a skill.
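Splitting a skill file into metadata and instructions takes very little code, which is part of why the format works. The parser below is a deliberately naive sketch: it handles only flat `key: value` frontmatter rather than full YAML, and `parseSkill` is a hypothetical name.

```typescript
interface ParsedSkill {
  meta: Record<string, string>; // cheap to keep in the upfront index
  body: string;                 // the instructions, needed only on invocation
}

// Naive frontmatter split: flat "key: value" lines only, not a real YAML parser.
function parseSkill(source: string): ParsedSkill {
  const match = /^---\n([\s\S]*?)\n---\n?([\s\S]*)$/.exec(source);
  if (!match) throw new Error("missing frontmatter");
  const meta: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const colon = line.indexOf(":");
    if (colon === -1) continue; // skip lines that are not key: value pairs
    meta[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  }
  return { meta, body: match[2] };
}

const deploy = parseSkill(
  ["---", "name: deploy", "description: Ship the current branch", "---", "1. Run the tests.", "2. Push."].join("\n"),
);
// deploy.meta.name is "deploy"; deploy.body holds the two numbered steps
```

Only `meta.name` and `meta.description` need to reach the model for discovery; the body stays on disk until the skill is chosen.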
Here is what this means: the quality of your skill file directly determines the quality of the agent's output. A vague skill file produces vague behavior. A precise skill file produces precise behavior. Same model. Same harness. Same tools. The only variable is the spec.
This is the quality chain from The Spec Is the Product made real: Spec quality -> System quality -> Outcome quality. Every skill file you write for your Jarvis is a spec. Every CLAUDE.md is a spec. Every instruction you put in a context file is a spec. The model executes them literally.
6. Memory Is Declaratively Indexed
Claude Code's memory system lives in ~/.claude/projects/<slug>/memory/. It consists of:
- MEMORY.md: A master index file (200-line limit, 25KB max) containing one-line pointers to individual memory files
- Individual memory files: Markdown files with typed frontmatter (user, feedback, project, reference)
- An auto-discovery system that finds and attaches relevant memories at the start of each turn
The index is always loaded. Individual files are loaded when relevant. The model can write new memories, update existing ones, and delete stale ones.
Three design choices stand out:
Typed memories with structured frontmatter. Each memory has a type (user, feedback, project, reference), a name, and a description. The type tells the system when this memory is relevant. The description helps with discovery. This is not a blob of text. It is structured knowledge with metadata.
Bounded index size. The 200-line limit forces prioritization. You cannot store everything. You must decide what matters. This constraint is a feature: it prevents the context window from being consumed by memory overhead, leaving room for the actual work.
Write-through pattern. The model writes memories in a two-step process: first write the memory file, then update the index. This ensures the index stays in sync with the files. If the model writes a file but fails to update the index, the memory exists on disk but won't be discovered. This is a deliberate trade-off: consistency of the index is more important than completeness.
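The two-step write and its failure mode can be sketched against an in-memory store. This is an illustration, not the real on-disk format: the `Map` stands in for the memory directory, the index line format is invented, and `writeMemory` and `discoverable` are hypothetical names.

```typescript
type MemoryType = "user" | "feedback" | "project" | "reference";

interface Memory {
  file: string;
  type: MemoryType;
  name: string;
  description: string;
  body: string;
}

type FileStore = Map<string, string>; // path -> contents; stands in for the memory directory

function writeMemory(store: FileStore, m: Memory): void {
  // Step 1: write the memory file with typed frontmatter.
  const frontmatter = `---\ntype: ${m.type}\nname: ${m.name}\ndescription: ${m.description}\n---\n`;
  store.set(m.file, frontmatter + m.body);
  // Step 2: update the index with a one-line pointer, replacing any stale entry for this file.
  const pointer = `- [${m.name}](${m.file}): ${m.description}`;
  const lines = (store.get("MEMORY.md") ?? "")
    .split("\n")
    .filter((l) => l !== "" && !l.includes(`(${m.file})`));
  store.set("MEMORY.md", [...lines, pointer].join("\n"));
}

// Only files referenced from the index can be found; anything else is invisible.
function discoverable(store: FileStore): string[] {
  const index = store.get("MEMORY.md") ?? "";
  return Array.from(store.keys()).filter((f) => f !== "MEMORY.md" && index.includes(`(${f})`));
}
```

A file written without step 2 still exists in the store, but `discoverable` never returns it, which is exactly the trade-off described above.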
The Personal Jarvis Connection
The Personal Jarvis article describes five core components: user profile, relationship files, artifacts, transcripts, and skill files. Claude Code's memory system maps directly to this architecture:
| Jarvis Component | Claude Code Equivalent |
|---|---|
| User profile | user type memory files |
| Relationship files | project and reference type memories |
| Artifacts | Files in the project directory |
| Transcripts | Session transcripts (persisted to session.json) |
| Skill files | Skills in .claude/skills/ |
The Jarvis architecture IS the harness architecture. When we tell practitioners to build a folder of markdown files, they are building the same system that powers the most capable AI tool in the world. The only difference is scale and sophistication.
7. Hooks Make the Harness Event-Driven
Hooks are shell commands that execute at specific points in the agent lifecycle:
- PreToolUse: Fires before any tool executes. Can auto-approve, auto-deny, or inject additional context.
- PostToolUse: Fires after a tool completes. Can analyze results and trigger follow-up actions.
- SessionStart: Fires when a session begins. Can inject baseline context, check prerequisites, or configure the environment.
- SessionEnd: Fires when a session ends. Can persist state, send notifications, or clean up.
Hooks are configured in settings.json and receive structured JSON input about the triggering event. They return structured JSON output that the harness interprets.
This is what makes Claude Code extensible without modifying its source code. The entire Vercel plugin system, for example, runs through hooks: a SessionStart hook injects Vercel ecosystem knowledge, and a PreToolUse hook matches file patterns and bash commands against skill metadata to inject relevant guidance automatically.
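At its core, a hook of this kind is a function from one JSON document to another. The sketch below shows the decision logic a PreToolUse hook script might run on the text it receives. The field names (`event`, `tool`, `input`, `decision`, `reason`) are illustrative assumptions, not the actual hook schema, which should be taken from the hook documentation.

```typescript
// Illustrative shapes only; the real hook payloads use their own field names.
interface PreToolUseEvent {
  event: "PreToolUse";
  tool: string;
  input: { command?: string; filePath?: string };
}
interface HookResponse {
  decision?: "allow" | "deny" | "ask"; // omitted means "no opinion, use the normal rules"
  reason?: string;
}

// Takes the JSON text a hook would receive and returns the JSON text it would emit.
function handlePreToolUse(stdin: string): string {
  const event = JSON.parse(stdin) as PreToolUseEvent;
  let response: HookResponse = {};
  if (event.tool === "Bash" && /\bgit push\b.*\bmain\b/.test(event.input.command ?? "")) {
    response = { decision: "ask", reason: "pushes to main need approval" };
  } else if (event.tool === "Read") {
    response = { decision: "allow" };
  }
  return JSON.stringify(response);
}
```

Keeping the logic in a pure string-to-string function means the policy can be unit tested without spawning a shell, then wrapped in a thin script that reads stdin and prints the result.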
The Self-Improving Enterprise Implication
The Self-Improving Enterprise article describes a progression: self-improving humans, then self-improving AI systems, then self-improving businesses. Hooks are the mechanism that enables step two.
A hook can watch what the agent does (PostToolUse), analyze patterns, and propose improvements. "I notice you keep running the same three commands after every deployment. Should I create a skill for this?" The hook does not need to modify the harness. It operates at the boundary, observing and suggesting.
This is the recursive improvement loop from the Harness Engineering article made concrete. The harness provides hooks. Hooks enable observation. Observation enables proposals. Proposals (approved by the human) improve the harness. The improved harness provides better hooks. And the cycle continues.
8. Message Normalization: The Boundary Between Internal and External
One of the most subtle and important patterns in the codebase is normalizeMessagesForAPI(). This function sits between the harness's internal message representation and what actually gets sent to the Claude API.
Internally, messages carry rich metadata: virtual messages (display-only, never sent to the API), tool use results with structured types, thinking blocks from the model's reasoning process, oversized image/PDF references that errored, and attachment ordering metadata.
Before any API call, normalizeMessagesForAPI() strips all of this:
- Virtual messages are removed
- Attachments are reordered to satisfy API requirements
- Failed image/PDF references are cleaned out
- Thinking blocks are merged with subsequent assistant messages
- Tool results are paired correctly with their tool calls
The model never sees the harness's internal bookkeeping. It sees a clean conversation with properly formatted messages.
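A reduced version of this boundary fits in one function. The sketch below covers three of the steps (dropping virtual messages, stripping internal metadata, and keeping tool results paired with their calls) and leaves out attachment reordering and thinking-block merging. The message shapes are hypothetical, and discarding orphaned tool results is our assumption about what correct pairing requires.

```typescript
// Hypothetical internal shape: API-relevant fields plus harness bookkeeping.
interface InternalMessage {
  role: "user" | "assistant" | "tool";
  text: string;
  toolCallIds?: string[];           // on assistant messages: the calls it issued
  toolCallId?: string;              // on tool messages: the call being answered
  virtual?: boolean;                // display-only, never sent to the API
  uiState?: { collapsed: boolean }; // internal bookkeeping
}

interface ApiMessage {
  role: "user" | "assistant" | "tool";
  text: string;
  toolCallIds?: string[];
  toolCallId?: string;
}

function normalizeForApi(messages: readonly InternalMessage[]): ApiMessage[] {
  const issued = new Set<string>();
  const out: ApiMessage[] = [];
  for (const m of messages) {
    if (m.virtual) continue; // virtual messages are removed
    if (m.role === "tool") {
      // Assumption: a tool result is kept only if its call appeared earlier.
      if (m.toolCallId === undefined || !issued.has(m.toolCallId)) continue;
      out.push({ role: "tool", text: m.text, toolCallId: m.toolCallId });
      continue;
    }
    // Copy only API-relevant fields; uiState and other metadata are left behind.
    const api: ApiMessage = { role: m.role, text: m.text };
    if (m.toolCallIds) {
      api.toolCallIds = [...m.toolCallIds];
      for (const id of m.toolCallIds) issued.add(id);
    }
    out.push(api);
  }
  return out;
}
```

Building each outgoing message field by field, rather than deleting fields from a copy, means new internal metadata can never leak by accident.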
Why This Pattern Matters
This is a boundary that separates concerns cleanly. The harness can evolve its internal representation (adding new metadata, new message types, new tracking information) without breaking the API contract. The model can change its API requirements without forcing changes to the harness's internal state.
For practitioners building their own agent systems: maintain this separation. Your internal state will always be richer than what the model needs to see. Do not leak implementation details into the model's context. It wastes tokens and confuses the model.
9. Budget Tracking Across Compaction
This is an engineering detail that reveals how much thought goes into long-running agent sessions.
Claude Code tracks multiple budgets simultaneously: token budget per turn (how much context the model can use), task budget per session (total allowed cost or tokens), and auto-compact tracking (when to trigger context compaction).
The clever part: when the context gets too long and the harness compacts earlier messages into a summary, the remaining budget information is stored in the compaction summary itself. When the loop continues after compaction, it reconstructs the budget from the summary. The budget survives compaction.
This means an agent can run for hours, processing thousands of messages, compacting multiple times, and never lose track of how much work it has done and how much it is allowed to do. The budget is not a counter in memory that resets when context is compacted. It is a persistent value woven into the conversation itself.
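The idea of carrying the budget inside the summary can be sketched with plain strings. Everything here is an assumption made for illustration: the `<budget .../>` tag format, the `compact` and `restoreBudget` names, and the placeholder summary text, which in a real harness would be written by the model.

```typescript
interface Budget {
  spentTokens: number;
  limitTokens: number;
}

// Invented tag format for embedding the budget in the summary text.
const BUDGET_TAG = /<budget spent="(\d+)" limit="(\d+)"\/>/;

// Replace all but the last `keepLast` messages with one summary that carries the budget.
function compact(messages: string[], budget: Budget, keepLast: number): string[] {
  const older = messages.slice(0, Math.max(0, messages.length - keepLast));
  const recent = messages.slice(older.length);
  const summary =
    `Summary of ${older.length} earlier messages. ` +
    `<budget spent="${budget.spentTokens}" limit="${budget.limitTokens}"/>`;
  return [summary, ...recent];
}

// Rebuild the budget from the conversation alone, with no external counter.
function restoreBudget(messages: string[]): Budget | undefined {
  for (const m of messages) {
    const match = BUDGET_TAG.exec(m);
    if (match) return { spentTokens: Number(match[1]), limitTokens: Number(match[2]) };
  }
  return undefined;
}
```

Because each compaction folds the previous summary into the new one and writes a fresh tag, the newest budget is always the one the loop finds after it resumes.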
The Token Economy Connection
The Token Economy article argues that practitioners need to understand token costs as real economics. Claude Code's budget system makes this concrete: every API call is metered, every tool execution has a cost, and the system enforces hard limits. When the budget runs out, the agent stops. No exceptions.
For practitioners helping clients deploy agents: build budget tracking into your systems from day one. Agents without budgets will run up costs that surprise everyone. Agents with budgets operate within predictable economics that clients can plan around.
10. Plugins: Declarative Capability Registration
The plugin system is the harness's extension mechanism. A plugin is a declarative metadata object:
    {
      name: string
      description: string
      version: string
      skills?: SkillDefinition[]
      hooks?: HooksConfig
      mcpServers?: MCPServerDefinition[]
      isAvailable?: () => boolean
      defaultEnabled?: boolean
    }
Registration is separate from loading. The harness knows about all plugins upfront (what they can do, whether they are available, whether they are enabled), but only loads the actual implementation when needed. A plugin that provides five skills only loads those skill files when the model invokes them.
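The split between registration and loading can be sketched as a small registry. The `Plugin` fields mirror the metadata object above, minus hooks and MCP servers; the `SkillDefinition` shape with its `load` function, the `PluginRegistry` class, and the choice to treat a missing `defaultEnabled` as enabled are all our assumptions.

```typescript
// Hypothetical shape: the source names SkillDefinition but does not show its fields.
interface SkillDefinition {
  name: string;
  description: string;
  load: () => string; // reads the skill content; deferred until invocation
}

interface Plugin {
  name: string;
  description: string;
  version: string;
  skills?: SkillDefinition[];
  isAvailable?: () => boolean;
  defaultEnabled?: boolean;
}

class PluginRegistry {
  private plugins = new Map<string, Plugin>();
  private loaded = new Map<string, string>();

  // Registration stores metadata only; nothing is loaded here.
  register(plugin: Plugin): void {
    this.plugins.set(plugin.name, plugin);
  }

  // Assumption: plugins count as enabled and available unless they say otherwise.
  private active(): Plugin[] {
    return Array.from(this.plugins.values()).filter(
      (p) => (p.defaultEnabled ?? true) && (p.isAvailable?.() ?? true),
    );
  }

  // What the model sees upfront: names only.
  skillIndex(): string[] {
    return this.active().flatMap((p) => (p.skills ?? []).map((s) => `${p.name}:${s.name}`));
  }

  // Loading happens on first invocation and is cached afterwards.
  invoke(key: string): string {
    const cached = this.loaded.get(key);
    if (cached !== undefined) return cached;
    for (const p of this.active()) {
      for (const s of p.skills ?? []) {
        if (`${p.name}:${s.name}` === key) {
          const content = s.load();
          this.loaded.set(key, content);
          return content;
        }
      }
    }
    throw new Error(`unknown skill: ${key}`);
  }
}
```

The registry can answer "what exists?" at any time from metadata alone, while the cost of reading skill content is paid only for skills that are actually used.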
This is the Liberation Architecture pattern at the code level. Instead of replacing the harness's core functionality, plugins wrap additional capabilities around it. The core stays stable. The extensions evolve independently. New capabilities can be added without modifying existing code.
The Ten Patterns, Summarized
From this analysis, ten engineering patterns emerge that define what makes Claude Code's harness effective:
| # | Pattern | What It Does |
|---|---|---|
| 1 | State machine loop | Makes recovery, budgeting, and continuation explicit rather than hidden in recursion |
| 2 | Layered context | Loads the minimum sufficient context at each layer, caches stable layers, lazy-loads expensive ones |
| 3 | Dependency injection for tools | Keeps tools decoupled, testable, and composable via shared context objects |
| 4 | Permission boundaries | Encodes user intent as allow/deny rules that gate every tool execution |
| 5 | Specs as instructions | Skills are markdown files the model follows literally; spec quality determines output quality |
| 6 | Declarative memory index | Bounded, typed, structured memory with explicit relevance filtering |
| 7 | Event-driven hooks | Extensibility without modification; observation enables self-improvement |
| 8 | Message normalization | Clean boundary between internal richness and external API contract |
| 9 | Budget persistence through compaction | Long-running sessions never lose track of cost and progress |
| 10 | Declarative plugin registration | Capabilities declared upfront, loaded on demand, evolved independently |
Every one of these patterns is something a practitioner can apply at a simpler scale when building client systems, personal Jarvis setups, or enterprise agent architectures. You do not need 800,000 lines of TypeScript to use these patterns. You need the principles.
What This Means for Applied AI Practitioners
Your CLAUDE.md Is Layer 3
When you write a CLAUDE.md file for your project, you are writing Layer 3 of the context assembly. The harness discovers it, loads it, and injects it as user context on every turn. This is not a nice-to-have. It is architectural. The quality of your CLAUDE.md directly shapes the quality of every agent interaction in that project.
Your Skill Files Are Specs
Every skill file you write for your Jarvis is a spec that the model follows literally. If your skill says "ask the user for context before proceeding," the model asks. If your skill says "write the output to artifacts/," the model writes there. The spec IS the product.
Your Folder Structure Is Your Context Architecture
The harness walks your directory tree looking for CLAUDE.md files, skill files, and memory files. Where you put things determines when and how the model discovers them. A flat folder with everything in one place forces the harness to load everything. A structured hierarchy lets it load the right context for the right task.
Permission Design Is Intent Engineering
When you set up rules like "always allow reads, always ask before writes, never allow destructive bash commands," you are encoding intent into infrastructure. This is the work that the Intent Engineering article describes. It just happens to be expressed as permission rules rather than organizational strategy documents.
Budget Tracking Is Not Optional
Claude Code tracks every token and enforces hard limits. If you are building agent systems for clients, do the same. The Token Economy is not theoretical. It is a line item in your client's operating costs, and agents without budgets will eventually produce an unpleasant surprise.
The Recursive Insight
The deepest lesson from studying Claude Code's source is recursive: the harness that we used to study the harness is the harness we are studying.
The agent session where we analyzed these files was itself running through the exact architecture described above. Our CLAUDE.md files were being loaded as Layer 3 context. Our skill files were being invoked as specs. Our memory files were being prefetched and attached. The permission system was gating our tool calls. The budget tracker was metering our tokens.
This is the self-referential nature of harness engineering. You are always inside a harness. The question is whether you are aware of it, and whether you are designing it deliberately or letting it happen by default.
The MetaHarness research showed that harnesses can improve themselves. Claude Code's hook system provides the mechanism. And the practitioners who understand this architecture will be the ones who build the self-improving systems described in The Self-Improving Enterprise.
All software will be self-evolving software. We just got to see the source code of what that looks like today.
Further Reading
- Harness Engineering: The conceptual foundation this article builds on
- Context Engineering: The discipline behind Layers 2-5 of context assembly
- Intent Engineering: What the permission system is really encoding
- The Spec Is the Product: Why skill files as markdown specs matter
- Game Design: The framework that tools, permissions, and budgets implement
- Personal Jarvis: The practitioner-scale version of this architecture
- The Self-Improving Enterprise: Where recursive harness improvement leads
- The Token Economy: Why budget tracking is not optional
- Liberation Architecture: The pattern that plugins implement at the code level
- MetaHarness Paper (Stanford, MIT, Krafton, March 2026): The research proving that harness variation produces 6x performance gaps