How Anthropic Builds AI Agents That Don't Fall Apart — A Deep Dive into Harness Design
Anthropic's Prithvi Rajasekaran reveals the architecture behind long-running AI agents — and why naive single-agent designs always hit a ceiling.
A single prompt to Claude can get you a decent prototype in 20 minutes for $9. The same prompt, fed through a three-agent harness, runs for 6 hours, costs $200, and produces something that actually works. That gap — and the engineering behind it — is what Anthropic’s latest engineering post is really about.
🤔 What Is an Agent Harness?
An agent harness is the scaffolding around an AI model that orchestrates how it runs, not just what it’s asked to do.
Think of the LLM as a talented but distractible contractor. Left alone, it starts strong, drifts off-spec halfway through, and delivers something that looks finished but has broken wiring underneath. The harness is the project manager, version control system, QA department, and sprint planner — all rolled into one automated pipeline.
Without a harness, you get an agent. With the right harness, you get a system.
Harnesses typically handle:
- Context management — preventing the model from “forgetting” earlier decisions as tasks run long
- Task decomposition — breaking monolithic work into tractable chunks
- Multi-agent coordination — routing outputs between specialized agents
- Evaluation loops — catching failures before they compound
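The responsibilities above can be sketched as a minimal orchestration loop. Everything here is a hypothetical stand-in for LLM-backed agents, not an API the post describes:

```python
# Minimal sketch of a harness loop; all callables are hypothetical
# stand-ins for LLM-backed agents.

def run_harness(prompt, decompose, generate, evaluate, max_attempts=3):
    """Decompose the prompt, then push each subtask through a
    generate/evaluate loop, feeding failures back to the generator."""
    results = []
    for task in decompose(prompt):              # task decomposition
        feedback = None
        for _ in range(max_attempts):           # evaluation loop
            output = generate(task, feedback)
            verdict = evaluate(task, output)
            if verdict["passed"]:
                break
            feedback = verdict["feedback"]      # route critique back
        results.append(output)
    return results

# Stub agents, just to show the control flow:
tasks = run_harness(
    "todo app",
    decompose=lambda p: [f"{p}: sprint {i}" for i in (1, 2)],
    generate=lambda t, fb: f"built({t})",
    evaluate=lambda t, o: {"passed": True, "feedback": ""},
)
```

The point of the sketch is the shape, not the stubs: decomposition, generation, and evaluation live in separate components that the harness wires together.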
🧠 The Two Failure Modes Anthropic Identified
Anthropic’s Prithvi Rajasekaran ran extensive experiments and isolated two root causes of long-running agent failures:
1. Context Anxiety
As a context window fills up, models start rushing. They wrap up tasks prematurely, skip edge cases, and produce work that looks complete but isn’t. Claude Sonnet 4.5 exhibited this so strongly that summarization (compaction) alone couldn’t fix it.
Context resets — clearing the window entirely and handing the agent a structured summary of prior state — were the actual solution.
This is a key distinction from compaction:
| Approach | Mechanism | Preserves continuity? | Fixes context anxiety? |
|---|---|---|---|
| Compaction | Summarize old context in-place | ✅ Yes | ❌ No |
| Context Reset | Kill window, fresh agent + handoff artifact | ❌ No | ✅ Yes |
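A context reset can be sketched as writing a structured handoff artifact to disk and booting a fresh agent from it. The field names and both callables here are assumptions for illustration:

```python
import json

def context_reset(history, summarize, spawn_agent, path="handoff.json"):
    """Unlike compaction (summarizing in-place), a reset discards the
    window entirely and starts a fresh agent from a handoff artifact.
    `summarize` and `spawn_agent` are hypothetical LLM-backed calls."""
    handoff = {
        "decisions": summarize(history),                          # prior choices
        "done": [t["name"] for t in history if t["status"] == "done"],
        "todo": [t["name"] for t in history if t["status"] == "todo"],
    }
    with open(path, "w") as f:
        json.dump(handoff, f, indent=2)      # artifact survives the reset
    return spawn_agent(path)                 # fresh window, zero old tokens
```

Because the new agent starts from a clean window, nothing in it is "filling up," so the rushing behavior never triggers.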
Claude Opus 4.6 largely resolved context anxiety natively, making resets unnecessary for the latest model — but the lesson stands: know your model’s failure modes.
2. Self-Evaluation Bias
Ask any LLM to critique its own output and it will praise it. That isn't vanity; it's a training artifact. Models are rewarded for being helpful and positive, so asking a generator to also be its own critic is like hiring someone and asking them to write their own performance review.
The fix: separate the generator from the evaluator completely.
Tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work.
A dedicated evaluator, calibrated with few-shot examples and explicit grading criteria, produces feedback the generator can actually act on.
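One way to calibrate a standalone evaluator is to bake skeptical few-shot examples and explicit criteria into its prompt. This is a sketch of the pattern; the actual prompt wording Anthropic used is not public:

```python
# Hypothetical few-shot examples nudging the evaluator toward skepticism.
FEW_SHOT = [
    {"output": "Form submits but shows no confirmation.",
     "grade": "FAIL", "reason": "User cannot tell the action succeeded."},
    {"output": "All CRUD routes return correct status codes and persist.",
     "grade": "PASS", "reason": "Verified end-to-end."},
]

def build_evaluator_prompt(criteria, examples=FEW_SHOT):
    """Explicit criteria plus skeptical examples counteract the model's
    trained-in positivity when it grades someone else's work."""
    lines = ["You are a skeptical QA reviewer. Default to FAIL unless",
             "every criterion is demonstrably met.", "", "Criteria:"]
    lines += [f"- {c}" for c in criteria]
    lines.append("\nExamples:")
    for ex in examples:
        lines.append(f"Output: {ex['output']}\nGrade: {ex['grade']} ({ex['reason']})")
    return "\n".join(lines)
```

Because the evaluator never generated the work it grades, tilting it toward "FAIL by default" does not fight its training the way self-criticism does.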
🏗️ The Three-Agent Architecture
Anthropic’s final harness uses three specialized agents, each addressing a specific gap:
```mermaid
graph TD
A[👤 User Prompt\n1–4 sentences] --> B[🗺️ Planner Agent]
B --> |Full product spec\n+ design language| C[⚙️ Generator Agent]
C --> |Sprint work complete| D[🔍 Evaluator Agent]
D --> |Pass: continue| E[Next Sprint]
D --> |Fail: feedback| C
E --> F{All sprints done?}
F --> |No| C
F --> |Yes| G[✅ Final Application]
style A fill:#4A90D9,color:#fff
style B fill:#9B6EBD,color:#fff
style C fill:#5BA85A,color:#fff
style D fill:#E8A838,color:#fff
style G fill:#4A90D9,color:#fff
```
🗺️ Agent 1: The Planner
Takes a 1–4 sentence prompt and expands it into a full, ambitious product spec — including a visual design language, AI integration opportunities, and high-level technical direction.
Critically, it avoids granular implementation details. Why? Because wrong low-level specs cascade into broken implementations. The planner constrains what to build; the generator figures out how.
Design principle: Let agents own their domain. The planner owns scope; the generator owns implementation. Overlap creates drift.
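The planner's output might be captured as a structured spec that deliberately stops above the implementation layer. This schema is illustrative, not Anthropic's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class ProductSpec:
    """What to build: scope, ambition, design language.
    Deliberately no file layouts, schemas, or function names,
    because wrong low-level specs cascade into broken code."""
    summary: str
    features: list[str] = field(default_factory=list)        # high-level only
    design_language: str = ""                                # e.g. "retro, dark, pixel art"
    ai_opportunities: list[str] = field(default_factory=list)
    tech_direction: str = ""                                 # stack-level, not file-level

spec = ProductSpec(
    summary="2D retro game maker",
    features=["sprite editor", "sprite animation", "shareable export"],
    design_language="retro, pixel art",
)
```

Everything below this level of detail is the generator's domain.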
⚙️ Agent 2: The Generator
Works in sprints, one feature at a time. Before each sprint, it negotiates a sprint contract with the evaluator — agreeing on what “done” looks like before any code is written.
- Stack: React + Vite + FastAPI + SQLite/PostgreSQL
- Has git for version control
- Self-evaluates at sprint end before handing to QA
- Communication via files (not direct API calls) — one agent writes, the other reads
The sprint contract is the key innovation here. It bridges the gap between high-level user stories and testable implementation criteria, without over-specifying too early.
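A sprint contract handed off through the filesystem might look like this. The field names are assumptions; the post describes the pattern (file-based handoff, pre-agreed "done" criteria) rather than a schema:

```python
import json

def write_sprint_contract(path, feature, criteria):
    """Generator and evaluator agree on 'done' before code is written.
    The file is the channel: one agent writes, the other reads."""
    contract = {
        "feature": feature,
        "acceptance_criteria": criteria,   # testable, not vague
        "status": "agreed",
    }
    with open(path, "w") as f:
        json.dump(contract, f, indent=2)

def read_sprint_contract(path):
    with open(path) as f:
        return json.load(f)

write_sprint_contract(
    "sprint_03_contract.json",
    feature="sprite animation",
    criteria=["frames play at the chosen FPS", "animation survives save/reload"],
)
```

File-based handoff also leaves an audit trail: every contract and every piece of feedback persists on disk alongside the git history.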
🔍 Agent 3: The Evaluator
The evaluator doesn’t just read code — it uses Playwright MCP to click through the running application like a real user, testing UI flows, API endpoints, and database states.
It grades each sprint against four criteria (adapted from the frontend design work):
| Criterion | What it checks |
|---|---|
| Product depth | Does the app match the spec’s ambition? |
| Functionality | Do features actually work end-to-end? |
| Visual design | Is the interface polished and coherent? |
| Code quality | Is the implementation clean and maintainable? |
Each criterion has a hard threshold. One failure = sprint fails = generator gets specific, actionable feedback.
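The one-failure-fails-the-sprint rule is simple to express. Scores and thresholds here are illustrative, not the actual rubric values:

```python
def grade_sprint(scores, thresholds):
    """Every criterion has a hard threshold; a single miss fails the
    whole sprint and the failures go back to the generator as feedback."""
    failures = {c: s for c, s in scores.items() if s < thresholds[c]}
    return {"passed": not failures, "failures": failures}

result = grade_sprint(
    scores={"product_depth": 8, "functionality": 5,
            "visual_design": 9, "code_quality": 7},
    thresholds={"product_depth": 6, "functionality": 7,
                "visual_design": 6, "code_quality": 6},
)
# functionality misses its threshold, so the whole sprint fails
```

Hard thresholds remove the evaluator's room to average a serious failure away behind three strong scores.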
Out of the box, Claude is a poor QA agent. It identifies issues, then talks itself into approving the work anyway. Calibrating the evaluator with explicit prompting and examples takes real iteration.
🎨 The GAN-Inspired Design Loop
The generator-evaluator pairing is directly inspired by Generative Adversarial Networks (GANs). In a GAN:
- The generator tries to produce realistic outputs
- The discriminator tries to distinguish real from fake
- Adversarial tension drives both to improve
Anthropic’s harness applies the same logic: the generator tries to build quality software, the evaluator tries to catch its failures. The feedback loop forces the generator toward stronger outputs over time.
This was first tested on frontend design, where the self-evaluation problem is starkest (there’s no binary pass/fail for aesthetics). The grading criteria included:
- Design quality — does it feel like a coherent whole?
- Originality — any evidence of deliberate creative choices vs. AI defaults?
- Craft — typography, spacing, contrast (baseline competence check)
- Functionality — can users complete tasks without guessing?
Including language like “the best designs are museum quality” in the evaluator prompt directly shifted the character of generator outputs — suggesting prompt wording encodes implicit aesthetic direction.
The Dutch art museum example is telling: after 9 iterations of polished-but-expected dark-themed pages, the 10th iteration scrapped everything and rendered the museum as a 3D CSS room with perspective floors, wall-hung artwork, and doorway navigation. A creative leap that never happens in a single pass.
📊 Solo vs. Harness: The Numbers
Anthropic benchmarked both approaches on building a 2D retro game maker from one sentence:
| Harness | Duration | Cost | Result |
|---|---|---|---|
| Solo agent | 20 min | $9 | Broken play mode, layout issues |
| Full harness | 6 hr | $200 | Fully playable game with AI integration |
The solo run looked reasonable at first glance. The harness run expanded the same prompt into a 16-feature spec across 10 sprints, adding sprite animation, sound effects, AI-assisted level design, and export with shareable links — none of which the solo agent attempted.
The evaluator caught specific, actionable bugs like:
```
Rectangle fill tool: FAIL — fillRectangle function exists but
isn't triggered on mouseUp. Fills only at drag start/end points.

Frame reorder API: FAIL — FastAPI route defined after /{frame_id},
'reorder' parsed as integer frame_id, returns 422.
```
That level of specificity means the generator has something concrete to fix — not vague “it feels off” feedback.
🔄 Iterating on the Harness — What’s Load-Bearing?
One of the most valuable insights: every harness component encodes an assumption about what the model can’t do solo. As models improve, those assumptions need re-examination.
With Claude Opus 4.6, Anthropic removed the sprint construct entirely. Opus 4.6 can plan, sustain, and debug long tasks natively — the sprint decomposition was no longer load-bearing.
The updated harness breakdown for a browser-based DAW:
| Agent & Phase | Duration | Cost |
|---|---|---|
| Planner | 4.7 min | $0.46 |
| Build (Round 1) | 2 hr 7 min | $71.08 |
| QA (Round 1) | 8.8 min | $3.24 |
| Build (Round 2) | 1 hr 2 min | $36.89 |
| QA (Round 2) | 6.8 min | $3.09 |
| Build (Round 3) | 10.9 min | $5.88 |
| QA (Round 3) | 9.6 min | $4.06 |
| Total | 3 hr 50 min | $124.70 |
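The totals in the table check out, as a quick sanity check over the listed phases:

```python
# (phase, minutes, dollars) per row of the table above
rounds = [
    ("Planner",        4.7,   0.46),
    ("Build round 1",  127.0, 71.08),   # 2 hr 7 min
    ("QA round 1",     8.8,   3.24),
    ("Build round 2",  62.0,  36.89),   # 1 hr 2 min
    ("QA round 2",     6.8,   3.09),
    ("Build round 3",  10.9,  5.88),
    ("QA round 3",     9.6,   4.06),
]
minutes = sum(m for _, m, _ in rounds)
cost = round(sum(c for _, _, c in rounds), 2)
h, m = int(minutes // 60), round(minutes % 60)
print(f"{h} hr {m} min, ${cost}")
```

Worth noting how the rounds shrink: each build round is roughly half the previous one, because the QA feedback narrows to fewer and fewer gaps.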
The evaluator still caught real gaps — missing mic capture, non-interactive clips, stub-only features. But the generator ran coherently for over 2 hours without sprint scaffolding. That’s the model doing more; the harness doing less.
Rule of thumb: Start with the simplest harness that works. Add complexity only when you’ve identified a specific failure mode it addresses. Strip components when models improve past the problem they were solving.
💡 One-Sentence Intuition
An agent harness is the engineering infrastructure that transforms a capable-but-scattered LLM into a reliable, long-running autonomous system — by outsourcing context management, task decomposition, and self-evaluation to specialized agents rather than hoping one model handles everything.
🔭 What This Means Going Forward
The harness design space doesn’t shrink as models improve — it moves. Better models push the frontier of what’s achievable solo, which opens new space for harnesses to tackle even more complex targets.
Key takeaways for AI engineers:
- Read your traces — understand where your model fails on realistic tasks before adding scaffolding
- Separate generator from evaluator — never ask one agent to both produce and grade its own work
- Context resets vs. compaction — know which your model actually needs
- Sprint contracts — negotiate “done” criteria before writing code, not after
- Re-examine on model upgrades — what was load-bearing with GPT-4 may be unnecessary overhead today
Reading notes from Anthropic’s engineering blog. Full article: Harness design for long-running application development