How Anthropic Builds AI Agents That Don't Fall Apart — A Deep Dive into Harness Design
Anthropic's Prithvi Rajasekaran reveals the architecture behind long-running AI agents — and why naive single-agent designs always hit a ceiling.
A single prompt to Claude can get you a decent prototype in 20 minutes for $9. The same prompt, fed through a three-agent harness, runs for 6 hours, costs $200, and produces something that actually works. That gap — and the engineering behind it — is what Anthropic’s latest engineering post is really about.
🤔 What Is an Agent Harness?
An agent harness is the scaffolding around an AI model that orchestrates how it runs, not just what it’s asked to do.
Think of the LLM as a talented but distractible contractor. Left alone, it starts strong, drifts off-spec halfway through, and delivers something that looks finished but has broken wiring underneath. The harness is the project manager, version control system, QA department, and sprint planner — all rolled into one automated pipeline.
Without a harness, you get an agent. With the right harness, you get a system.
Harnesses typically handle:
- Context management — preventing the model from “forgetting” earlier decisions as tasks run long
- Task decomposition — breaking monolithic work into tractable chunks
- Multi-agent coordination — routing outputs between specialized agents
- Evaluation loops — catching failures before they compound
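The responsibilities above can be sketched as a minimal orchestration loop. Everything here is a hypothetical stand-in for LLM-backed agents, not an API the post describes:

```python
# Minimal sketch of a harness loop; all callables are hypothetical
# stand-ins for LLM-backed agents.

def run_harness(prompt, decompose, generate, evaluate, max_attempts=3):
    """Decompose the prompt, then push each subtask through a
    generate/evaluate loop, feeding failures back to the generator."""
    results = []
    for task in decompose(prompt):              # task decomposition
        feedback = None
        for _ in range(max_attempts):           # evaluation loop
            output = generate(task, feedback)
            verdict = evaluate(task, output)
            if verdict["passed"]:
                break
            feedback = verdict["feedback"]      # route critique back
        results.append(output)
    return results

# Stub agents, just to show the control flow:
tasks = run_harness(
    "todo app",
    decompose=lambda p: [f"{p}: sprint {i}" for i in (1, 2)],
    generate=lambda t, fb: f"built({t})",
    evaluate=lambda t, o: {"passed": True, "feedback": ""},
)
```

The point of the sketch is the shape, not the stubs: decomposition, generation, and evaluation live in separate components that the harness wires together.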
🧠 The Two Failure Modes Anthropic Identified
Anthropic’s Prithvi Rajasekaran ran extensive experiments and isolated two root causes of long-running agent failures:
1. Context Anxiety
As a context window fills up, models start rushing. They wrap up tasks prematurely, skip edge cases, and produce work that looks complete but isn’t. Claude Sonnet 4.5 exhibited this so strongly that summarization (compaction) alone couldn’t fix it.
Context resets — clearing the window entirely and handing the agent a structured summary of prior state — were the actual solution.
This is a key distinction from compaction:
| Approach | Mechanism | Preserves continuity? | Fixes context anxiety? |
|---|---|---|---|
| Compaction | Summarize old context in-place | ✅ Yes | ❌ No |
| Context Reset | Kill window, fresh agent + handoff artifact | ❌ No | ✅ Yes |
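A context reset can be sketched as writing a structured handoff artifact to disk and booting a fresh agent from it. The field names and both callables here are assumptions for illustration:

```python
import json

def context_reset(history, summarize, spawn_agent, path="handoff.json"):
    """Unlike compaction (summarizing in-place), a reset discards the
    window entirely and starts a fresh agent from a handoff artifact.
    `summarize` and `spawn_agent` are hypothetical LLM-backed calls."""
    handoff = {
        "decisions": summarize(history),                          # prior choices
        "done": [t["name"] for t in history if t["status"] == "done"],
        "todo": [t["name"] for t in history if t["status"] == "todo"],
    }
    with open(path, "w") as f:
        json.dump(handoff, f, indent=2)      # artifact survives the reset
    return spawn_agent(path)                 # fresh window, zero old tokens
```

Because the new agent starts from a clean window, nothing in it is "filling up," so the rushing behavior never triggers.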
Claude Opus 4.6 largely resolved context anxiety natively, making resets unnecessary for the latest model — but the lesson stands: know your model’s failure modes.
2. Self-Evaluation Bias
Ask any LLM to critique its own output and it will praise it. That isn't vanity; it's a training artifact. Models are rewarded for being helpful and positive, so asking a generator to also be its own critic is like hiring someone and asking them to write their own performance review.
The fix: separate the generator from the evaluator completely.
Tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work.
A dedicated evaluator, calibrated with few-shot examples and explicit grading criteria, produces feedback the generator can actually act on.
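One way to calibrate a standalone evaluator is to bake skeptical few-shot examples and explicit criteria into its prompt. This is a sketch of the pattern; the actual prompt wording Anthropic used is not public:

```python
# Hypothetical few-shot examples nudging the evaluator toward skepticism.
FEW_SHOT = [
    {"output": "Form submits but shows no confirmation.",
     "grade": "FAIL", "reason": "User cannot tell the action succeeded."},
    {"output": "All CRUD routes return correct status codes and persist.",
     "grade": "PASS", "reason": "Verified end-to-end."},
]

def build_evaluator_prompt(criteria, examples=FEW_SHOT):
    """Explicit criteria plus skeptical examples counteract the model's
    trained-in positivity when it grades someone else's work."""
    lines = ["You are a skeptical QA reviewer. Default to FAIL unless",
             "every criterion is demonstrably met.", "", "Criteria:"]
    lines += [f"- {c}" for c in criteria]
    lines.append("\nExamples:")
    for ex in examples:
        lines.append(f"Output: {ex['output']}\nGrade: {ex['grade']} ({ex['reason']})")
    return "\n".join(lines)
```

Because the evaluator never generated the work it grades, tilting it toward "FAIL by default" does not fight its training the way self-criticism does.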
🏗️ The Three-Agent Architecture
Anthropic’s final harness uses three specialized agents, each addressing a specific gap:
```mermaid
graph TD
A[👤 User Prompt\n1–4 sentences] --> B[🗺️ Planner Agent]
B --> |Full product spec\n+ design language| C[⚙️ Generator Agent]
C --> |Sprint work complete| D[🔍 Evaluator Agent]
D --> |Pass: continue| E[Next Sprint]
D --> |Fail: feedback| C
E --> F{All sprints done?}
F --> |No| C
F --> |Yes| G[✅ Final Application]
style A fill:#4A90D9,color:#fff
style B fill:#9B6EBD,color:#fff
style C fill:#5BA85A,color:#fff
style D fill:#E8A838,color:#fff
style G fill:#4A90D9,color:#fff
```
🗺️ Agent 1: The Planner
Takes a 1–4 sentence prompt and expands it into a full, ambitious product spec — including a visual design language, AI integration opportunities, and high-level technical direction.
Critically, it avoids granular implementation details. Why? Because wrong low-level specs cascade into broken implementations. The planner constrains what to build; the generator figures out how.
Design principle: Let agents own their domain. The planner owns scope; the generator owns implementation. Overlap creates drift.
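The planner's output might be captured as a structured spec that deliberately stops above the implementation layer. This schema is illustrative, not Anthropic's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class ProductSpec:
    """What to build: scope, ambition, design language.
    Deliberately no file layouts, schemas, or function names,
    because wrong low-level specs cascade into broken code."""
    summary: str
    features: list[str] = field(default_factory=list)        # high-level only
    design_language: str = ""                                # e.g. "retro, dark, pixel art"
    ai_opportunities: list[str] = field(default_factory=list)
    tech_direction: str = ""                                 # stack-level, not file-level

spec = ProductSpec(
    summary="2D retro game maker",
    features=["sprite editor", "sprite animation", "shareable export"],
    design_language="retro, pixel art",
)
```

Everything below this level of detail is the generator's domain.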
⚙️ Agent 2: The Generator
Works in sprints, one feature at a time. Before each sprint, it negotiates a sprint contract with the evaluator — agreeing on what “done” looks like before any code is written.
- Stack: React + Vite + FastAPI + SQLite/PostgreSQL
- Has git for version control
- Self-evaluates at sprint end before handing to QA
- Communication via files (not direct API calls) — one agent writes, the other reads
The sprint contract is the key innovation here. It bridges the gap between high-level user stories and testable implementation criteria, without over-specifying too early.
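A sprint contract handed off through the filesystem might look like this. The field names are assumptions; the post describes the pattern (file-based handoff, pre-agreed "done" criteria) rather than a schema:

```python
import json

def write_sprint_contract(path, feature, criteria):
    """Generator and evaluator agree on 'done' before code is written.
    The file is the channel: one agent writes, the other reads."""
    contract = {
        "feature": feature,
        "acceptance_criteria": criteria,   # testable, not vague
        "status": "agreed",
    }
    with open(path, "w") as f:
        json.dump(contract, f, indent=2)

def read_sprint_contract(path):
    with open(path) as f:
        return json.load(f)

write_sprint_contract(
    "sprint_03_contract.json",
    feature="sprite animation",
    criteria=["frames play at the chosen FPS", "animation survives save/reload"],
)
```

File-based handoff also leaves an audit trail: every contract and every piece of feedback persists on disk alongside the git history.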
🔍 Agent 3: The Evaluator
The evaluator doesn’t just read code — it uses Playwright MCP to click through the running application like a real user, testing UI flows, API endpoints, and database states.
It grades each sprint against four criteria (adapted from the frontend design work):
| Criterion | What it checks |
|---|---|
| Product depth | Does the app match the spec’s ambition? |
| Functionality | Do features actually work end-to-end? |
| Visual design | Is the interface polished and coherent? |
| Code quality | Is the implementation clean and maintainable? |
Each criterion has a hard threshold. One failure = sprint fails = generator gets specific, actionable feedback.
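The one-failure-fails-the-sprint rule is simple to express. Scores and thresholds here are illustrative, not the actual rubric values:

```python
def grade_sprint(scores, thresholds):
    """Every criterion has a hard threshold; a single miss fails the
    whole sprint and the failures go back to the generator as feedback."""
    failures = {c: s for c, s in scores.items() if s < thresholds[c]}
    return {"passed": not failures, "failures": failures}

result = grade_sprint(
    scores={"product_depth": 8, "functionality": 5,
            "visual_design": 9, "code_quality": 7},
    thresholds={"product_depth": 6, "functionality": 7,
                "visual_design": 6, "code_quality": 6},
)
# functionality misses its threshold, so the whole sprint fails
```

Hard thresholds remove the evaluator's room to average a serious failure away behind three strong scores.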
Out of the box, Claude is a poor QA agent. It identifies issues, then talks itself into approving the work anyway. Calibrating the evaluator with explicit prompting and examples takes real iteration.
🎨 The GAN-Inspired Design Loop
The generator-evaluator pairing is directly inspired by Generative Adversarial Networks (GANs). In a GAN:
- The generator tries to produce realistic outputs
- The discriminator tries to distinguish real from fake
- Adversarial tension drives both to improve
Anthropic’s harness applies the same logic: the generator tries to build quality software, the evaluator tries to catch its failures. The feedback loop forces the generator toward stronger outputs over time.
This was first tested on frontend design, where the self-evaluation problem is starkest (there’s no binary pass/fail for aesthetics). The grading criteria included:
- Design quality — does it feel like a coherent whole?
- Originality — any evidence of deliberate creative choices vs. AI defaults?
- Craft — typography, spacing, contrast (baseline competence check)
- Functionality — can users complete tasks without guessing?
Including language like “the best designs are museum quality” in the evaluator prompt directly shifted the character of generator outputs — suggesting prompt wording encodes implicit aesthetic direction.
The Dutch art museum example is telling: after 9 iterations of polished-but-expected dark-themed pages, the 10th iteration scrapped everything and rendered the museum as a 3D CSS room with perspective floors, wall-hung artwork, and doorway navigation. A creative leap that never happens in a single pass.
📊 Solo vs. Harness: The Numbers
Anthropic benchmarked both approaches on building a 2D retro game maker from one sentence:
| Harness | Duration | Cost | Result |
|---|---|---|---|
| Solo agent | 20 min | $9 | Broken play mode, layout issues |
| Full harness | 6 hr | $200 | Fully playable game with AI integration |
The solo run looked reasonable at first glance. The harness run expanded the same prompt into a 16-feature spec across 10 sprints, adding sprite animation, sound effects, AI-assisted level design, and export with shareable links — none of which the solo agent attempted.
The evaluator caught specific, actionable bugs like:
```
Rectangle fill tool: FAIL — fillRectangle function exists but
isn't triggered on mouseUp. Fills only at drag start/end points.

Frame reorder API: FAIL — FastAPI route defined after /{frame_id},
'reorder' parsed as integer frame_id, returns 422.
```
That level of specificity means the generator has something concrete to fix — not vague “it feels off” feedback.
🔄 Iterating on the Harness — What’s Load-Bearing?
One of the most valuable insights: every harness component encodes an assumption about what the model can’t do solo. As models improve, those assumptions need re-examination.
With Claude Opus 4.6, Anthropic removed the sprint construct entirely. Opus 4.6 can plan, sustain, and debug long tasks natively — the sprint decomposition was no longer load-bearing.
The updated harness breakdown for a browser-based DAW:
| Agent & Phase | Duration | Cost |
|---|---|---|
| Planner | 4.7 min | $0.46 |
| Build (Round 1) | 2 hr 7 min | $71.08 |
| QA (Round 1) | 8.8 min | $3.24 |
| Build (Round 2) | 1 hr 2 min | $36.89 |
| QA (Round 2) | 6.8 min | $3.09 |
| Build (Round 3) | 10.9 min | $5.88 |
| QA (Round 3) | 9.6 min | $4.06 |
| Total | 3 hr 50 min | $124.70 |
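The totals in the table check out, as a quick sanity check over the listed phases:

```python
# (phase, minutes, dollars) per row of the table above
rounds = [
    ("Planner",        4.7,   0.46),
    ("Build round 1",  127.0, 71.08),   # 2 hr 7 min
    ("QA round 1",     8.8,   3.24),
    ("Build round 2",  62.0,  36.89),   # 1 hr 2 min
    ("QA round 2",     6.8,   3.09),
    ("Build round 3",  10.9,  5.88),
    ("QA round 3",     9.6,   4.06),
]
minutes = sum(m for _, m, _ in rounds)
cost = round(sum(c for _, _, c in rounds), 2)
h, m = int(minutes // 60), round(minutes % 60)
print(f"{h} hr {m} min, ${cost}")
```

Worth noting how the rounds shrink: each build round is roughly half the previous one, because the QA feedback narrows to fewer and fewer gaps.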
The evaluator still caught real gaps — missing mic capture, non-interactive clips, stub-only features. But the generator ran coherently for over 2 hours without sprint scaffolding. That’s the model doing more; the harness doing less.
Rule of thumb: Start with the simplest harness that works. Add complexity only when you’ve identified a specific failure mode it addresses. Strip components when models improve past the problem they were solving.
💡 One-Sentence Intuition
An agent harness is the engineering infrastructure that transforms a capable-but-scattered LLM into a reliable, long-running autonomous system — by outsourcing context management, task decomposition, and self-evaluation to specialized agents rather than hoping one model handles everything.
🔭 What This Means Going Forward
The harness design space doesn’t shrink as models improve — it moves. Better models push the frontier of what’s achievable solo, which opens new space for harnesses to tackle even more complex targets.
Key takeaways for AI engineers:
- Read your traces — understand where your model fails on realistic tasks before adding scaffolding
- Separate generator from evaluator — never ask one agent to both produce and grade its own work
- Context resets vs. compaction — know which your model actually needs
- Sprint contracts — negotiate “done” criteria before writing code, not after
- Re-examine on model upgrades — what was load-bearing with GPT-4 may be unnecessary overhead today
Reading notes from Anthropic’s engineering blog. Full article: Harness design for long-running application development