---
name: long-run-harness
description: >
  Use when building a Planner→Generator→Evaluator multi-agent harness with the Claude SDK or Codex.
  Triggers: "build a harness", "multi-agent pipeline", "agent loop", "automate app building
  with Claude", "GAN-style agent system", "sprint-based Claude workflow", "I want Claude to
  plan, build, and evaluate automatically", "long-running orchestrator".
  NOT for: asking Claude to build an app directly, single-file edits, pure API usage questions.
---
# Long-Running App Harness — SDK Implementation
Produces a runnable harness that orchestrates Claude agents via `claude_agent_sdk`.
**You are writing the harness, not running inside it.**
Use `query()` + `ClaudeAgentOptions` for agentic loops and `tool()` + `create_sdk_mcp_server()`
for structured output. Never call `anthropic.Anthropic()` directly.
```
pip install claude-agent-sdk
```
Output structure:
```
harness/
  harness.py; config.yaml; config.py; log.py
  agents/   planner.py; generator.py; evaluator.py
  models/   state.py
  prompts/  planner.md; generator.md; evaluator.md
```
---
## Routing
| User Signal | Route |
|---|---|
| "build a harness / pipeline" | Start at Phase 1 |
| "add an evaluator" | Jump to Phase 4 |
| "add state / handoff" | Jump to Phase 5 |
| "looping forever / broken" | Check feedback loop termination in Phase 5 |
| "just explain what a harness does" | Explain concept, don't write code |
---
## Phase 1: Design the Harness
**Load:** `$SKILL_DIR/instructions/planner-questions.md`
**⚠️ HARD GATE:** Ask the design questions. Get answers to 1–3 before writing any code:
1. What does the harness build? (sets Generator tools + Evaluator rubric)
2. Python or TypeScript? (default: Python)
3. Models per agent? (default: all claude-opus-4-7; non-defaults → `config.yaml`)
Create skeleton:
```bash
mkdir -p harness/agents harness/models harness/prompts harness/harness-logs
touch harness/harness.py harness/log.py harness/agents/__init__.py harness/models/__init__.py
```
**`config.yaml` + `config.py`** — all tunable parameters here; never hardcode in agent files.
**Load:** `$SKILL_DIR/instructions/config.md` for the full `HarnessConfig` dataclass.
```python
cfg = HarnessConfig.load(Path(__file__).parent / "config.yaml")
# Always: cfg.agents.generator_model — never: "claude-opus-4-7"
```
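A minimal sketch of how such a config object might be shaped, assuming nested `agents` and `loop` sections. Only `generator_model`, `max_iterations`, and `min_iterations` appear in this skill; every other field name and default is hypothetical. `from_dict` is an illustrative stand-in for `load`, which would first parse the YAML file (for example with PyYAML's `yaml.safe_load`). The full dataclass is in `config.md`.

```python
from dataclasses import dataclass, field


@dataclass
class AgentsConfig:
    # Model default follows the Phase 1 default; planner/evaluator fields are assumed.
    planner_model: str = "claude-opus-4-7"
    generator_model: str = "claude-opus-4-7"
    evaluator_model: str = "claude-opus-4-7"


@dataclass
class LoopConfig:
    max_iterations: int = 5  # hypothetical default
    min_iterations: int = 1  # hypothetical default


@dataclass
class HarnessConfig:
    agents: AgentsConfig = field(default_factory=AgentsConfig)
    loop: LoopConfig = field(default_factory=LoopConfig)

    @classmethod
    def from_dict(cls, raw: dict) -> "HarnessConfig":
        """Build a config from already-parsed YAML; missing sections keep defaults."""
        return cls(
            agents=AgentsConfig(**raw.get("agents", {})),
            loop=LoopConfig(**raw.get("loop", {})),
        )


# Override one value, as a config.yaml entry would.
cfg = HarnessConfig.from_dict({"agents": {"generator_model": "example-model"}})
print(cfg.agents.generator_model, cfg.loop.max_iterations)
```

Keeping defaults in the dataclass means agent files only ever read `cfg.*` attributes, so swapping a model is a one-line change in `config.yaml`.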
**`models/state.py`** — write first; all other files import from it.
**Load:** `$SKILL_DIR/instructions/context-handoff.md` (`HandoffState`, `EvalResult`, `format_handoff_for_prompt`).
**Load:** `$SKILL_DIR/instructions/sprint-contracts.md` (`SprintContract` + negotiation protocol).
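For orientation, here is one way `HandoffState` and `format_handoff_for_prompt` could look. This is a sketch under assumptions: `completed_features` is the only field named in this skill, and `sprint`, `open_issues`, the JSON layout, and the `save`/`load` signatures are hypothetical. The reference version is in `context-handoff.md`.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class HandoffState:
    # Field names other than completed_features are illustrative assumptions.
    sprint: int = 1
    completed_features: list[str] = field(default_factory=list)
    open_issues: list[str] = field(default_factory=list)

    def save(self, path: Path) -> None:
        """Persist state as JSON so a crash does not lose progress."""
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path: Path) -> "HandoffState":
        """Rebuild the state from a file written by save()."""
        return cls(**json.loads(path.read_text()))


def format_handoff_for_prompt(state: HandoffState) -> str:
    """Render the state as a short text block for the next agent's prompt."""
    done = ", ".join(state.completed_features) or "none"
    issues = ", ".join(state.open_issues) or "none"
    return f"Sprint {state.sprint}\nCompleted: {done}\nOpen issues: {issues}"
```

Passing a compact text summary instead of prior messages is what lets each sprint start from a fresh `query()` call.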
**`log.py`** — dual stdout + timestamped file under `harness-logs/`.
**Load:** `$SKILL_DIR/instructions/logging.md` for full implementation.
```python
log.setup(PROJECT_DIR, label="run") # once in main()
logger = log.get() # in every agent
```
---
## Phase 2: Planner Agent
**Load:** `$SKILL_DIR/instructions/planner-questions.md` for system prompt template.
**Load:** `$SKILL_DIR/instructions/agent-patterns.md` for full `run_planner` implementation.
`run_planner(brief, session_id, cfg)` → `(reply, new_session_id)`.
`ClaudeAgentOptions(resume=session_id)` continues session without resending history.
```python
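spec, session_id = "", None
while "SPEC_COMPLETE" not in spec:
    # First turn sends the brief; later turns relay the user's answers to the Planner.
    user_input = input("[Planner asks]: ").strip() if session_id else initial_brief
    spec, session_id = run_planner(user_input, session_id, cfg)
# Strip the sentinel before persisting the finished spec.
SPEC_PATH.write_text(spec.replace("SPEC_COMPLETE", "").strip())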
```
---
## Phase 3: Generator Agent
**Load:** `$SKILL_DIR/instructions/agent-patterns.md` for `run_generator` + `self_assess` implementations.
```python
def run_generator(
    spec, contract, project_dir,
    handoff=None, strategic_framing=None, cfg=None,
) -> str: ...

# Options for the Generator's query() call:
ClaudeAgentOptions(
    model=cfg.agents.generator_model,
    allowed_tools=["Write", "Read", "Edit", "Bash", "Glob"],
    cwd=str(project_dir), permission_mode="bypassPermissions",
)
```
After generation, call `self_assess()`, which reports gaps through the `submit_assessment` MCP tool
before the Evaluator runs. If the Generator is not confident, run an extra pass with its concerns passed as `strategic_framing`.
---
## Phase 4: Evaluator Agent
**Load:** `$SKILL_DIR/instructions/agent-patterns.md` for full implementation.
**Load:** `$SKILL_DIR/instructions/evaluation-rubrics.md` for system prompt + rubric criteria.
Two roles: `run_evaluator()` (post-generation gate) + `review_contract()` (pre-sprint criteria review).
```python
# submit_grade schema: contract_results[{id, status, evidence}], rubric_scores{id: 1–5}, feedback
def run_evaluator(spec, contract, app_url, rubric_track="A", cfg=None) -> EvalResult: ...
```
**⚠️ Deterministic verdict:** Never trust `verdict` from the LLM. Recompute in
`_build_eval_result()` from `contract_results` + `rubric_scores` using `cfg.verdict.*` thresholds.
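The recomputation could look roughly like the following. It is a sketch, not the implementation in `agent-patterns.md`: the threshold names stand in for the `cfg.verdict.*` values, the `"pass"` status string is an assumption about the `submit_grade` payload, and `compute_verdict` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class VerdictThresholds:
    # Hypothetical stand-ins for the cfg.verdict.* values.
    min_rubric_score: int = 3
    min_rubric_average: float = 3.5


def compute_verdict(contract_results, rubric_scores, thresholds) -> str:
    """Return "PASS" or "FAIL" from graded data only; any LLM-supplied verdict is ignored."""
    scores = list(rubric_scores.values())
    if not contract_results or not scores:
        return "FAIL"  # nothing graded means nothing verified
    all_criteria_pass = all(r["status"] == "pass" for r in contract_results)
    floor_ok = min(scores) >= thresholds.min_rubric_score
    average_ok = sum(scores) / len(scores) >= thresholds.min_rubric_average
    return "PASS" if all_criteria_pass and floor_ok and average_ok else "FAIL"
```

Because the function is pure, the same grades always yield the same verdict, and tightening a threshold in `config.yaml` changes behavior without touching any prompt.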
---
## Phase 5: Harness Loop
**Load:** `$SKILL_DIR/instructions/iteration-loop.md` for `run_sprint`, `strategic_decision`, `git_commit`.
```python
def main():
    cfg = HarnessConfig.load(Path(__file__).parent / "config.yaml")
    log.setup(PROJECT_DIR, label="run")

def run_sprint(spec, contract, project_dir, handoff=None, cfg=None):
    iteration = 0
    while iteration < cfg.loop.max_iterations:
        iteration += 1
        # 1. Generate: wrap in try/except; a crash is a valid (poor) outcome
        # 2. Self-assess: extra pass if not confident
        # 3. git_commit("wip: sprint N iteration I")
        # 4. Evaluate → EvalResult
        # 5a. Pass + iteration < min_iterations → continue for quality improvement
        #     Pass + min_iterations met → git_commit("feat: sprint N complete") + return
        # 5b. Fail → strategic_decision() → REFINE or PIVOT → set strategic_framing
    # Exhausted: input() if isatty() else return last result
```
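The termination logic of the loop above can be isolated and tested without any agents. The sketch below injects `generate` and `evaluate` as plain callables; `run_sprint_loop`, its signature, and the framing strings are hypothetical, and the real `run_sprint` in `iteration-loop.md` also handles self-assessment, commits, and `strategic_decision()`.

```python
from typing import Callable, Optional


def run_sprint_loop(
    generate: Callable[[Optional[str]], None],
    evaluate: Callable[[], bool],
    max_iterations: int,
    min_iterations: int = 1,
) -> tuple[bool, int]:
    """Run generate/evaluate until a pass after min_iterations or the cap is hit.

    Returns (passed, iterations_used). The iteration cap guarantees termination.
    """
    framing: Optional[str] = None
    passed = False
    for iteration in range(1, max_iterations + 1):
        try:
            generate(framing)
        except Exception as exc:  # a crash counts as a failed attempt, not a fatal error
            passed = False
            framing = f"Previous attempt crashed: {exc}"
            continue
        passed = evaluate()
        if passed and iteration >= min_iterations:
            return True, iteration
        # Placeholder for strategic_decision(): real code would choose REFINE or PIVOT.
        framing = "Improve quality further" if passed else "REFINE: address evaluator feedback"
    return passed, max_iterations
```

A bounded `for` loop makes the "looping forever" failure mode structurally impossible: every path either returns or consumes one iteration.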
Git checkpoints (see `iteration-loop.md` for `git_commit()` helper):
| Event | Message |
|---|---|
| SPEC written | `feat: generate SPEC.md` |
| Contract negotiated | `chore: sprint N contract` |
| Each iteration | `wip: sprint N iteration I` |
| Sprint passes | `feat: sprint N complete` |
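A possible shape for the `git_commit()` helper, assuming it stages everything and skips empty commits. The signature and the injectable `run` parameter (useful for testing without a repository) are assumptions; the reference helper is in `iteration-loop.md`.

```python
import subprocess
from pathlib import Path
from typing import Union


def git_commit(project_dir: Union[str, Path], message: str, run=subprocess.run) -> bool:
    """Stage all changes and commit; return False when there is nothing to commit."""
    run(["git", "add", "-A"], cwd=project_dir, check=True)
    # `git diff --cached --quiet` exits 0 when nothing is staged, 1 when changes are staged.
    staged = run(["git", "diff", "--cached", "--quiet"], cwd=project_dir)
    if staged.returncode == 0:
        return False
    run(["git", "commit", "-m", message], cwd=project_dir, check=True)
    return True
```

Skipping empty commits keeps the history aligned with the table: one commit per event that actually changed files.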
Setup: `pip install claude-agent-sdk && export ANTHROPIC_API_KEY=sk-...`
Verify (run from inside `harness/`): `python -c "from agents.planner import run_planner; print('OK')"`
---
## Common Mistakes
| Mistake | Fix |
|---|---|
| Trusting LLM's `verdict` field | Recompute in `_build_eval_result()` from `contract_results` + `rubric_scores` |
| Hardcoding model names | Use `cfg.agents.generator_model` — never a string literal |
| Not calling `handoff.save()` before Evaluator | Call `handoff.save()` before running the Evaluator; otherwise a crash loses the sprint's state |
| Using `input()` in CI | Guard with `sys.stdin.isatty()` first |
| Accumulating messages across sprints | Each sprint is a fresh `query()` call — no cross-sprint history |
| Marking `completed_features` from Generator claim | Only promote after Evaluator PASS verdict |
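The `input()` guard from the table can be wrapped in a small helper. This is a sketch; `ask_or_default` is a hypothetical name, not part of the harness files.

```python
import sys


def ask_or_default(prompt: str, default: str) -> str:
    """Prompt only when attached to a terminal; otherwise return the default."""
    if sys.stdin is not None and sys.stdin.isatty():
        return input(prompt).strip() or default
    return default
```

In CI, stdin is typically not a TTY, so the helper returns `default` immediately instead of blocking or raising `EOFError`.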
---
## When to Simplify
| Component | Remove / simplify when |
|---|---|
| Planner agent | User provides SPEC directly |
| Contract negotiation | Human has strong opinions; use config-file mode |
| Generator self-assessment | Evaluator consistently passes first attempt |
| `max_iterations` → 3 | Correctness-only task, no quality/aesthetic goal |
| `min_iterations` → 1 | Early passes are always good enough |
| Refine/pivot `strategic_decision` | Single sprint or correctness task |
| `HandoffState` | Sprint fits in one context window |
| Evaluator | Task within Generator's reliable baseline |