clawhub

Agent Harness

Production-grade Agent Harness combining execution discipline, knowledge compounding, and product thinking into a single adaptive workflow. Use when: (1) bui...

SkillRank score ↗

8.3 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-06-10

obsidian-harness structures multi-agent engineering workflows into three layers (challenge, execute, compound) with auto-graded task complexity, explicit verification gates, and anti-rationalization guards. designed for production ai coding tasks with hard concurrency limits and checkpoint discipline.

structure: 9.0
trigger phrases: 9.0
procedure: 9.0
edge cases: 7.0
documentation: 8.0

original SKILL.md from clawhub

---
name: agent-harness
description: "Production-grade Agent Harness combining execution discipline, knowledge compounding, and product thinking into a single adaptive workflow. Use when: (1) building features or fixing bugs with AI agents, (2) user says 'build', 'plan', 'spec', 'review', 'ship', 'debug', (3) managing multi-step or multi-agent tasks, (4) need structured engineering workflow with quality gates. Provides: task complexity auto-grading (simple/medium/complex), anti-rationalization guards, concurrent subagent scheduling (≤4 hard limit), tool-chain continuity enforcement, context budget management, verification protocols, and experience compounding. Triggers: 'agent harness', 'engineering workflow', 'build protocol', 'multi-agent task', 'coding discipline', 'subagent orchestration'."
version: 2.0.1
---

# Agent Harness

A unified engineering harness combining execution discipline, knowledge compounding, and product thinking. Born from ~450k characters of real-world AI textbook writing and 15+ production incidents.

> **On the GAIA benchmark, scaffold design is worth a 30+ percentage-point gain**: with the same model, the HAL scaffold scores 74.6% versus ~44% for the bare model. The harness is the multiplier.

## Core Philosophy

> Agent = Model + Harness. The model provides capability; the harness provides discipline.

Three layers, one workflow:

1. **Challenge** — Is this the right thing to build?
2. **Execute** — Build it with engineering rigor
3. **Compound** — Learn from what happened

## Task Complexity Auto-Grading

Before starting any task, assess complexity. This determines which workflow steps to run.

**🟢 Simple** (bug fix, config change, small tweak)
- Skip spec/plan → Direct edit → Verify → Done

**🟡 Medium** (new feature, module, integration)
- Plan → Build incrementally → Test → Review → Done

**🔴 Complex** (architecture change, multi-module, new system)
- Full pipeline: Challenge → Spec → Plan → Build → Test → Review → Ship

When unsure, start at 🟡. Upgrade to 🔴 if you discover hidden complexity. Never downgrade mid-task.

## Layer 1: Challenge (🔴 Complex tasks only)

Before writing any code, answer these questions:

1. **Problem validity** — Is the user solving a real problem?
2. **Simplest approach** — Is there a simpler way?
3. **Scope clarity** — Can you explain "done" in one sentence?
4. **Risk assessment** — What's the worst outcome if this goes wrong?

Output: A one-paragraph problem statement the user confirms before proceeding.

## Layer 2: Execute

### Spec (🟡🔴 only)

- **Goal**: One sentence describing the outcome
- **Interface**: Inputs, outputs, API contracts
- **Constraints**: What you will NOT do
- **Acceptance criteria**: How to verify it works (must be testable)

### Plan (🟡🔴 only)

Break the spec into atomic tasks:
- Each task modifies ≤3 files
- Each task has a clear verification step
- Tasks ordered by dependency (independent tasks can parallelize)

### Build

Execute tasks incrementally. After each task:
1. Verify the task works (run it, test it, check the output)
2. Checkpoint progress to file
3. Only then move to the next task

**Critical rules:**
- Never modify code you haven't read first
- Don't add features beyond what was asked
- Don't refactor "while you're at it"
- If tests fail, report honestly — don't claim success

### Verify

Every deliverable must have **evidence**, not just "looks good":

| Deliverable type | Required evidence |
|---|---|
| Code change | Tests pass (show output) |
| Config change | Restart + verify (show status) |
| File generation | `wc -l` + `grep` key content |
| API integration | Show actual response |
| Documentation | Spot-check 3 claims for accuracy |

🔴 **Reading is not verification. Run it.**
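
For the file-generation row, the `wc -l` plus `grep` check can be approximated in Python. This is a minimal sketch; `file_evidence` and the shape of its return value are hypothetical names chosen for illustration, not part of the skill.

```python
from pathlib import Path


def file_evidence(path, required_strings):
    """Collect evidence for a generated file: existence, line count, missing content.

    The line count uses splitlines(), so it can differ from `wc -l` by one
    when the file has no trailing newline.
    """
    p = Path(path)
    if not p.is_file():
        return {"exists": False, "lines": 0, "missing": list(required_strings)}
    text = p.read_text(encoding="utf-8")
    return {
        "exists": True,
        "lines": len(text.splitlines()),
        # Plain substring search, the equivalent of `grep -F` per keyword.
        "missing": [s for s in required_strings if s not in text],
    }
```

An empty `missing` list together with a plausible line count is the kind of evidence the table asks for; a non-empty list means the deliverable is not yet verified.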

### Review (🟡🔴 only)

Self-review from 5 dimensions:
1. **Correctness** — Does it do what was asked?
2. **Edge cases** — Empty input, huge input, concurrent access?
3. **Security** — Injection points, leaked secrets, missing auth?
4. **Performance** — Will it work at 10x scale?
5. **Maintainability** — Will someone understand this in 6 months?

### Ship (🔴 only)

Pre-ship checklist:
- [ ] All tests pass
- [ ] Rollback plan exists (undo in <5 min?)
- [ ] Feature flag or gradual rollout if risky
- [ ] Monitoring covers the new code path

## Layer 3: Compound

After completing any task, spend 30 seconds on:

1. **What broke?** — Errors, retries, unexpected behavior? → Record the specific lesson
2. **What was slow?** — Bottlenecks? → Note them
3. **What would you do differently?** — Better approach with hindsight?

Only record **specific, actionable lessons**. Not generic advice.

**Good**: "Bedrock throttles at >4 concurrent requests. Use model rotation or serial execution."
**Bad**: "Remember to handle API limits properly."

## Anti-Rationalization Table

| Your excuse | Why it's wrong | Do this instead |
|---|---|---|
| "Too simple to need tests" | 40% of P0 incidents come from "too simple" code | Write the test. It takes 2 minutes. |
| "I already checked, looks fine" | Reading ≠ verifying | Run it. `ls`, `wc -l`, `grep`, actual execution. |
| "I'll write tests after the feature" | You won't. Test debt only grows. | Write the test NOW. |
| "This old code looks unused, I'll delete it" | Chesterton's Fence: understand before removing | `git blame` first. Ask why it exists. |
| "It should work" | "Should" is not evidence | Provide logs, output, or data. |
| "Let me refactor while I'm here" | Scope creep. | File a separate TODO for the refactor. |
| "I'll handle errors later" | Error handling IS the feature in production | Handle errors now. |
| "The context is too long, I'll skip details" | Skipping details = skipping correctness | Checkpoint to file, compact context, continue with full fidelity. |
| "I already ran it once, it should still work" | Stateful systems change. | Run it again. Every time. |

## Concurrent Subagent Scheduling

**Hard limits:**
- ≤4 subagents in parallel (hard limit)
- System hard ceiling: 8
- Need 5 or more? Re-slice the work into sequential batches first
- Always check the current count before spawning: `subagents(action=list)`
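
The re-slicing rule can be sketched as plain list chunking. The function names below are hypothetical, and the `running` count is assumed to come from the platform's subagent listing, which this sketch does not call.

```python
MAX_PARALLEL = 4  # the harness limit stated above


def free_slots(running, limit=MAX_PARALLEL):
    """How many more subagents may be spawned right now."""
    return max(0, limit - running)


def slice_into_batches(jobs, limit=MAX_PARALLEL):
    """Split a job list into sequential batches of at most `limit` jobs each."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [jobs[i:i + limit] for i in range(0, len(jobs), limit)]
```

With seven jobs this yields one batch of four followed by one batch of three, so a fifth subagent never runs alongside the first four.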

**Task delegation rules:**
- Instructions must be self-contained (paste content directly, don't reference files)
- Each subagent writes to its own independent output file
- Subagents never communicate directly — everything goes through coordinator
- Use `sessions_yield` after spawning, not a poll loop

**After yield returns — mandatory checks:**
1. `subagents(action=list)` — confirm all spawned subagents ended
2. `ls` output files — verify files exist with expected mtimes
3. If any subagent missing or no output file → investigate, don't assume success

> Why: OpenClaw subagent completion announce has a known race condition. Never rely on announce as the sole signal. Active verification is the backup system.
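
The file-side part of these checks might look like the sketch below. `missing_outputs` is a hypothetical helper; treating an empty file as a failure is an added assumption, since the text only asks for existence and expected mtimes.

```python
import os


def missing_outputs(expected_paths, spawned_at):
    """Return (path, reason) pairs for output files that cannot be trusted.

    spawned_at is the spawn time in epoch seconds. A file is flagged when it
    is missing, empty (assumption), or last modified before the spawn.
    """
    problems = []
    for path in expected_paths:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            problems.append((path, "missing"))
            continue
        if st.st_size == 0:
            problems.append((path, "empty"))
        elif st.st_mtime < spawned_at:
            problems.append((path, "stale"))
    return problems
```

Any non-empty result is a reason to investigate before treating the batch as complete.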

**Failure classification (before retrying):**
- Design failure? → Fix the spec first
- Alignment failure? → Clarify the instruction
- Verification failure? → The work was done but not confirmed
- See [references/mast-failure-taxonomy.md](references/mast-failure-taxonomy.md) for full taxonomy

## Tool-Chain Continuity (🔴 Critical)

Every tool call return must be followed by one of:
- Next tool call
- Progress message to user
- `sessions_yield`

**Never**: respond with "I'll continue..." and then have no tool call.

Pre-tool-return self-check:
- [ ] Task complete? No → what's the next tool call?
- [ ] Waiting for external input? → Send message explaining + yield
- [ ] "Thinking about next step"? → Danger signal. Pick an action NOW.

## Context Budget Management

| Water level | Mode | Action |
|---|---|---|
| < 70% | 🟢 Normal | Full mode, observation masking always on |
| 70–85% | 🟡 Auto-Concise | No new large files, tool output truncated, subagent instructions <1500 chars |
| 85–95% | 🟠 Preservation | No files >100 lines, force checkpoint to memory, delegate reads to subagent |
| > 95% | 🔴 Emergency | Flush state, alert user to /reset, stop accepting new tasks |
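
The table can be read as a simple threshold function. A minimal sketch: the table lists 85% in two rows, so assigning exactly 85% to the stricter Preservation mode is an assumption, as is the function name.

```python
def context_mode(usage):
    """Map context usage (a fraction between 0.0 and 1.0) to a mode name."""
    if usage < 0.70:
        return "Normal"
    if usage < 0.85:
        return "Auto-Concise"
    if usage <= 0.95:
        return "Preservation"  # assumption: the shared 85% boundary goes to the stricter mode
    return "Emergency"
```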

**Observation Masking** (apply immediately after consuming any tool output):
- After reading a file and extracting conclusions: don't re-quote the raw content
- After exec output: keep only key lines
- After subagent delivery: extract deliverable + quality verdict, discard process noise
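
For exec output, "keep only key lines" could be approximated with a keyword filter. This is a sketch under assumptions: the keyword pattern and the default cap of 10 lines are illustrative choices, not values from the skill.

```python
import re

# Assumed notion of a "key line"; tune per tool.
KEY_LINE = re.compile(r"error|fail|warn|passed|exit code", re.IGNORECASE)


def mask_output(output, max_lines=10):
    """Keep lines matching KEY_LINE plus the final line, capped at max_lines."""
    lines = output.splitlines()
    kept = [ln for ln in lines if KEY_LINE.search(ln)]
    # The last line often carries the summary, so keep it even without a keyword.
    if lines and lines[-1] not in kept:
        kept.append(lines[-1])
    return kept[-max_lines:]
```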

## Critical Safety Rules

🔫 **Never restart your own process from inside an agent turn.**
- ❌ `systemctl restart <service>`, `pkill <process>`, `gateway restart` in cron prompts
- ✅ Use the platform's safe restart tool (e.g., `gateway` tool's `restart` action)
- Why: Agent terminal runs inside the gateway process. Restarting the service = SIGKILL yourself.

🔫🔫 **Never put restart commands in cron job prompts.**
- once job + agent turn + restart = suicide loop: cron fires → agent runs → restart kills agent → turn never completes → scheduler sees incomplete once job → re-fires on next boot → infinite loop
- Restart/self-check logic must live in an external wrapper (systemd ExecStartPost= or standalone systemd-run unit), completely outside the agent process.
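
One way to enforce this rule is to lint cron prompts before they are scheduled. The sketch below only covers the three commands named above; the helper name is hypothetical and a real deny-list would likely need more patterns.

```python
import re

# Patterns taken from the examples above.
RESTART_PATTERNS = [
    re.compile(r"\bsystemctl\s+restart\b"),
    re.compile(r"\bpkill\b"),
    re.compile(r"\bgateway\s+restart\b"),
]


def find_restart_commands(prompt):
    """Return the restart-like commands found in a cron prompt, in pattern order."""
    hits = []
    for pattern in RESTART_PATTERNS:
        match = pattern.search(prompt)
        if match:
            hits.append(match.group(0))
    return hits
```

A non-empty result means the prompt should be rejected and the restart moved to an external wrapper.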

## Verification Protocol

For important deliverables, use an independent verifier:

1. Verifier does NOT read the original requirements
2. Verifier only reads the output/deliverable
3. Verifier independently assesses: correct? complete? well-formed?
4. Core principle: **"The implementer is an LLM. Reading is not verification. Run it."**

## Checkpoint Protocol

Protect progress against crashes:

1. **Write to file after each step** — Don't accumulate results in memory
2. **Design tasks as idempotent** — Re-running produces the same result
3. **Only retry the failed step** — Don't restart from scratch
4. **Progress must be observable** — `ls` shows what's done, not model memory

See [references/checkpoint-patterns.md](references/checkpoint-patterns.md) for detailed patterns.
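
The four rules above can be illustrated with a small resumable runner. This is a sketch, not the pattern from the referenced file: the one-JSON-file-per-step layout and the name `run_with_checkpoints` are assumptions.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_dir):
    """Run (name, func) steps in order, skipping steps that already have a result file.

    Each result is written to <checkpoint_dir>/<name>.json as soon as the step
    finishes, so a re-run after a crash resumes at the first unfinished step.
    """
    out = Path(checkpoint_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, func in steps:
        marker = out / f"{name}.json"
        if marker.exists():
            # Already done: load the stored result instead of re-running.
            results[name] = json.loads(marker.read_text(encoding="utf-8"))
            continue
        result = func()
        # Write to a temp name first so a crash mid-write leaves no
        # partial checkpoint under the final name.
        tmp = marker.with_suffix(".tmp")
        tmp.write_text(json.dumps(result), encoding="utf-8")
        tmp.replace(marker)
        results[name] = result
    return results
```

Listing the checkpoint directory shows which steps are done, which keeps progress observable without relying on model memory.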

## Known Tool Pitfalls

- **`\n` literal in exec/write content**: On some platforms, multiline scripts passed as strings get `\n` treated as literal characters, not newlines. Always use real line breaks. Verify with `read` after writing.
- **Concurrent writes**: Multiple subagents writing to the same file = corruption. Each subagent must have its own output file.
- **Reading ≠ Verifying**: `grep` and `wc -l` are faster than `read` for verification. Use them.
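
The first pitfall can be caught with a quick heuristic after writing a script. This is a rough sketch with a hypothetical helper name: it flags files that contain the two-character sequence backslash-n but no real line breaks, and it will miss mixed cases.

```python
from pathlib import Path


def looks_flattened(path):
    """Heuristic: True if the file has literal backslash-n sequences and no real line breaks."""
    text = Path(path).read_text(encoding="utf-8")
    literal = text.count("\\n")            # backslash followed by 'n', two characters
    real = text.rstrip("\n").count("\n")   # actual line breaks inside the content
    return literal > 0 and real == 0
```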

## Quick Reference

```
🟢 Simple:  Edit → Verify → Done
🟡 Medium:  Plan → Build → Test → Review → Done
🔴 Complex: Challenge → Spec → Plan → Build → Test → Review → Ship → Compound
```

After every tool call: next action or yield. Never stall.


Agent Harness

A unified engineering harness combining execution discipline, knowledge compounding, and product thinking. Born from ~450k characters of real-world AI textbook writing and 15+ production incidents.

GAIA benchmark shows scaffold design = 30pp+ performance boost. Same model, HAL scaffold 74.6% vs bare model ~44%. The harness is the multiplier.

intent

Use this skill when building features, fixing bugs, planning specs, reviewing code, shipping changes, or debugging with AI agents. Triggers on "agent harness", "engineering workflow", "build protocol", "multi-agent task", "coding discipline", or "subagent orchestration". The harness wraps three layers (Challenge, Execute, Compound) around agent work to enforce correctness, prevent rationalization, and extract lessons from every task. Run this skill whenever you need structured engineering discipline with quality gates, especially for multi-step or multi-agent tasks.

inputs

Agent execution environment:

Current context window percentage (auto-tracked by platform)
Subagent availability (subagents list returns count and status)
File system access (read, write, append)
Execution tools (shell, read, write, append, ls, wc, grep)
External APIs (Bedrock, GitHub, etc. if needed for the task)

Task inputs:

User request (build, plan, spec, review, ship, debug command)
Existing code or files (path to repo, config, docs)
Acceptance criteria (if provided; harness will infer if missing)
Constraints (budget, timeline, risk tolerance, rollback window)

External connections (if applicable):

Bedrock API (env var: AWS_BEDROCK_REGION, AWS_BEDROCK_MODEL_ID; rate limit: 4 concurrent requests max, use model rotation or serial fallback)
GitHub (env var: GITHUB_TOKEN, scope: repo, read:org; auth expires, check before large operations)
Monitoring/Logging (if shipping to prod, verify endpoint and credentials before merge)

Context budget:

Current token usage (tracked before each tool call)
Observation masking rules (defined in the Context Budget Management section below)

procedure

Step 1: Auto-grade task complexity (all tasks)

Input: User request + existing code (if any)

Output: Complexity level (🟢 simple, 🟡 medium, 🔴 complex) + reasoning

How to grade:

🟢 Simple: bug fix, config change, small tweak. Single file, no architecture impact. ~1 file, <50 lines changed.
🟡 Medium: new feature, new module, integration. Spans 2-5 files, clear API contract, no system redesign.
🔴 Complex: architecture change, multi-module redesign, new system. Spans 5+ files, unknown unknowns, refactors existing abstractions.

Decision: If unsure, pick 🟡. Upgrade to 🔴 during execution if you discover hidden dependencies. Never downgrade.
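
The grading guidance can be sketched as a small function. The names are hypothetical, and because the text lists 5 files under both 🟡 and 🔴, treating exactly 5 files as medium is an assumption.

```python
from enum import IntEnum


class Complexity(IntEnum):
    SIMPLE = 1   # 🟢
    MEDIUM = 2   # 🟡
    COMPLEX = 3  # 🔴


def grade(files_touched=None, lines_changed=None, architecture_change=False):
    """Return a Complexity level from rough size signals."""
    if architecture_change:
        return Complexity.COMPLEX
    if files_touched is None or lines_changed is None:
        return Complexity.MEDIUM  # unsure: pick 🟡
    if files_touched > 5:
        return Complexity.COMPLEX
    if files_touched <= 1 and lines_changed < 50:
        return Complexity.SIMPLE
    return Complexity.MEDIUM


def regrade(current, observed):
    """Apply a mid-task re-assessment: upgrades stick, downgrades are ignored."""
    return max(current, observed)
```

`regrade` encodes the "never downgrade" rule: once a task is 🔴, a later, more optimistic assessment does not relax the workflow.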

Next step: Go to Step 2.

Step 2: Run Layer 1 (Challenge) if 🔴 complex (conditional)

Input: Task description, problem statement

Output: One-paragraph problem statement confirmed by user

Do this only if complexity = 🔴:

Answer these questions before any code:

Problem validity: Is the user solving a real problem? (Not a nice-to-have proxy?)
Simplest approach: Is there a simpler way to solve this?
Scope clarity: Can you explain "done" in one sentence? (If no, scope is fuzzy.)
Risk assessment: What's the worst outcome if this goes wrong? (Rollback time? Data loss? User impact?)

Write a one-paragraph problem statement. Show it to the user. Wait for confirmation.

If user rejects: Stop, don't proceed. Refine the problem statement and re-submit.

If user confirms: Go to Step 3.

If complexity is 🟢 or 🟡: Skip this step, go to Step 3.

Step 3: Run Layer 2 (Execute), Spec phase (if 🟡 or 🔴)

Input: Complexity level, problem statement (if 🔴), user request

Output: Spec document (written to spec.md or inline if <200 words)

Do this for 🟡 and 🔴:

Write a spec with four sections:

Goal: One sentence describing the outcome.
Interface: Inputs, outputs, API contracts. What calls what? What returns what?
Constraints: What you will NOT do. (Explicit scope boundaries prevent creep.)
Acceptance criteria: How to verify it works. Must be testable (run a command, check output, assert a property).

Show the spec to the user before proceeding.

If 🟢: Skip spec, go to Step 4 (Plan).

Next step: Go to Step 4.

Step 4: Run Layer 2 (Execute), Plan phase (if 🟡 or 🔴)

Input: Spec, existing codebase

Output: Task list (written to plan.md)

Do this for 🟡 and 🔴:

Break the spec into atomic tasks:

Each task modifies ≤3 files.
Each task has a clear, runnable verification step (not "review the code").
Tasks ordered by dependency. Independent tasks can run in parallel (future: subagent delegation).

Example format:

Task 1: [File] Create config schema
  Changes: new file `config.schema.json`
  Verify: `jq . < config.schema.json` (valid JSON)
  
Task 2: [Feature] Load config on startup
  Changes: `app.py` lines 12-18
  Verify: `python -m pytest tests/test_config.py -v` (pass)
  Depends on: Task 1

Show the plan to the user. Refine if needed.
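
The plan rules lend themselves to a mechanical check before the plan is shown. The sketch below is one possible shape; the `Task` fields and function names are assumptions, and the dependency ordering uses the standard library's `graphlib`.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Task:
    name: str
    files: list
    verify: str  # a runnable command, e.g. a pytest invocation
    depends_on: list = field(default_factory=list)


def validate_plan(tasks):
    """Return a list of rule violations; an empty list means the plan passes."""
    problems = []
    names = {t.name for t in tasks}
    for t in tasks:
        if len(t.files) > 3:
            problems.append(f"{t.name}: touches {len(t.files)} files (limit is 3)")
        if not t.verify.strip():
            problems.append(f"{t.name}: no verification step")
        for dep in t.depends_on:
            if dep not in names:
                problems.append(f"{t.name}: unknown dependency {dep!r}")
    return problems


def parallel_batches(tasks):
    """Group task names into dependency-ordered batches of independent tasks."""
    sorter = TopologicalSorter({t.name: set(t.depends_on) for t in tasks})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

Tasks that land in the same batch have no dependency on each other, so they are the candidates for parallel execution.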

If 🟢: Skip plan, go to Step 5 (Build).

Next step: Go to Step 5.

Step 5: Build (all tasks)

Input: Plan (or task description if 🟢), codebase, spec (if written)

Output: Code changes, config changes, generated files

Critical rules:

Never modify code you haven't read first.
Don't add features beyond what was asked.
Don't refactor "while you're at it" (file a TODO instead).
If tests fail, report honestly. Don't claim success.

For each task in the plan (or the single 🟢 task):

Read the files you'll modify (understand the code, don't blindly edit).
Make the change (edit, create, or append).
Run the verification step immediately (see Step 6 below).
Checkpoint progress to file (append to progress.md: "Task N complete. Changed files: X, Y, Z. Status: PASS.").
Only then move to the next task.

If a task fails verification: Debug, fix, re-verify. Only mark PASS when verification succeeds.
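
The verify-then-checkpoint loop could be sketched as follows. It assumes each task's verification step is a command given as an argument list. The PASS line follows the progress.md wording above, while the FAIL line and the function names are illustrative.

```python
import subprocess
from pathlib import Path


def run_task(number, changed_files, verify_cmd, progress_path="progress.md"):
    """Run one task's verification command and append the outcome to the progress file.

    verify_cmd is an argument list (no shell involved). Returns True on exit code 0.
    """
    result = subprocess.run(verify_cmd, capture_output=True, text=True)
    passed = result.returncode == 0
    files = ", ".join(changed_files)
    if passed:
        line = f"Task {number} complete. Changed files: {files}. Status: PASS."
    else:
        line = f"Task {number} not verified. Changed files: {files}. Status: FAIL."
    with Path(progress_path).open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return passed


def build(tasks, progress_path="progress.md"):
    """Verify tasks in order; return the number of the first failing task, or None."""
    for number, (changed_files, verify_cmd) in enumerate(tasks, start=1):
        if not run_task(number, changed_files, verify_cmd, progress_path):
            return number  # stop here: debug and re-verify before moving on
    return None
```

Stopping at the first failure mirrors the rule that the next task only starts once the current one verifies.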

Next step: Go to Step 6.

Step 6: Verify every deliverable (all tasks)

Input: Code change, config change, file, API response, or doc

Output: Evidence (test output, command output, grep results, response JSON)

Core rule: Reading is not verification. Run it.

| Deliverable type | Required evidence |
|---|---|
| Code change | Tests pass (show output, `pytest -v` or equivalent) |
| Config change | Restart + verify (show status, `systemctl status` or equivalent) |
| File generation | Line count + grep key content (`wc -l`, `grep "keyword"`) |
| API integration | Show actual response (curl output, JSON, status code) |
| Documentation | Spot-check 3 claims for accuracy (e.g., "does the code actually do what the doc says?") |

For each deliverable:

Run the verification step from the plan (or define one if 🟢).
Capture the output (copy-paste or screenshot).
Verify the output matches the acceptance criteria.
If no match: debug and re-run, don't move on.
If match: record the evidence in progress file and move to the next task.

Next step: If all tasks pass verification, go to Step 7. If any fail, debug and re-run Step 5-6 for that task.

Step 7: Review (if 🟡 or 🔴)

Input: All code changes, all verification output

Output: Review summary (written to review.md)

Do this for 🟡 and 🔴:

Self-review from 5 dimensions:

Correctness: Does it do what was asked? Check against spec + acceptance criteria.
Edge cases: Empty input, huge input, concurrent access, network timeout, auth expiry? Can you think of 3 ways it could break?
Security: Injection points, leaked secrets, missing auth, overpermissioned API keys, SQL injection, unvalidated input?
Performance: Will it work at 10x scale? 100x? Any N+1 queries, infinite loops, unbounded memory?
Maintainability: Will someone understand this in 6 months? Are there comments on non-obvious logic? Can you find the code by searching a keyword?

For each dimension, pick one issue if you find one. Write it down with a fix (don't leave "TODO: think about this").

If 🟢: Skip review and ship, go to Step 9.

Next step: Go to Step 8.

Step 8: Ship (if 🔴 only)

Input: All code changes, all verification output, review summary

Output: Deployment confirmation + rollback confirmation

Do this for 🔴 only:

Pre-ship checklist:

All tests pass (show output from Step 6).
Rollback plan exists (undo in <5 min? How? Write it down.).
Feature flag or gradual rollout if risky (new API endpoint? New database schema? Gate it.).
Monitoring covers the new code path (log key events, set up alerts).

After checks pass:

Run the deploy command or merge to main.
Verify the deployment succeeded (check logs, health check, or smoke test).
Record the deploy time and status in progress file.

If 🟢 or 🟡: Skip ship, go to Step 9.

Next step: Go to Step 9.

Step 9: Compound (all tasks)

Input: What happened during the task (errors, retries, surprises, bottlenecks)

Output: Lesson recorded in lessons.md

After completing any task, spend 30 seconds on:

What broke? Errors, retries, unexpected behavior? Record the specific lesson.
What was slow? Bottlenecks? Note them (e.g., "Bedrock throttles at >4 concurrent requests. Use model rotation or serial execution.").
What would you do differently? Better approach with hindsight?

Record only specific, actionable lessons.

Good: "Bedrock throttles at >4 concurrent requests. Use model rotation or serial execution."

Bad: "Remember to handle API limits properly."

Write to lessons.md. If this is a lesson worth sharing, suggest it as a future skill refinement.

Task complete. Go to outcome signal section below.

decision points

If complexity is 🟢 (simple):

Skip Challenge (Step 2), Spec (Step 3), Plan (Step 4), Review (Step 7), Ship (Step 8).
Run only: Build (Step 5), Verify (Step 6), Compound (Step 9).
Time to completion: <10 minutes.

If complexity is 🟡 (medium):

Skip Challenge (Step 2), Ship (Step 8).
Run: Spec (Step 3), Plan (Step 4), Build (Step 5), Verify (Step 6), Review (Step 7), Compound (Step 9).
Time to completion: 30 minutes to 2 hours.

If complexity is 🔴 (complex):

Run all steps: Challenge (Step 2) through Ship (Step 8), plus Compound (Step 9).
Time to completion: 2+ hours (depends on scope).
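
The three branches above amount to a lookup from complexity level to the steps that run. A minimal sketch with hypothetical names; Step 1 (grading) is left out because it produces the level used here.

```python
# Steps 2-9 of the procedure, in order.
PIPELINE = ["Challenge", "Spec", "Plan", "Build", "Verify", "Review", "Ship", "Compound"]

# Steps skipped at each level, as listed above.
SKIPPED = {
    "simple": {"Challenge", "Spec", "Plan", "Review", "Ship"},
    "medium": {"Challenge", "Ship"},
    "complex": set(),
}


def steps_for(level):
    """Return the ordered steps to run for 'simple', 'medium', or 'complex'."""
    skipped = SKIPPED[level]
    return [step for step in PIPELINE if step not in skipped]
```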

If user rejects the problem statement (Challenge step):

Stop. Do not proceed to Spec.
Refine the problem statement with the user.
Re-submit for confirmation.

If a verification step fails:

Debug the failure (re-read code, check logs, test locally).
Fix the code or test.
Re-run the verification step.
Only mark the task PASS if verification succeeds.
Do not proceed to the next task until current task verifies.

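If context budget exceeds 70%:

Activate Auto-Concise mode (see "Context Budget Management").
No new large files, truncate tool output, keep subagent instructions under 1500 chars.

If context budget exceeds 85%:

Activate Preservation mode.
No files over 100 lines, force a checkpoint to memory, delegate reads to a subagent if available.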

If context budget exceeds 95%:

Activate Emergency mode.
Flush state to file, alert user to /reset, stop accepting new tasks.

If subagent count is at or near capacity (≤4 hard limit):

Check subagents list before spawning new subagents.
If 4 subagents already running, re-slice planned work into sequential batches.
Do not spawn the 5th subagent.

If a task requires reading a large file and context is tight (70-85%):

Use grep and wc -l instead of read to extract key info.
Checkpoint findings to file.
Clear observations from context (observation masking).

If tool call has no follow-up action:

Danger signal. Stop and pick an action immediately.
Never respond with "I'll continue..." without a next tool call.

output contract

Deliverables by complexity level:

🟢 Simple

Single file with change (or new file created).
Verification output (test pass, status check, or grep result).
progress.md with one entry: "Task complete. Changed files: X. Status: PASS."
lessons.md if any surprises (optional).

🟡 Medium

spec.md: Goal, Interface, Constraints, Acceptance criteria.
plan.md: Numbered tasks, dependencies, verification steps.
Changed files (2-5 files modified or created).
Verification output for each task (tests pass, curl output, etc.).
progress.md: One line per task, in the form "Task N: [name]. Changed files: X, Y. Status: PASS."
review.md: 5-dimension review summary (correctness, edge cases, security, performance, maintainability).
lessons.md: Specific, actionable lessons.

🔴 Complex

challenge.md: Problem statement (one paragraph).
spec.md: Goal, Interface, Constraints, Acceptance criteria.
plan.md: Numbered tasks with dependencies, verification steps.
Changed files (5+ files modified or created).
Verification output for each task (tests, curl, logs, grep results).
progress.md: One line per task.
review.md: 5-dimension review.
Deployment confirmation (log output, timestamp, rollback plan).
lessons.md: Specific, actionable lessons.

File locations:

All files written to current working directory or task-specific folder.
Use ls to confirm files exist.
Use wc -l and grep to spot-check content.

Data formats:

Markdown for docs (*.md).
JSON for API responses (verified with jq).
Shell output for command verification (logged as-is).
Test output for code verification (pytest, unittest, or equivalent runner output).

outcome signal

Task is complete when:

Complexity correctly assessed: User agrees on 🟢, 🟡, or 🔴 level (implicit for 🟢).
Problem statement confirmed (🔴 only): User says "yes, that's the problem."
Spec confirmed (🟡🔴 only): User says "yes, this is what we're building."
All tasks verified: Every task shows evidence (test pass, command output, grep result, API response).
Review complete (🟡🔴 only): Review summary filed with no blocking issues.
Deployment confirmed (🔴 only): Deploy logs show success, rollback plan documented.
Lessons recorded: At least one specific lesson in lessons.md (or "no surprises" if genuinely clean run).

User-facing signals:

All task verification output is visible and matches acceptance criteria.
progress.md shows all tasks with PASS status.
No failing tests, no pending TODOs, no "I'll handle this later" statements.
If shipping (🔴), deploy confirmation logged with timestamp and rollback plan.
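
The progress.md signal can be checked mechanically. This sketch assumes every task line carries a "Status: <WORD>" field as in the output contract; the function name and return shape are hypothetical.

```python
import re

STATUS = re.compile(r"Status:\s*(\w+)")


def progress_summary(text):
    """Count task lines in progress.md content and report whether all are PASS."""
    statuses = []
    for line in text.splitlines():
        match = STATUS.search(line)
        if match:
            statuses.append(match.group(1))
    return {
        "tasks": len(statuses),
        # An empty progress file is treated as not done rather than vacuously passing.
        "all_pass": bool(statuses) and all(s == "PASS" for s in statuses),
    }
```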

Failure signal:

Any task verification fails and is not debugged/fixed before marking complete.
Tests fail and errors are suppressed or ignored.
Spec or plan was skipped when it should have been written.
Review raised an issue and it was not fixed.

anti-rationalization table

Reference this table whenever you feel the urge to skip a step or declare success without evidence.

| Your excuse | Why it's wrong | Do this instead |
|---|---|---|
| "Too simple to need tests" | 40% of P0 incidents come from "too simple" code. | Write the test. It takes 2 minutes. |
| "I already checked, looks fine" | Reading is not verifying. | Run it. `ls`, `wc -l`, `grep`, actual execution. |