Design, test, review, and maintain agent skills for OpenClaw systems using multi-agent iterative refinement. Orchestrates Designer, Reviewer, and Tester suba...
SKILL.md

---
name: skill-engineer
description: Design, test, review, and maintain agent skills for OpenClaw systems using multi-agent iterative refinement. Orchestrates Designer, Reviewer, and Tester subagents for quality-gated skill development. Use when user asks to "design skill", "review skill", "test skill", "audit skills", "refactor skill", or mentions "agent kit quality".
metadata:
  author: skill-engineer
  version: 3.6.0
  owner: main agent (or any agent in the kit requiring skill development capability)
  based_on: Anthropic Complete Guide to Building Skills for Claude (2026-01), Anthropic Improving skill-creator (2026-03-03)
---

# Skill Engineer

Own the full lifecycle of agent skills in your OpenClaw agent kit. The entire multi-agent workflow depends on skill quality — a weak skill produces weak results across every run.

**Core principle:** Builders don't evaluate their own work. This skill enforces separation of concerns through a multi-agent architecture where design, review, and testing are performed by independent subagents.


## Skill Taxonomy

> Source: Anthropic "Improving skill-creator" (2026-03-03)

Skills fall into two categories. This distinction drives design decisions, testing strategy, and lifecycle management.

### Capability Uplift Skills

The model can't do it well alone — the skill injects techniques, patterns, or constraints that produce better output than prompting alone.

**Examples:** Document creation skills (PDF generation), complex formatting, specialized analysis pipelines.

**Testing focus:** Monitor whether the base model has caught up. If the base model passes your evals *without* the skill loaded, the skill's techniques have been incorporated into model default behavior. The skill isn't broken — it's no longer necessary.

**Lifecycle:** These skills may "retire" as models improve. Build evals that can detect when retirement is appropriate.

### Encoded Preference Skills

The model can already do each step — the skill sequences operations according to your team's specific process.

**Examples:** NDA review against set criteria, weekly report generation from specific data sources, brand compliance checks.

**Testing focus:** Verify the skill faithfully reproduces your actual workflow, not the model's "free improvisation." Fidelity to process is the metric.

**Lifecycle:** These skills are durable — they encode organizational knowledge that doesn't change with model capability. They need maintenance when *processes* change, not when *models* change.

### Design Implication

When the Designer begins work, classify the skill:

| Classification | Design priority | Test priority | Retirement risk |
|---------------|----------------|---------------|-----------------|
| Capability uplift | Technique precision | Base model comparison | High — monitor model progress |
| Encoded preference | Process fidelity | Workflow reproduction | Low — tied to org process |

## Mandatory Dependencies

This skill **requires** the following to be installed and available:

| Dependency | Type | Purpose | Install from |
|------------|------|---------|-------------|
| `deepwiki` | Skill | Query OpenClaw source for current API behavior | `liaosvcaf/openclaw-skill-deepwiki` |
| Vector memory DB | OpenClaw feature | Semantic search across session history, notes, and memory files | Enable in `openclaw.json` (`memory.enabled: true`) |

**Before starting any skill design or update session, verify both are available:**
```bash
# Check deepwiki
ls ~/.openclaw/skills/deepwiki/deepwiki.sh || ls ~/.openclaw/workspace-*/skills/deepwiki/deepwiki.sh

# Check vector memory (should return results, not empty)
# Use the memory_search tool with a known topic from recent sessions
```

If deepwiki is missing, install from `liaosvcaf/openclaw-skill-deepwiki`.
If vector memory returns no results on known topics, check that `memory.enabled` is true in `openclaw.json` and that indexing has run.

### Why These Are Non-Negotiable

**DeepWiki:** OpenClaw APIs are version-specific. Without DeepWiki, skills are written against memory of past behavior — which drifts as OpenClaw updates. DeepWiki grounds skill content in actual source code. A skill engineer without DeepWiki is working blind.

**Vector memory DB:** Session history, Obsidian notes, and past decisions are indexed here. Without it, the agent falls back to manual file search — slower, less accurate, and misses cross-document connections. Critical context from past sessions (installation guides, design decisions, pitfalls) lives in this index.

## Memory Search Protocol (MANDATORY)

Before searching files manually, always query the vector memory database first. It indexes session history, Obsidian notes, and memory files — and finds cross-document connections that manual search misses.

**When to query vector memory:**
- User asks "do you remember...", "find the notes about...", "we did X before..."
- Looking for past installation guides, design decisions, or troubleshooting records
- Any question about prior work, configurations, or lessons learned

**How to query correctly:**
```
memory_search("your query here", maxResults=5)
```

**Critical rule: try multiple queries before giving up.**

If the first query returns empty, do NOT fall back to manual file search immediately. Try at least 3 different phrasings:

| First query fails | Try instead |
|-------------------|-------------|
| "Docker OpenClaw installation" | "dockerized openclaw Titan" |
| "dockerized openclaw Titan" | "openclaw isolation install guide" |
| Still empty | Then fall back to manual file search |

**Lesson learned (2026-03-03):** When asked to find Docker/OpenClaw installation notes, `memory_search` returned empty on the first query and the agent immediately switched to manual SQLite/file search. The correct approach was to try different query phrasings — the second attempt (`"dockerized OpenClaw installation Titan setup"`) returned 5 relevant results directly from indexed Obsidian notes. Manual file search is a last resort, not a first response.

---

## DeepWiki Staleness Protocol (MANDATORY)

OpenClaw APIs, skill loading behavior, subagent mechanics, and frontmatter fields are **version-specific**. Information in this skill or any skill referencing OpenClaw internals may be outdated.

**ALWAYS query DeepWiki when:**
- Designing a skill that uses `sessions_spawn`, tool calls, or OpenClaw-specific APIs
- Referencing skill frontmatter fields or loading precedence
- Updating an existing skill that has version-tagged sections
- The installed OpenClaw version differs from any version tag in the skill
- You are unsure whether an API, field, or behavior still exists

**How to check:**
```bash
# Check current OpenClaw version
openclaw --version

# Query DeepWiki for current behavior
~/.openclaw/skills/deepwiki/deepwiki.sh ask openclaw/openclaw "YOUR QUESTION"
```

**Do NOT rely on memory or this skill's documented behavior without verifying** when the topic is OpenClaw internals. DeepWiki is grounded in the actual source code. This skill's documentation is not.

**Verification checklist before shipping any skill that references OpenClaw internals:**
- [ ] Checked `openclaw --version` against version tags in the skill
- [ ] Queried DeepWiki to confirm API/field behavior is current
- [ ] Updated version tags if behavior has changed

---

## Scope & Boundaries

### What This Skill Handles
- Skill design: SKILL.md, skill.yml, README.md, tests, scripts, references
- Skill review: quality evaluation, rubric scoring, gap analysis
- Skill testing: self-play validation, trigger testing, functional testing
- Skill maintenance: iteration based on feedback, refactoring
- Agent kit audit: inventory, consistency, quality scoring across all skills

### What This Skill Does NOT Handle
- **Release pipeline** — publishing, versioning, changelogs belong to release processes
- **Repository management** — git submodules, repo creation, branch strategy belong to your VCS workflow
- **Deployment** — installing skills to agents, configuration management
- **Tracking** — progress tracking, task management, project boards
- **Infrastructure** — MCP servers, API keys, environment setup

### Where This Skill Ends
This skill produces **validated skill artifacts** (SKILL.md, skill.yml, README.md, tests, scripts). Once artifacts pass quality gates, responsibility transfers to whatever system handles publishing and deployment.

### Success Criteria

A skill development cycle is considered successful when:

1. **Quality gates passed** — Reviewer scores ≥28/33 (Deploy threshold)
2. **No blocking issues** — Tester reports no issues marked as "blocking"
3. **All artifacts generated** — SKILL.md, skill.yml, README.md, tests/, scripts/ (if needed), references/ (if needed)
4. **OPSEC clean** — No hardcoded secrets, paths, org names, or private URLs
5. **Scripts validated** — All deterministic validation scripts execute successfully on target platform(s)
6. **Trigger accuracy** — Tester reports ≥90% trigger accuracy (true positives + true negatives)

If any criterion fails, the skill returns to the Designer for revision.

### Inputs

When invoking this skill, the orchestrator must gather:

| Input | Description | Required | Source |
|-------|-------------|----------|--------|
| **Problem description** | What capability or workflow needs to be enabled | Yes | User conversation |
| **Target audience** | Which agent(s) will use this skill | Yes | User or inferred |
| **Expected interactions** | With users, APIs, files, MCP servers, other skills | Yes | Requirements discussion |
| **Inputs/Outputs** | What data the skill receives and produces | Yes | Requirements discussion |
| **Constraints** | Performance limits, security requirements, dependencies | No | User or system |
| **Prior feedback** | Review or test reports from previous iterations | No | Previous Reviewer/Tester |
| **Existing artifacts** | If refactoring/maintaining an existing skill | No | File system |

**Example requirements gathering:**
```
User: "I need a skill for analyzing competitor websites"

Orchestrator gathers:
- Problem: Automate competitor analysis with structured output
- Audience: research-agent
- Interactions: web_fetch, browser tool, writes markdown reports
- Inputs: competitor URLs, analysis criteria
- Outputs: comparison table, insights markdown
- Constraints: must complete in <60s per site
```

These inputs are then passed to the Designer to begin the design process.

---

## Architecture Overview

The skill-engineer uses a three-role iterative architecture. The orchestrator spawns subagents for each role and never does creative or evaluation work directly.

### Pattern Selection (IMPORTANT)

Two architecture modes are available. Choose based on complexity:

**Mode A: Director-Controlled (simple/short skill work)**
Use when: ≤2 phases, <10 minutes total, user interaction needed between phases (e.g., quick fixes, single-skill reviews).

```
Director/Orchestrator (main agent, depth 0)
    ├─ Spawn ──→ Designer (depth 1)
    ├─ Spawn ──→ Reviewer (depth 1)
    └─ Spawn ──→ Tester (depth 1)
```

Risk: announce-to-action gap — if user sends a message while waiting for a subagent, the main agent may handle that instead of chaining the next phase. Mitigate with cron safety net (see below).

**Mode B: Orchestrator Subagent Pattern (complex/long skill work)**
Use when: 3+ phases, >10 minutes, pipeline must not stall, parallel workers needed.

```
Director (user-facing, depth 0)
    └── Orchestrator (pipeline owner, depth 1)
        ├─ Spawn ──→ Designer (depth 2)
        ├─ Spawn ──→ Reviewer (depth 2)
        └─ Spawn ──→ Tester (depth 2)
```

The Director spawns a single Orchestrator subagent with the full task description. The Orchestrator owns the entire Design→Review→Test loop without yielding control between phases. User messages go to the Director; the pipeline runs uninterrupted.

**Required config for Mode B:**
```json
{
  "agents": { "defaults": { "subagents": { "maxSpawnDepth": 2 } } }
}
```

**Why Mode B is superior for complex work:**
- No announce-to-action gap (orchestrator chains phases immediately within the same session)
- Immune to user interruption between phases
- Persistent pipeline state without re-deriving from files each turn

Reference: `orchestrator-subagent-pattern-2026-02-28.md` (Obsidian notes) — documented after a real 70-minute pipeline stall incident.

### Mode A Safety Net (cron)

When using Mode A, set a cron safety net after each spawn to catch announce-to-action failures:
```
"Check if [designer/reviewer/tester] subagent has completed. If so and next phase not started, resume pipeline."
(fires 15 min after spawn)
```

### Iteration Loop

```
Designer → Reviewer ──pass──→ Tester ──pass──→ Ship
              │                  │
              fail               fail
              │                  │
              ▼                  ▼
         Designer revises   Designer revises
              │                  │
              ▼                  ▼
           Reviewer          Reviewer + Tester
              │
           (max 3 iterations, then fail)
```

**Exit conditions:**
- **Ship:** Reviewer scores ≥ 28/33 (85%+) AND Tester reports no blocking issues
- **Revise:** Reviewer or Tester found fixable issues (iterate)
- **Fail:** 3 iterations exhausted and still below quality bar

### Iteration Failure Path

After 3 failed iterations, the orchestrator must:

1. **Stop iteration** — do not continue trying
2. **Report failure to user** with:
   - Summary: "Skill development failed after 3 iterations"
   - All 3 iteration reports (Reviewer + Tester feedback)
   - Final quality score
   - List of unresolved blocking issues
3. **Present options to user:**
   - Provide more context or clarify requirements (restart with better inputs)
   - Simplify scope (reduce skill complexity and retry)
   - Abandon this skill (requirements may be infeasible)
4. **Do NOT silently fail** — always report to user and await decision

**Never:** Continue past 3 iterations or ship a skill that hasn't passed quality gates.

### Subagent Spawning Mechanism

> **Version note:** Verified against OpenClaw v2026.2.26. API may change.

In OpenClaw, subagents are spawned using the `sessions_spawn` tool (not a CLI command). Subagents run in isolated sessions, announce results back to the requester's channel when complete, and are auto-archived after 60 minutes by default.

**Key constraints on subagents:**
- Default max spawn depth is 1 (subagents cannot spawn further subagents unless `maxSpawnDepth: 2` is configured)
- Default max 5 active children per agent at once
- Subagents do NOT receive `SOUL.md`, `IDENTITY.md`, or `USER.md` — only `AGENTS.md` and `TOOLS.md`
- Use `runTimeoutSeconds` to prevent hanging (900s for Designer, 600s for Reviewer/Tester)
- Results are announced back automatically; reply `ANNOUNCE_SKIP` to suppress

---

## Director vs. Orchestrator Roles

This is the most important architectural decision. Understand it before proceeding.

### The Problem with Naive Single-Agent Control

The natural instinct is to have the main agent (you) directly manage the Design→Review→Test loop:

```
Main agent
    ├── spawns Designer → waits for announce → spawns Reviewer → waits → spawns Tester
```

**This breaks in three ways:**

1. **Announce-to-action gap:** When a subagent finishes, OpenClaw sends a completion announce that triggers a new LLM turn. The LLM *may* report results to the user and stop — treating the announce as informational rather than a pipeline trigger. There is no mechanism that forces the next action.

2. **Context loss:** Each new turn is a fresh LLM call. Between subagent completion and the next turn, there is no persistent state machine tracking "we're in iteration 2, reviewer passed, now run Tester." The agent must re-derive this from files every time — fragile over 3+ iterations.

3. **User message interruption:** If the user sends a message while the pipeline is between phases, the main agent handles that message instead of continuing. The pipeline stalls silently until the user notices.

**Real incident:** A book-writer pipeline stalled for 70 minutes because a research subagent completed and announced back, but the Director reported results to the user and stopped — never spawning the writing phase. (2026-02-28)

### The Solution: One Level of Indirection

Add an intermediate Orchestrator subagent that owns the pipeline. The main agent becomes the Director — it talks to the user. The Orchestrator does the pipeline work. They don't share context.

```
Director (main agent, depth 0)  ←→  User
    │
    └── Orchestrator (subagent, depth 1) — owns Design→Review→Test loop
        ├── Designer (depth 2)
        ├── Reviewer (depth 2)
        └── Tester (depth 2)
```

**Why this works:**
- The Orchestrator runs as a single continuous session. It processes each subagent's completion announce immediately — no turn boundary between phases, no gap.
- User messages go to the Director (depth 0), not the Orchestrator. The pipeline cannot be interrupted by user activity.
- The Orchestrator maintains full pipeline state throughout its run without re-deriving from files.

**Required config (add to `openclaw.json` before using this pattern):**
```json
{
  "agents": { "defaults": { "subagents": { "maxSpawnDepth": 2 } } }
}
```

### When to Use Each Mode

| Situation | Use | Why |
|-----------|-----|-----|
| Quick fix, single skill review, <10 min | Director-only (depth 1 subagents) | Simpler, fewer spawns |
| Full design cycle (Design+Review+Test) | Director + Orchestrator (depth 2) | Pipeline cannot afford to stall |
| Any pipeline with 3+ sequential phases | Director + Orchestrator (depth 2) | Announce-to-action gap becomes critical |
| maxSpawnDepth not set to 2 | Director-only with cron safety net | No choice — see fallback below |

### Fallback: Director-Only with Cron Safety Net

If `maxSpawnDepth: 2` is not configured, use Director-only mode but add a cron safety net after each subagent spawn:

```
After spawning Designer, register a cron job:
"Check if Designer has completed (look for output at /path/to/skill/SKILL.md).
 If completed and Reviewer not yet started, spawn Reviewer now."
(fires 15 minutes after spawn)
```

This mitigates but does not eliminate the announce-to-action gap.

---

## Director Responsibilities

The Director (main agent) talks to the user and kicks off the pipeline. It does NOT do design, review, or testing work.

1. **Gather requirements** from the user (problem, audience, inputs/outputs, interactions)
2. **Query DeepWiki** — if the skill touches any OpenClaw internals, query DeepWiki FIRST:
   ```bash
   ~/.openclaw/skills/deepwiki/deepwiki.sh ask openclaw/openclaw "RELEVANT QUESTION"
   ```
3. **Choose mode** — Director-only (simple) or Director+Orchestrator (full cycle)
4. **For Director+Orchestrator mode:** Spawn a single Orchestrator subagent with complete task description including: requirements, DeepWiki findings, artifact output path, quality rubric location, max iterations
5. **For Director-only mode:** Execute Orchestrator Responsibilities directly (see below)
6. **Relay final result** to user when pipeline completes

## Orchestrator Responsibilities

The Orchestrator (depth-1 subagent in Mode B, or main agent in fallback mode) owns the Design→Review→Test loop. It does NOT write skill content or evaluate quality — it only coordinates.

1. **Query DeepWiki** for any OpenClaw-specific topics in the requirements (if Director hasn't already)
2. **Spawn Designer** with requirements, DeepWiki findings, and any prior feedback
   ```
   sessions_spawn(
     task="Act as Designer. Requirements: [...]. Write artifacts to /path/to/skill/",
     label="skill-v1-designer",
     runTimeoutSeconds=900
   )
   ```
3. **Collect Designer output** — verify all required files exist at output path
4. **Spawn Reviewer** with artifacts and quality rubric
   ```
   sessions_spawn(
     task="Act as Reviewer. Evaluate skill at /path/to/skill/ using rubric: [...]. Score all 33 checks.",
     label="skill-v1-reviewer",
     runTimeoutSeconds=600
   )
   ```
5. **Collect Reviewer feedback** (scores + structured issues)
6. **If score <28/33 or blocking issues:** feed feedback back to Designer → go to step 2, increment iteration count
7. **If passing review:** Spawn Tester
   ```
   sessions_spawn(
     task="Act as Tester. Run self-play on skill at /path/to/skill/. Test triggers, functional steps, edge cases.",
     label="skill-v1-tester",
     runTimeoutSeconds=600
   )
   ```
8. **Collect Tester results** (pass/fail + report)
9. **If blocking issues:** feed test results back to Designer → go to step 2
10. **If all pass:** add quality scorecard to README.md → announce completion to Director
11. **Track iteration count** — after 3 failed iterations, report failure with all iteration logs

### Final Review Scores in README

Every shipped skill must include a quality scorecard in its README.md. This is the Reviewer's final scores, added by the Orchestrator before delivery:

```markdown
## Quality Scorecard

| Category | Score | Details |
|----------|-------|---------|
| Completeness (SQ-A) | 7/7 | All checks pass |
| Clarity (SQ-B) | 4/5 | Minor ambiguity in edge case handling |
| Balance (SQ-C) | 4/4 | AI/script split appropriate |
| Integration (SQ-D) | 4/4 | Compatible with standard agent kit |
| Scope (SCOPE) | 3/3 | Clean boundaries, no leaks |
| OPSEC | 2/2 | No violations |
| References (REF) | 3/3 | All sources cited |
| Architecture (ARCH) | 2/2 | Separation of concerns maintained |
| **Total** | **29/30** | |

*Scored by skill-engineer Reviewer (iteration 2)*
```

This scorecard serves as a quality certificate. Users can assess skill quality before installing.

### Version Control

The orchestrator manages git commits throughout the workflow:

**When to commit:**
- After Designer produces initial artifacts (iteration 1): `git add . && git commit -m "feat: initial design for <skill-name>"`
- After Designer revisions (iteration 2+): `git add . && git commit -m "fix: address review issues (iteration N)"`
- After Tester passes and before ship: `git add README.md && git commit -m "docs: add quality scorecard for <skill-name>"`

**When to push:**
- After final ship (all gates passed): `git push origin main`
- Do NOT push intermediate iterations — only ship-ready artifacts

**Branch strategy:**
- Work in main branch for routine skill development
- Use feature branches for experimental or breaking changes

### Error Handling

The orchestrator must handle technical failures gracefully:

| Failure Type | Detection | Response |
|--------------|-----------|----------|
| **Git push fails** | Exit code ≠ 0 | Retry once. If fails again, report to user: "Cannot push to remote. Check network/permissions." |
| **OPSEC scan script missing** | File not found | Skip OPSEC automated check, but flag in review: "Manual OPSEC review required — script not found." |
| **File write errors** | Permission denied | Report: "Cannot write to [path]. Check file permissions." Fail workflow. |
| **Subagent crashes** | Timeout or error | Log the error, attempt retry once. If fails again, report: "Subagent failed. Manual intervention required." |
| **Review score = 0** | All checks fail | Report: "Skill failed all quality checks. Requirements may be unclear or skill design is fundamentally flawed. Recommend starting over." |

**Retry logic:**
- Git operations: 1 retry after 5s delay
- File operations: 1 retry after 2s delay
- Subagent spawns: 1 retry with fresh context

**Fail-fast rules:**
- If iteration count exceeds 3, fail immediately (no further retries)
- If OPSEC violations found, fail immediately (no iteration)
- If required files cannot be written, fail immediately

### Performance Notes

**Orchestrator workload:** Coordinating Designer/Reviewer/Tester across 1-3 iterations can be complex, especially for large skills (1000+ lines). The orchestrator manages:
- Requirements gathering
- Subagent coordination (3-9 spawns in typical workflow)
- Feedback routing between roles
- Iteration tracking
- Final scorecard assembly
- Git operations

**Token considerations:** A full 3-iteration cycle can consume 50k-150k tokens depending on skill complexity. For extremely complex skills, consider:
- Breaking into sub-skills (each with simpler scope)
- Using separate agent sessions (Option 2 spawning) to isolate token contexts
- Simplifying requirements before starting iteration

**If orchestrator feels overwhelmed:** This is a signal that the skill being designed may be too complex. Revisit the scope definition and consider decomposition.

### Spawning Context

Each subagent receives only what it needs:

| Role | Receives | Does NOT Receive |
|------|----------|------------------|
| Designer | Requirements, prior feedback (if any), design principles | Reviewer rubric internals |
| Reviewer | Skill artifacts, quality rubric, scope boundaries | Requirements discussion |
| Tester | Skill artifacts, test protocol | Review scores |

---

## Designer Role

**Purpose:** Generate and revise skill content.

**For complete Designer instructions, see:** `references/designer-guide.md`

### Quick Reference

**Inputs:** Requirements, design principles, feedback (on iterations 2+)

**Outputs:** SKILL.md, skill.yml, README.md, tests/, scripts/, references/

**Key constraints:**
- Apply progressive disclosure (frontmatter → body → linked files)
- Apply scoping rules (explicit boundaries, no scope creep)
- Apply tool selection guardrails (validate before execution)
- README for strangers only (no internal org details)
- Follow AI vs. Script decision framework

**Design principles:**
- Progressive disclosure (3-level system)
- Composability (works alongside other skills)
- Portability (same skill works across Claude.ai, Claude Code, API)

---

## Reviewer Role

**Purpose:** Independent quality evaluation. The Reviewer has never seen the requirements discussion — it evaluates artifacts on their own merits.

**For complete Reviewer rubric and scoring guide, see:** `references/reviewer-rubric.md`

### Quick Reference

**Inputs:** Skill artifacts, quality rubric, scope boundaries

**Outputs:** Review report with scores, verdict (PASS/REVISE/FAIL), issues, strengths

**Quality rubric (33 checks total):**
- SQ-A: Completeness (8 checks)
- SQ-B: Clarity (5 checks)
- SQ-C: Balance (5 checks)
- SQ-D: Integration (5 checks)
- SCOPE: Boundaries (3 checks)
- OPSEC: Security (2 checks)
- REF: References (3 checks)
- ARCH: Architecture (2 checks)

**Scoring thresholds:**
- 28-33 pass → Deploy (PASS verdict)
- 20-27 pass → Revise (fixable issues)
- 10-19 pass → Redesign (major rework)
- 0-9 pass → Reject (fundamentally flawed)

**Pre-review:** Run deterministic validation scripts before manual evaluation

---

## Tester Role

**Purpose:** Empirical validation via self-play. The Tester loads the skill and attempts realistic tasks.

**For complete Tester protocol, see:** `references/tester-protocol.md`

### Quick Reference

**Inputs:** Skill artifacts, test protocol

**Outputs:** Test report with trigger accuracy, functional test results, edge cases, blocking/non-blocking issues, verdict (PASS/FAIL)

**Test protocol:**
1. **Trigger tests** — verify skill loads correctly (≥90% accuracy threshold)
2. **Functional tests** — execute 2-3 realistic tasks, note confusion points
3. **Edge case tests** — missing inputs, ambiguous requirements, boundary cases

**Issue classification:**
- **Blocking:** Prevents skill from functioning (must fix before ship)
- **Non-blocking:** Impacts quality but doesn't break core functionality

**Pass criteria:** No blocking issues + ≥90% trigger accuracy

### Separation of Concerns Rule

> **The agent that DESIGNS a skill must NOT be the same agent that AUDITS it in the same session.**

This is a hard architectural rule, not a guideline. When the same agent designs and audits in one session, it creates structural circularity: the designer unconsciously frames evaluation in terms of their own intentions, missing gaps that a fresh reader would catch.

**Enforcement:**
- All audit work (Reviewer role, Tester role) MUST be performed by a fresh subagent spawned after design is complete.
- Use `openclaw agent --session-id <unique-id>` (Option 2 spawning) when auditing a skill the current session has designed.
- The orchestrator may never evaluate its own spawned Designer's output directly — it must route all evaluation through an independent Reviewer subagent.
- In role-based execution (Option 1), the agent must explicitly transition: complete all Designer work, then start the Reviewer role with no reference to design-time reasoning.

**Why this matters:**
- A designer who audits their own work will score it against their intentions, not against what a new agent will actually experience.
- The rubric (SQ-C3) explicitly prohibits a sub-agent from being both producer AND evaluator of the same output.
- This rule is the implementation of that check at the session level.

**Example — correct:**
```
# Session A: Designer work
sessions_spawn(
  task="Design a skill for X. Write artifacts to /path/to/skill/",
  label="skill-v1-designer",
  runTimeoutSeconds=900
)

# Session B: Audit (fresh session, no context from Session A)
sessions_spawn(
  task="Audit the skill at /path/to/skill/ using the reviewer rubric.",
  label="skill-v1-auditor",
  runTimeoutSeconds=600
)
```

**Example — incorrect:**
```
[Session A]
1. Design the skill...
2. Now let me review the skill I just designed...  ← VIOLATION
```

---


---

## Evals Framework

> Source: Anthropic "Improving skill-creator" (2026-03-03). Adapted for OpenClaw skill-engineer.

Evals turn "seems to work" into "verified to work." Every shipped skill should have persistent evals that can be re-run after model updates, skill edits, or environment changes.

### Eval Structure

An eval consists of:
1. **Test prompt** — a realistic user input that should trigger the skill
2. **Expected behavior description** — what "good" looks like (natural language, not exact match)
3. **Pass/fail criteria** — specific, observable conditions

Store evals in the skill's `tests/` directory:
```
tests/
├── evals.json           # Eval definitions
├── benchmarks/          # Benchmark run results (timestamped)
└── comparisons/         # A/B comparison results
```

### Eval Types

| Type | Purpose | When to run |
|------|---------|-------------|
| **Regression eval** | Catch quality drops after changes | After every skill edit or model update |
| **Capability eval** | Detect if base model has outgrown the skill | Monthly, or after major model releases |
| **Trigger eval** | Verify skill fires correctly | After description changes |

### Benchmark Mode

Run standardized assessments tracking:
- **Eval pass rate** — what percentage of evals pass
- **Elapsed time** — how long each eval takes
- **Token usage** — cost per eval run

Store benchmark results with timestamps for trend tracking:
```json
{
  "timestamp": "2026-03-04T12:00:00Z",
  "skill": "my-skill",
  "model": "claude-sonnet-4-5",
  "pass_rate": 0.85,
  "avg_time_s": 12.3,
  "avg_tokens": 4200,
  "evals_run": 10
}
```

### Comparator Testing (A/B Blind Test)

Compare two skill versions — or skill vs. no skill — using a blind judge:

1. Run the same test prompt through Version A and Version B
2. A **Comparator subagent** (fresh context, no knowledge of which is which) evaluates both outputs
3. The Comparator scores on relevant dimensions without knowing the source

**When to use:**
- Before shipping a major revision (old vs. new)
- To justify a skill's existence (with-skill vs. without-skill)
- To compare two alternative approaches during design

**Spawning a Comparator:**
```
sessions_spawn(
  task="You are a blind comparator. You will receive Output A and Output B for the same task. Score each on [dimensions]. You do NOT know which version produced which output. Be objective.",
  label="skill-comparator",
  runTimeoutSeconds=300
)
```

### Description Tuning

Skill descriptions determine trigger accuracy. As skill count grows, description precision becomes critical:
- Too broad → false triggers (skill loads when irrelevant)
- Too narrow → misses (skill doesn't load when needed)

**Tuning protocol:**
1. Collect 10-20 sample prompts (mix of should-trigger and should-not-trigger)
2. Run each prompt and check whether the skill triggers correctly
3. Analyze false positives and false negatives
4. Revise the `description` field to be more precise
5. Re-run trigger tests to verify improvement

**Target:** ≥90% trigger accuracy on sample prompts. Anthropic's internal testing improved 5 out of 6 public skills using this method.

### Skill Retirement Protocol

Skills are not forever. Capability uplift skills may become unnecessary as models improve.

**Retirement signal:** Base model passes ≥80% of the skill's evals *without* the skill loaded.

**Retirement process:**
1. Run capability evals with skill disabled
2. If pass rate ≥80%, flag skill as "retirement candidate"
3. Run comparator test (with-skill vs. without-skill) to confirm
4. If comparator shows no significant quality difference, retire the skill
5. Archive (don't delete) — the skill may become relevant again with different models

**Track in audit reports:**
```markdown
## Retirement Candidates

| Skill | Capability Eval (no skill) | Comparator Result | Recommendation |
|-------|---------------------------|-------------------|----------------|
| pdf-creator | 85% pass | No significant difference | Retire |
```

## Agent Kit Audit Protocol

Periodic full audit of the agent kit:

1. **Inventory all skills** — list every SKILL.md with owner agent
2. **Check for orphans** — skills that no agent uses
3. **Check for duplicates** — overlapping functionality
4. **Check for gaps** — workflow steps that have no skill
5. **Check balance** — are some agents overloaded while others idle?
6. **Check consistency** — naming conventions, output formats
7. **Run quality score** on each skill (SQ-A through SQ-D)
8. **Produce audit report** with scores and recommendations

### Audit Output Template

```markdown
# Agent Kit Audit Report

**Date:** [date]
**Skills audited:** [count]

## Skill Inventory

| # | Skill | Agent | Quality Score | Status |
|---|-------|-------|--------------|--------|
| 1 | [name] | [agent] | X/33 | Deploy/Revise/Redesign |

## Issues Found
1. ...

## Recommendations
1. ...

## Action Items
| # | Action | Priority | Owner |
|---|--------|----------|-------|
```

---

## Skill Interaction Map

Maintain a map of how skills interact:

```
orchestrator-agent (coordinates workflow)
    ├── content-creator (writes content)
    │   └── consumes: research outputs, review feedback
    ├── content-reviewer (reviews content)
    │   └── produces: review reports
    ├── research-analyst (researches topics)
    │   └── produces: research consumed by content-creator
    ├── validator (validates outputs)
    └── skill-engineer (this skill — meta)
        └── consumes: all skills for audit
```

Adapt this to your specific agent architecture.

---

## OpenClaw Skill System Reference

> **Version note:** This section is based on OpenClaw **v2026.2.26**. Skill system behavior (frontmatter fields, loading precedence, subagent APIs) may change across versions. Verify against source or DeepWiki when upgrading.

### Skill Structure

A skill is a **directory** containing at minimum a `SKILL.md` file:

```
my-skill/
├── SKILL.md          # Required: frontmatter + instructions
├── skill.yml         # Optional: ClawhHub publish metadata
├── README.md         # Optional: human-facing documentation
├── scripts/          # Optional: deterministic helper scripts
├── tests/            # Optional: test cases and fixtures
└── references/       # Optional: detailed linked documentation
```

### SKILL.md Frontmatter Format

Required fields:
```yaml
---
name: skill-name          # kebab-case, no spaces/capitals/underscores
description: |            # What it does + when to use it + trigger phrases
  [What it does]. Use when user [trigger phrases]. [Key capabilities].
---
```

Full supported fields:
```yaml
---
name: skill-name
description: ...
homepage: https://...                    # URL for Skills UI
user-invocable: true                     # Expose as slash command (default: true)
disable-model-invocation: false          # Exclude from model prompt (default: false)
command-dispatch: tool                   # Bypass model, dispatch to tool directly
command-tool: tool-name                  # Tool to invoke when command-dispatch is set
command-arg-mode: raw                    # Argument forwarding mode (default: raw)
metadata: {"openclaw": {"always": true, "emoji": "🔧", "os": ["darwin","linux"], "requires": {"bins": ["curl","python3"]}, "primaryEnv": "MY_API_KEY"}}
---
```

`metadata.openclaw` load-time gates:
| Field | Purpose |
|-------|---------|
| `always: true` | Always include, skip all other gates |
| `emoji` | Emoji shown in macOS Skills UI |
| `os` | Limit to platforms: `darwin`, `linux`, `win32` |
| `requires.bins` | All binaries must exist on PATH |
| `requires.anyBins` | At least one binary must exist |
| `requires.env` | Environment variables must exist |
| `requires.config` | openclaw.json paths must be truthy |
| `primaryEnv` | Links to `skills.entries.<name>.apiKey` in config |

### Skill Loading Precedence

Skills are loaded from these locations (highest → lowest priority):

1. `<workspace>/skills/` — agent-specific, highest precedence
2. `~/.openclaw/skills/` — shared across all agents on machine
3. `skills.load.extraDirs` in `openclaw.json` — additional directories
4. Bundled skills — shipped with OpenClaw, lowest precedence
5. Plugin skills — listed in `openclaw.plugin.json`

### When to Use Each Location

| Location | Use when |
|----------|----------|
| `<workspace>/skills/` | Skill is specific to one agent's role; under active development |
| `~/.openclaw/skills/` | Skill should be available to all agents on this machine |

### How Skills Are Triggered

OpenClaw builds a system prompt with a compact XML list of available skills (name, description, location). The model reads this list and decides which skills to load. Skills are NOT auto-injected — the model must explicitly `read` the `SKILL.md` when needed.

**Trigger accuracy goal:** ≥90% (skill loads when relevant, does NOT load when irrelevant).

### Skill Discovery Command

To inventory all skills on a machine:
```bash
find ~/.openclaw/ -name "SKILL.md" -not -path "*/node_modules/*" | sort
```

## Configuration

No persistent configuration required. The skill uses tools available in the agent's environment.

| Requirement | Description |
|-------------|-------------|
| `deepwiki` skill | Query OpenClaw source for current API behavior (`liaosvcaf/openclaw-skill-deepwiki`) |
| Vector memory | Semantic search across session history (`memory.enabled: true` in openclaw.json) |
| `gh` CLI | GitHub repo creation and visibility changes for release pipeline |
| `clawhub` CLI | Publish skills to ClawhHub registry (`npm i -g clawhub`) |

See `references/designer-guide.md` for full environment setup.
Skill Engineer

SKILL.md

related skills