skill quality audit

check a SKILL.md for the 6 structural components tier-1 scoring looks at: intent, inputs, procedure with numbered steps, decision points, output contract, outcome signal. produces a 0-10 score plus a checklist of missing components. trigger when an author wants to know why their skill ranks below others or asks "is this skill good".

SkillRank score: 9.2 / 10

evaluated by implexa, curated · 2026-05-27

- structure: 9.5
- trigger phrases: 9.0
- procedure: 9.5
- edge cases: 9.0
- documentation: 9.0

original SKILL.md from implexa

---
description: check a SKILL.md for the 6 structural components tier-1 scoring looks at. produces a 0-10 score plus a checklist of missing components. trigger when an author wants to know why their skill ranks below others or asks "is this skill good".
---

# skill quality audit

audit any SKILL.md against the same structural rubric implexa's tier-1 scoring uses. catches missing components, weak trigger language, and structural gaps before they hurt your skill's ranking. useful for authors iterating on their own skills or reviewing teammates' submissions.

## intent

before publishing a skill (or after seeing one score lower than expected), check it against the 6-component rubric. the audit is structural only - it does not evaluate whether the skill works at runtime (that is tier-2). but structural completeness is what tier-1 scoring measures, and tier-1 drives the leaderboard rank.

## inputs

- the SKILL.md body as text (paste it or pass the path)
- optional: the slug, if you want to compare against a previously-saved version

## procedure

### step 1, parse the frontmatter

every well-formed SKILL.md starts with yaml frontmatter:
```
---
description: <one-line summary, includes trigger phrases>
---
```

audit the description:
- is it present?
- does it include "trigger when" or trigger-phrase language?
- is it under 280 chars (the embedding-input sweet spot)?

points off if missing or vague.

### step 2, check for each of the 6 components

scan the body for these headers (or close equivalents):

1. **intent**: what the skill exists to do, in 1-3 sentences. not the procedure, the why.
2. **inputs**: what the skill needs to run. tools, data, user context, prereqs.
3. **procedure**: numbered or stepped sequence of actions. each step has a "what to render" or "what to capture".
4. **decision points**: branches. "if X then Y, else Z" patterns. what to do when things are ambiguous.
5. **output contract**: what the skill produces. format, length, where it goes.
6. **outcome signal**: how to know it worked. what would success look like 7 days later.

assign 0-2 points per component. 0 = missing, 1 = present but thin, 2 = substantive.

### step 3, score the trigger phrases

look for explicit trigger phrases in the description or in a dedicated section. count distinct phrases (or example user messages). more is not always better - aim for 3-7 high-signal phrases that map to how real users would ask.

### step 4, scan for anti-patterns

deduct points for:
- **vague procedure** ("do X carefully"): step description without a concrete tool call
- **missing error handling**: no decision points for the common failure modes
- **no measurable outcome**: outcome signal is "user feels good" rather than something observable
- **bloat**: skill body over 8k chars without justification (truncates during embedding)

### step 5, compute the score and render

sum the component points (max 12) plus the trigger-phrase score (max 3) minus anti-pattern deductions. normalize the 15-point raw total to 0-10 (divide by 15, multiply by 10). round to one decimal.

## decision points

- **the skill body has no headers at all**: it might be using a flat narrative style. parse for the content of each component instead of strict header matching. give partial credit.
- **multiple skills in one file**: split it into separate audits. one SKILL.md per skill is the right unit.
- **the description is missing entirely**: this is a hard fail (score capped at 4.0) because trigger matching breaks without it.

## output contract

a structured audit report with:
- overall score (0-10, one decimal)
- per-component checklist (✓ / ⚠ / ✗) with one-line notes
- top 3 concrete suggestions for improvement (ranked by impact on tier-1 score)
- the predicted tier-1 score if the suggestions are applied

## outcome signal

after the author applies the suggestions, the skill's actual tier-1 score (from list_skill_scores) moves up by at least the predicted delta. if it does not, the audit's heuristics need tuning.

## notes

- structural completeness is necessary but not sufficient. a perfectly structured skill that does the wrong thing is still bad. tier-2 dry-run scoring catches functional quality; this audit only catches structural quality.
- the 6-component rubric is the implexa house style. anthropic, smithery, and other registries use looser structures - their high-scoring skills usually still hit most of these components even when not labeled.
- when in doubt, copy the structure of an existing high-scored implexa-curated skill (look at /scores filtered to source=implexa).

related skills

semantically similar in the cross-vendor index

clawhub

81% match

Skill Audit

Audit and score OpenClaw AgentSkills against structural compliance, quality standards, and OpenClaw-specific architecture patterns. Produces a 0-100 score wi...


added explicit input/output labels to each procedure step, expanded decision points with concrete branching logic, added anti-pattern scanning (vague steps, missing error handling, unmeasurable outcomes, external connection docs), clarified the scoring rubric with point breakdowns, and added structured output contract with markdown report format.

skill quality audit

intent

before publishing a skill (or after seeing one score lower than expected), run this audit against the 6-component rubric. the audit is structural only: it does not evaluate whether the skill works at runtime (that is tier-2). but structural completeness is what tier-1 scoring measures, and tier-1 drives the leaderboard rank. trigger this skill when an author asks "why did my skill score low" or "is this skill good", or wants to self-review before submission.

inputs

- the SKILL.md body as text (paste it directly, upload the file, or pass a file path)
- optional: the skill slug, if you want to compare against a previously-audited version
- optional: tier-1 scoring thresholds (defaults to implexa standards)

no external connections required. this skill runs entirely on text parsing.

procedure

step 1, parse and validate the frontmatter

extract yaml frontmatter (lines between opening and closing ---).

input: raw SKILL.md text
output: frontmatter dict or null if missing

audit checklist for frontmatter:

- is frontmatter present and valid yaml?
- does description exist?
- does description include "trigger when" or other explicit trigger language?
- is description under 280 chars (embedding sweet spot)?
- is each trigger phrase under 10 words (clear signal)?

if the frontmatter is missing or the description is empty, mark it as ✗ and cap the final score at 4.0.
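
a minimal sketch of step 1 in python, assuming the frontmatter holds only single-line `key: value` pairs so no yaml library is needed. the names `parse_frontmatter` and `audit_description` are hypothetical, not part of implexa's tooling. the yaml-validity and words-per-phrase checks are left out.

```python
import re

MAX_DESCRIPTION_CHARS = 280  # threshold from the checklist above
# trigger openers named in step 3; extend as needed
TRIGGER_PATTERN = re.compile(r"trigger when|use this when|activate if", re.IGNORECASE)


def parse_frontmatter(text):
    """Return a dict of top-level `key: value` pairs, or None if no frontmatter.

    Assumption: values are single-line scalars, so this is not a full YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return None  # opening delimiter without a closing one
    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def audit_description(frontmatter):
    """Run the step 1 checks and return a dict of check name -> bool."""
    description = (frontmatter or {}).get("description", "")
    return {
        "frontmatter_present": frontmatter is not None,
        "description_present": bool(description),
        "has_trigger_language": bool(TRIGGER_PATTERN.search(description)),
        "under_280_chars": 0 < len(description) < MAX_DESCRIPTION_CHARS,
    }
```

any `False` under `frontmatter_present` or `description_present` corresponds to the ✗ and the 4.0 cap described above.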

step 2, extract and validate the 6 required components

scan the SKILL.md body for these section headers (case-insensitive, exact match not required):

- intent: what the skill exists to do, in 1-3 sentences, focuses on the why not the how
- inputs: what the skill needs to run, tools, data, user context, external connections, prereqs
- procedure: numbered or stepped sequence, each step has explicit inputs and outputs
- decision points: if-else branches, fallbacks, edge cases ("if X then Y, else Z" patterns)
- output contract: what the skill produces, data format, file location, expected length
- outcome signal: how to know it worked, observable success criteria, measurable result

input: SKILL.md body text
output: dict with key=component_name, value=extracted_text or null

for each component, assign a presence score:

- 0 = missing or blank
- 1 = present but under 50 words, vague, or lacks specificity
- 2 = present and substantive (50+ words, concrete details, actionable)
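
the extraction and presence scoring could be sketched as below. this assumes the raw file marks sections with markdown `##` headers, and it uses the 50-word threshold as the only test for "substantive". judging vagueness or specificity still needs a reader (or a model). `extract_components` and `presence_score` are hypothetical names.

```python
import re

COMPONENTS = [
    "intent", "inputs", "procedure",
    "decision points", "output contract", "outcome signal",
]
SUBSTANTIVE_WORDS = 50  # threshold from the rubric above


def extract_components(body):
    """Map each component name to its section text, or None if no header matches.

    Assumption: sections start at `##` headers; deeper headers such as
    `### step 1` stay inside their parent section.
    """
    found = {name: None for name in COMPONENTS}
    current = None
    for line in body.splitlines():
        header = re.match(r"^##\s+(.*\S)\s*$", line)
        if header:
            title = header.group(1).lower()
            # loose match: the header only needs to contain the component name
            current = next((name for name in COMPONENTS if name in title), None)
            if current is not None and found[current] is None:
                found[current] = ""
            continue
        if current is not None:
            found[current] += line + "\n"
    return found


def presence_score(section_text):
    """0 = missing or blank, 1 = present but thin, 2 = 50+ words."""
    if section_text is None or not section_text.strip():
        return 0
    return 2 if len(section_text.split()) >= SUBSTANTIVE_WORDS else 1
```

summing `presence_score` over the six sections gives the component points (max 12) used in step 5.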

step 3, score trigger phrases in the description

parse the description field for explicit trigger language (phrases starting with "trigger when", "use this when", "activate if", etc.).

input: frontmatter description
output: list of trigger phrases extracted

count unique high-signal phrases. score:

- 0 points if no trigger phrases found
- 1 point if 1-2 phrases found
- 2 points if 3-5 phrases found
- 3 points if 6+ distinct phrases found and each under 10 words

deduct 0.5 points if trigger phrases are generic (e.g. "when you need help") rather than use-case specific.
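
the scale above maps onto a small function. this is a sketch under two assumptions the text does not settle: what counts as "generic" (the `GENERIC_PHRASES` set is a hypothetical placeholder), and what happens with 6+ phrases when one of them runs to 10 words or more (here it falls back to 2 points).

```python
GENERIC_PHRASES = {"when you need help"}  # hypothetical list; extend as needed


def score_trigger_phrases(phrases):
    """Apply the 0-3 scale, then the 0.5 deduction for generic phrasing."""
    unique = {p.strip().lower() for p in phrases if p.strip()}
    count = len(unique)
    if count == 0:
        return 0.0
    if count <= 2:
        score = 1.0
    elif count <= 5:
        score = 2.0
    elif all(len(p.split()) < 10 for p in unique):
        score = 3.0
    else:
        score = 2.0  # assumption: an overlong phrase blocks the third point
    if unique & GENERIC_PHRASES:
        score -= 0.5
    return score
```

phrases are lowercased and stripped before counting, so "Is this skill good" and "is this skill good " count once.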

step 4, scan for anti-patterns and structural debt

for each anti-pattern detected, record it with severity (minor, major, critical):

vague procedure steps (major): step text lacks concrete tool calls, inputs, or outputs. e.g. "do X carefully" without saying how.

input: procedure section text
output: list of vague steps with line numbers

missing error handling (major): no decision points for common failure modes (empty results, auth failures, rate limits, network timeout, invalid input).

input: procedure and decision points sections
output: list of missing error branches

no measurable outcome (major): outcome signal is subjective (e.g. "user feels satisfied") instead of observable (e.g. "file created at X" or "API returns 200").

input: outcome signal section
output: true if measurable, false if subjective

skill body bloat (minor): body text over 8000 chars without clear justification.

input: full skill body
output: char count

missing external connection docs (major): inputs section mentions a tool (Salesforce, HubSpot, GitHub API) but does not specify env var name, auth scope, or setup steps.

input: inputs section text
output: list of undocumented connections

deduct points per anti-pattern:

- critical = -2 points
- major = -1 point per occurrence
- minor = -0.5 points per occurrence
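
one way to record findings and total the deductions, as a sketch. `Finding`, `check_bloat`, and `total_deduction` are hypothetical names, and the critical penalty is treated as per occurrence, which the table above does not state. only the bloat check is automated here. the "without clear justification" part and the other anti-patterns need judgment.

```python
from dataclasses import dataclass

SEVERITY_PENALTY = {"critical": 2.0, "major": 1.0, "minor": 0.5}
BLOAT_LIMIT_CHARS = 8000  # limit from the bloat anti-pattern above


@dataclass
class Finding:
    name: str
    severity: str  # "minor", "major", or "critical"
    detail: str = ""


def check_bloat(body):
    """Return a minor finding when the body exceeds 8000 chars, else None."""
    if len(body) > BLOAT_LIMIT_CHARS:
        return Finding("skill body bloat", "minor", f"{len(body)} chars")
    return None


def total_deduction(findings):
    """Sum the penalty for every recorded finding (one entry per occurrence)."""
    return sum(SEVERITY_PENALTY[f.severity] for f in findings)
```

for example, two major findings plus one bloat finding total 2.5 points of deductions.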

step 5, compute normalized score and generate report

calculate raw score:

- component points (max 12, two points per component)
- plus trigger phrase points (max 3)
- minus anti-pattern deductions

raw max = 15 points

normalize to 0-10 scale: (raw_score / 15) * 10, round to one decimal place.

if the description is missing, cap at 4.0. if the procedure is missing, cap at 5.0.

input: all sub-scores from steps 1-4
output: final_score (0.0-10.0)
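
the arithmetic of step 5 as a short sketch. `final_score` is a hypothetical name. clamping a negative raw score to 0 is an assumption, since the text only gives the 0.0-10.0 output range.

```python
def final_score(component_points, trigger_points, deductions,
                description_missing=False, procedure_missing=False):
    """Normalize the raw score to 0-10 and apply the caps from step 5."""
    raw = component_points + trigger_points - deductions
    score = raw / 15 * 10
    score = max(0.0, min(10.0, score))  # assumption: clamp to the 0-10 range
    if description_missing:
        score = min(score, 4.0)
    if procedure_missing:
        score = min(score, 5.0)
    return round(score, 1)
```

worked example: 10 component points, 2 trigger points, and 1.5 points of deductions give a raw score of 10.5, which normalizes to 7.0.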

step 6, render the audit report

structure the report as:

overall score: [X.X / 10.0]

component checklist:

[✓ / ⚠ / ✗] intent:
[✓ / ⚠ / ✗] inputs:
[✓ / ⚠ / ✗] procedure:
[✓ / ⚠ / ✗] decision points:
[✓ / ⚠ / ✗] output contract:
[✓ / ⚠ / ✗] outcome signal:

(✓ = present and substantive, ⚠ = present but thin, ✗ = missing)

top 3 improvement suggestions (ranked by impact on tier-1 score):

1. [title]: [concrete action]. impact: +[X.X] points.
2. [title]: [concrete action]. impact: +[X.X] points.
3. [title]: [concrete action]. impact: +[X.X] points.

predicted tier-1 score after improvements: [X.X / 10.0]

input: all prior analysis
output: markdown report as formatted above
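
a sketch of a renderer for the layout above. `render_report` and its argument shapes are hypothetical. the predicted score is computed as the current score plus the summed impact of the top 3 suggestions, capped at 10.0, which is an assumption because the text does not say how the prediction is derived.

```python
SYMBOLS = {2: "✓", 1: "⚠", 0: "✗"}


def render_report(score, component_scores, notes, suggestions):
    """Build the markdown report laid out in step 6.

    component_scores: dict of component name -> 0, 1, or 2
    notes: dict of component name -> one-line note
    suggestions: list of (title, action, impact) tuples, highest impact first
    """
    top = suggestions[:3]
    # assumption: predicted score = current score + summed impacts, capped at 10
    predicted = min(10.0, score + sum(impact for _, _, impact in top))
    lines = [f"overall score: {score:.1f} / 10.0", "", "component checklist:", ""]
    for name, points in component_scores.items():
        lines.append(f"{SYMBOLS[points]} {name}: {notes.get(name, '')}".rstrip())
    lines += ["", "top 3 improvement suggestions (ranked by impact on tier-1 score):", ""]
    for title, action, impact in top:
        lines.append(f"{title}: {action}. impact: +{impact:.1f} points.")
    lines += ["", f"predicted tier-1 score after improvements: {predicted:.1f} / 10.0"]
    return "\n".join(lines)
```

a caller could check `len(report) < 2000` before returning the report, to match the length limit in the output contract.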

decision points

- if the skill body has no headers at all (flat narrative style): scan the text for the content of each component by keyword matching instead of strict header matching. give partial credit (1 point) per component if the content is present but unlabeled. note this in the report under "structure notes".
- if multiple skills are in one file: flag this as an error and recommend splitting into separate SKILL.md files. do not score a multi-skill document; return the message "one SKILL.md per skill, please split and resubmit".
- if the frontmatter is missing entirely: this is a hard fail. cap the score at 4.0 and flag "frontmatter required for tier-1 matching".
- if the description exists but contains zero trigger phrases: award 0 of the 3 trigger-phrase points and recommend adding 3-5 concrete "trigger when" statements.
- if procedure steps exist but lack explicit input/output labels: mark the procedure as ⚠ (present but thin) and suggest adding "input: X, output: Y" to each step.
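
the first three branches can be detected before any scoring runs. the sketch below is one possible pre-check, with `structural_flags` as a hypothetical name. it treats more than one top-level `# ` title as a sign of multiple skills, which is a heuristic assumed here and not stated in the text. it also does not skip `#` lines inside fenced code blocks.

```python
import re


def structural_flags(text):
    """Return the decision-point messages that apply to a raw SKILL.md."""
    lines = text.splitlines()
    has_frontmatter = (
        bool(lines)
        and lines[0].strip() == "---"
        and any(line.strip() == "---" for line in lines[1:])
    )
    headers = [line for line in lines if re.match(r"^#{1,6}\s+\S", line)]
    titles = [line for line in headers if line.startswith("# ")]
    flags = []
    if not has_frontmatter:
        flags.append("frontmatter required for tier-1 matching")
    if len(titles) > 1:
        # heuristic (assumption): several top-level titles means several skills
        flags.append("one SKILL.md per skill, please split and resubmit")
    elif not headers:
        flags.append("flat narrative: fall back to keyword matching")
    return flags
```

an empty list means none of these branches apply and the audit can proceed with strict header matching.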

output contract

the skill renders a structured markdown audit report containing:

- overall score (0-10 scale, one decimal precision)
- per-component checklist with presence symbols (✓ / ⚠ / ✗) and one-line notes
- top 3 ranked improvement suggestions with quantified impact on tier-1 score
- predicted tier-1 score if author applies all suggestions
- optional "structure notes" section if the skill deviates from standard layout

the report is plain markdown, under 2000 chars, suitable for sharing directly with the skill author or posting in a review thread.

outcome signal

after the author reads the audit and applies the top 3 suggestions, re-run this skill on the revised SKILL.md. the new score should be at least the predicted delta higher than the original score. for example, if the first audit scored 6.2 and predicted +2.5 points, the second audit should score 8.7 or higher.

if the score does not rise by the predicted amount after improvements are applied, the audit heuristics or point allocation may need tuning (report as feedback to the implexa scoring team).
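
the re-audit comparison can be written as a one-line check. `prediction_held` is a hypothetical helper, and the `tolerance` default is an assumption added to absorb floating-point noise in one-decimal scores.

```python
def prediction_held(first_score, predicted_delta, second_score, tolerance=0.05):
    """True if the re-audit score rose by at least the predicted delta.

    tolerance absorbs floating-point noise; its value is an assumption.
    """
    return second_score - first_score >= predicted_delta - tolerance
```

with the example above, `prediction_held(6.2, 2.5, 8.7)` is true and `prediction_held(6.2, 2.5, 8.0)` is false. the second case is the one worth reporting as feedback.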

credits: implexa