Skill Quality Evaluator - Score any skill on 6 dimensions. Catch 30% of skills that look good but fail silently. Based on Tessl Research 2026 findings.
--- name: skill-quality-eval version: 1.0.0 description: Skill Quality Evaluator - Score any skill on 6 dimensions. Catch 30% of skills that look good but fail silently. Based on Tessl Research 2026 findings. emoji: π tags: [evaluation, quality, skill, testing, reliability, ai-agent] --- # Skill Quality Evaluator π Score any skill on 6 dimensions. Catch the 30% of skills that look good but fail silently. ## Why This Matters Tessl Research (April 2026) found: - **20% accuracy gain** when using a good skill vs no skill - **3X cost savings** when small model + right skill matches large model - **40% activation rate** β agents often fail to use available skills - **30% of evaluation tasks have leakage** β skills that seem great but aren't This skill helps you evaluate and improve your skills systematically. ## 6-Dimension Evaluation ### 1. Activation Reliability (0-100) > Can the agent find and activate this skill when needed? Checklist: - [ ] Trigger words are specific and unambiguous - [ ] Description matches actual functionality - [ ] No conflicting skills with similar triggers - [ ] Skill is discovered when user asks relevant questions **Common Issues:** - Vague description β agent doesn't know when to use it - Missing trigger words β skill never activates - Too broad β activates when it shouldn't **Score Guide:** - 90+: Agent activates correctly 95%+ of the time - 70-89: Activates in most relevant contexts - 50-69: Sometimes activates, sometimes misses - <50: Agent rarely finds/uses this skill ### 2. Task Coverage (0-100) > Does the skill handle the tasks it claims to cover? Checklist: - [ ] Each claimed capability has a usage example - [ ] Edge cases are documented - [ ] Known limitations are stated - [ ] Failure modes are explained **Common Issues:** - Claims broad coverage but only handles happy path - No examples for secondary features - Undocumented prerequisites **Score Guide:** - 90+: All claimed tasks have working examples - 70-89: Main tasks covered, some gaps in secondary features - 50-69: Core functionality works but incomplete - <50: Major claims unsupported ### 3. Instruction Clarity (0-100) > Can the agent follow the instructions without confusion? Checklist: - [ ] Instructions are step-by-step, not vague guidelines - [ ] Decision points have clear criteria - [ ] Output format is specified - [ ] Anti-patterns are listed **Common Issues:** - "Do X when appropriate" β when is appropriate? - Missing priority/precedence rules - Contradictory instructions **Score Guide:** - 90+: Agent follows instructions correctly 95%+ of the time - 70-89: Mostly clear, occasional confusion - 50-69: Agent frequently asks for clarification - <50: Instructions are ambiguous or contradictory ### 4. Leakage Resistance (0-100) > Does the evaluation actually test the skill, or does it leak answers? Checklist: - [ ] Examples don't contain verbatim solutions - [ ] Test tasks require genuine skill application - [ ] No shortcut paths that bypass skill content - [ ] Evaluation criteria measure real capability **Common Issues (from Tessl Research):** - Example tasks are too similar to skill content - Skill contains answers verbatim - Test can be solved by pattern matching without understanding **Score Guide:** - 90+: No leakage, genuine skill testing - 70-89: Minor leakage that doesn't significantly inflate scores - 50-69: Moderate leakage, scores may be 10-20% inflated - <50: Major leakage, evaluation results unreliable ### 5. Model Compatibility (0-100) > Does the skill work across different model sizes? Checklist: - [ ] Tested with at least 2 model sizes - [ ] Works with smaller/cheaper models - [ ] Performance difference between models documented - [ ] Minimum model requirements stated **Tessl Finding:** Small model + right skill β Large model at 3X lower cost. **Score Guide:** - 90+: Works well with small models (haiku-level) - 70-89: Works with medium models (sonnet-level) - 50-69: Requires large models (opus-level) - <50: Even large models struggle ### 6. Real-World Value (0-100) > Does using this skill actually improve outcomes vs no skill? Checklist: - [ ] Measurable improvement over baseline - [ ] Users would notice the difference - [ ] Saves time or reduces errors - [ ] No negative side effects **Score Guide:** - 90+: Clear, significant improvement (20%+ accuracy gain) - 70-89: Noticeable improvement - 50-69: Marginal improvement - <50: No improvement or negative impact ## Evaluation Report Template ```markdown # Skill Evaluation Report **Skill**: [name] **Version**: [version] **Date**: YYYY-MM-DD **Evaluator**: [agent/session] ## Overall Score: XX/100 | Dimension | Score | Status | |-----------|-------|--------| | Activation Reliability | XX | π’/π‘/π΄ | | Task Coverage | XX | π’/π‘/π΄ | | Instruction Clarity | XX | π’/π‘/π΄ | | Leakage Resistance | XX | π’/π‘/π΄ | | Model Compatibility | XX | π’/π‘/π΄ | | Real-World Value | XX | π’/π‘/π΄ | π’ 80+ | π‘ 50-79 | π΄ <50 ## Critical Issues 1. [Issue] β [Fix] ## Improvement Recommendations 1. [Recommendation] β [Expected impact] ## Quick Wins (easy fixes, big impact) 1. [Fix] β +X points on [dimension] ``` ## Usage ### Evaluate a skill ``` Read the skill's SKILL.md and evaluate on all 6 dimensions. Generate the evaluation report. Save to memory/evaluations/<skill-name>-eval.md ``` ### Improve a skill based on evaluation ``` 1. Read evaluation report 2. Focus on lowest-scoring dimension 3. Apply quick wins first 4. Re-evaluate 5. Repeat until all dimensions β₯ 70 ``` ### Batch evaluate all skills ``` For each skill in skills/ directory: 1. Read SKILL.md 2. Evaluate on 6 dimensions 3. Generate report 4. Identify top 3 improvements Save summary to memory/evaluations/batch-report.md ``` ## Anti-Patterns to Detect | Pattern | Issue | Fix | |---------|-------|-----| | "Do X when appropriate" | Vague trigger | Define specific conditions | | No examples | Agent can't learn | Add 3+ concrete examples | | Only happy path | Fragile in production | Add error handling examples | | Verbatim solutions | Leakage risk | Use different examples for eval | | No model requirements | Unknown compatibility | Test with 2+ model sizes | ## License MIT
don't have the plugin yet? install it then click "run inline in claude" again.