Detects and mitigates hallucinations in agent outputs by self-checking facts, verifying claims, and correcting unsupported or contradictory information.
# SKILL.md - Anti-Hallucination Protocol
> *"The first principle is that you must not fool yourself — and you are the easiest person to fool."* — Richard Feynman
A runtime hallucination detection and mitigation skill for OpenClaw agents. Recognises the cognitive and behavioral signs of hallucination, then intervenes to restore grounded reasoning.
**Based on 2026 Research:** HalluClear, MARCH, AgentHallu, Epistemic Stability, CRITIC, MetaCognition Patterns, ToolHalla Guardrails.
## The Philosophy
**Detection > Prevention.** Hallucinations cannot be fully prevented — LLMs generate text by predicting probable tokens, not by verifying truth. The question is not whether your agent will hallucinate. It is whether your agent catches itself when it does.
**Self-Awareness > External Guardrails.** An agent that monitors its own reasoning is more effective than one that relies solely on post-hoc validation. The metacognitive loop — observe, critique, correct — must be internal.
**Specificity > Generality.** Generic "be careful" instructions fail. Specific sign recognition, concrete intervention protocols, and measurable confidence thresholds succeed.
## When to Activate
**Automatic triggers — ANY of these activates the anti-hallucination protocol:**
- [ ] Agent makes a factual claim without citation or source
- [ ] Agent generates a file path, URL, or identifier that does not exist
- [ ] Agent reports success without verifying the result
- [ ] Agent provides a specific date, name, or number from memory without checking
- [ ] Agent expresses high confidence (>90%) on a complex, uncertain topic
- [ ] Agent contradicts information in its own context or memory files
- [ ] Agent produces a tool call with parameters it cannot verify
- [ ] Agent offers analysis on data it has not actually read
- [ ] Agent describes system state without checking live status
- [ ] User expresses doubt: "Are you sure?" / "Can you verify that?"
**Implicit triggers (monitor continuously):**
- [ ] Tool call returns error but agent continues as if successful
- [ ] Agent invents plausible-sounding but unverified details
- [ ] Agent generalises from a single example
- [ ] Agent uses absolute language ("always", "never", "certainly") on probabilistic topics
## The Hallucination Taxonomy
Know what you're looking for:
| Type | Description | Example |
|------|-------------|---------|
| **Intrinsic Factual** | Contradicts source material | Claims file exists when `read` returned error |
| **Intrinsic Semantic** | Misrepresents meaning | Misreads config flag, draws wrong conclusion |
| **Intrinsic Temporal** | Wrong timing/sequence | "Yesterday I did X" when memory shows no record |
| **Extrinsic Factual** | Adds unverifiable but plausible info | Invents a specific version number not in docs |
| **Extrinsic Non-Factual** | Adds obviously false info | Claims a feature exists that was never built |
| **Reasoning Error** | Correct facts, wrong conclusion | "Disk is 90% full, therefore upgrade needed" (ignores tmp files) |
| **Tool Hallucination** | Fabricates tool results | Reports command output without running it |
| **Self-Hallucination** | False memory of own actions | "I already fixed that" when fix not in git |
## The Recognition Protocol (5-Second Self-Check)
Before ANY output that contains facts, claims, or recommendations, ask:
```markdown
### Reality Check (5s)
1. SOURCE: Do I have direct evidence for this claim? (file read, tool output, live check)
2. VERIFICATION: Can I verify this right now with a tool call?
3. CONFIDENCE: Am I >80% confident? If yes, am I >95% confident? Flag if yes.
4. MEMORY: Is this from a file I actually read this session, or "feels right"?
5. CONTRADICTION: Does this contradict anything in my context or memory?
```
**If ANY check fails:** Escalate to Grounding Protocol (below).
## The Grounding Protocol (When Signs Detected)
### Step 1: Stop and Flag
```
⚠️ HALLUCINATION CHECK TRIGGERED
Type: [intrinsic/extrinsic/reasoning/tool/self]
Claim: [the specific claim being questioned]
Confidence: [self-assessed %]
Evidence: [what I have / what I lack]
```
### Step 2: Verify or Withdraw
**If verifiable in <30s:**
- Run the tool call to check
- Report actual result
- Update confidence based on evidence
**If not immediately verifiable:**
- Withdraw the claim
- Replace with: "I do not have direct evidence for [X]. My sources: [list]."
- Offer to verify if user wants
**If partially verifiable:**
- Downgrade confidence explicitly
- Distinguish verified from inferred: "Confirmed: [A]. Inferred: [B]."
### Step 3: Document the Correction
Add to `memory/YYYY-MM-DD.md`:
```markdown
### Hallucination Correction — [Time]
- Claim: [what was wrong]
- Type: [taxonomy type]
- How caught: [which trigger fired]
- Correction: [what replaced it]
- Lesson: [pattern to watch for]
```
## The Confidence Calibration Rules
**Never express certainty you don't have:**
| Situation | Max Confidence Allowed | Required Action |
|-----------|------------------------|-----------------|
| Read file this turn | 95% | Cite line number |
| Read file earlier | 85% | Re-read if challenged |
| Memory from past session | 70% | Flag as "from memory" |
| Inferred from pattern | 60% | State inference chain |
| Heard in training data | 50% | Treat as unverified |
| Pure intuition | 30% | Do not state as fact |
## The Tool-Use Guardrails
**Before reporting tool results:**
1. Did the tool actually execute? (check for error output)
2. Did I read the full output? (not just first few lines)
3. Did I understand the output correctly? (re-read if ambiguous)
4. Did I report what it says, not what I expected it to say?
**Common tool hallucinations to watch for:**
- Reporting `grep` results without checking if match is real
- Claiming file exists based on path construction, not `ls`/`test`
- Interpreting error messages as success (e.g., "not found" = "confirmed absent")
- Summarising JSON without parsing it properly
- Inventing exit codes ("command returned 0" when you didn't check)
## The Multi-Agent Validation Pattern
When available (C1/C2/C3 coordination):
```markdown
### Cross-Agent Verification
1. State claim to peer agent
2. Peer evaluates: [agree / disagree / cannot verify]
3. If disagree: both re-check sources
4. Consensus required for >90% confidence claims
5. Log disagreement in coordination channel
```
**For single-agent operation:** Use simulated peer review — state the claim, then critique it as if from an adversarial position.
## The Metacognitive Loop (Continuous)
Every 5-10 minutes of active work, or at natural breakpoints:
```markdown
### Metacognitive Checkpoint
- [ ] What have I claimed since last checkpoint?
- [ ] Which claims were verified vs assumed?
- [ ] Did any tool call fail silently?
- [ ] Am I building on a potentially false foundation?
- [ ] Should I re-verify my starting assumptions?
```
## Recovery Patterns
**When caught hallucinating:**
1. **Acknowledge immediately.** Do not double down. Do not deflect. "I was wrong about [X]."
2. **Correct explicitly.** State the correction clearly, not buried in explanation.
3. **Explain the gap.** "I stated [X] because [reason]. The actual state is [Y]."
4. **Update memory.** Log the pattern so future-you watches for it.
5. **Do not apologise excessively.** One clear correction beats three apologies.
**When uncertain mid-task:**
1. **State uncertainty.** "I am not confident about [X]. Here is what I know: [...]"
2. **Offer verification path.** "I can check this by running [tool]."
3. **Do not guess to maintain flow.** A pause for verification beats a cascade of errors.
## Integration with OpenClaw
**Add to AGENTS.md startup checks:**
```markdown
## Anti-Hallucination Protocol
Before any factual claim:
1. Run 5-Second Self-Check
2. If triggered, execute Grounding Protocol
3. Log corrections to memory
```
**Add to every SKILL.md:**
```markdown
## Hallucination Risks
[List domain-specific hallucination patterns for this skill]
```
**Add to TOOLS.md:**
```markdown
## Tool Verification Checklist
- [ ] Command executed successfully?
- [ ] Full output read and understood?
- [ ] Result reported accurately, not inferred?
```
## Metrics
Track in `memory/hallucination-log.md`:
```markdown
## 2026-05-13 — Session Log
- Total claims made: [N]
- Verified claims: [N]
- Hallucinations caught: [N]
- Hallucinations missed (user caught): [N]
- Recovery time: [avg seconds]
```
## Anti-Patterns (What NOT to Do)
- ❌ "I believe..." — belief without evidence is a red flag
- ❌ "It should be..." — should is not is. Check.
- ❌ "As I mentioned earlier..." — verify you actually mentioned it
- ❌ "The system is..." — which system? When did you last check?
- ❌ "That means..." — does it? Trace the inference chain
- ❌ "Obviously..." — obvious to whom? On what evidence?
## Sources
- ToolHalla.ai (2026) — AI Hallucination Guardrails That Actually Work
- Zylos Research (2026) — MetaCognition Patterns for AI Agent Self-Monitoring
- Zylos Research (2026) — LLM Hallucination Detection: State of the Art
- CallSphere.ai (2026) — Hallucination Detection and Mitigation in AI Agent Systems
- arXiv:2604.17284 — HalluClear: Diagnosing Hallucinations in GUI Agents
- arXiv:2603.24579 — MARCH: Multi-Agent Reinforced Self-Check
- arXiv:2603.10047 — Toward Epistemic Stability
- arXiv:2601.06818 — AgentHallu: Benchmarking Hallucination Attribution
---
*Version 1.0 — May 2026 — Based on 2026 research landscape*
*Remember: The agent that catches itself hallucinating is more valuable than the agent that never does.*
don't have the plugin yet? install it then click "run inline in claude" again.