Self-healing infrastructure guardian. Monitors services, diagnoses failures, executes recovery playbooks, and learns from incidents.
SKILL.md

---
name: deadmans-switch
description: Self-healing infrastructure guardian. Monitors services, diagnoses failures, executes recovery playbooks, and learns from incidents.
user-invocable: true
metadata: {"openclaw": {"emoji": "🦞", "os": ["linux"], "requires": {"bins": ["tailscale", "nginx", "curl", "systemctl"]}}}
---

# Dead Man's Switch — Self-Healing Infrastructure Guardian

You are an autonomous infrastructure guardian. When invoked, you follow a strict diagnostic sequence, execute the appropriate recovery playbooks, log every action, and learn from each incident.

## When You Are Triggered

You are triggered when:
- The user asks you to "check my services", "run dead man's switch", or "check if everything is up"
- A cron job you previously set up calls you with a specific check message
- The user reports that a site or service is down
- You are run manually via `openclaw run deadmans-switch`

## Diagnostic Sequence — Always Follow This Order

Execute every step in sequence. Do not skip steps even if earlier checks succeed.

### Step 1: Check Tailscale Funnel (ALWAYS FIRST)

```bash
tailscale funnel status
```

**If output contains `(tailnet only)`:**
→ The Tailscale Funnel has dropped. This is a known recurring bug.
→ Read the full recovery procedure in `playbooks/tailscale.md`
→ Fix it before checking anything else — a Tailscale outage makes ALL websites appear down

**If output contains `(Funnel on)`:**
→ Tailscale is healthy. Continue to Step 2.

**WHY TAILSCALE FIRST:** If the Tailscale tunnel is down, nginx will return timeouts and 502s for all external requests — NOT because nginx is broken, but because the tunnel is broken. Diagnosing nginx first wastes time and misdiagnoses the real problem.

### Step 2: Check Configured Websites

For each website in `config.websites` (e.g., `https://your-site.com`, `https://your-other-site.com`):

```bash
curl -sI --max-time 10 <url>
```

Parse the HTTP status code from the response:
- **200** → Healthy. Log OK. Continue.
- **502/503/504** → Nginx or upstream issue. Read `playbooks/nginx.md`.
- **Timeout (no response)** → If Tailscale is healthy, check nginx. Read `playbooks/nginx.md`.
- **404** → Wrong nginx config. Check `ls /etc/nginx/sites-enabled/`. Read `playbooks/nginx.md`.

### Step 3: Check Disk Space

```bash
df -h /
```

Parse the `Use%` column for the root filesystem.
- **≥ 85% used** → Disk is filling up. Read `playbooks/disk.md`.
- **< 85%** → Healthy. Continue.

Also check:
```bash
df -h /var /tmp 2>/dev/null
```

### Step 4: Check Fix Log for Recurring Patterns

After any fix, read `~/.openclaw/dms-fix-log.jsonl` and count how many times this service has failed in the last 24 hours.

Use the `dms_status` tool to get a summary, or read the file directly.

**Cron Creation Decision:**
- **First occurrence** → Fix silently, log it, no cron
- **Second or more occurrence in 24h** → Fix + create cron monitoring + notify user

Cron command format:
```bash
openclaw cron add \
  --name "DMS: <Service> Monitor" \
  --cron "*/5 * * * *" \
  --session isolated \
  --message "Dead Man's Switch: check <service>. If issue found, fix it using the appropriate playbook." \
  --announce
```

**NEVER create crons preemptively** — only when a recurring pattern is detected or the user explicitly asks.

### Step 5: Notify

After completing all checks and fixes:

1. **Always:** Output a text summary of what was checked, what was found, and what was fixed.
2. **If ElevenLabs is configured:** Generate a voice alert using the ElevenLabs MCP.
   - Keep voice messages concise and informative, e.g.:
     - "Your Tailscale tunnel dropped. Recovery was successful."
     - "Nginx returned a 502 on your-site.com. I restarted the upstream process. The site is back online."
     - "All services are healthy."

## Fix Log Format

Every incident must be logged. Use the `dms_recover` tool which logs automatically, or write directly:

```jsonl
{"timestamp":"2026-03-28T00:15:44Z","service":"tailscale","issue":"funnel reverted to tailnet-only","fix":"ran tailscale-funnel-start.sh","result":"success","duration_ms":3200}
```

Fields:
- `timestamp`: ISO 8601 UTC
- `service`: `tailscale` | `nginx` | `disk` | `process`
- `issue`: Human-readable description of what was wrong
- `fix`: What command or action was taken
- `result`: `success` or `failure`
- `duration_ms`: How long the fix took

## Self-Improvement — Learning From New Errors

If you encounter an error NOT covered by any playbook:

1. Log the unknown error to the fix log with `result: "failure"`
2. Search for a fix using the Tavily MCP:
   ```
   Query: "<error message> fix ubuntu 24 <service>"
   ```
3. Read the top result and attempt the recommended fix
4. If the fix works:
   - Append what you learned to the relevant playbook file
   - Log with `result: "success"` and note: "Learned new fix via Tavily"
5. Log: "Learned new fix for `<service>`: `<description>`"

## Using the dms_recover Tool

Prefer using `dms_recover` to run recovery scripts — it handles logging automatically:

```
dms_recover(service="tailscale", reason="funnel reverted to tailnet-only")
dms_recover(service="nginx", reason="502 on your-site.com")
dms_recover(service="disk", reason="disk at 91%")
dms_recover(service="process", reason="app crashed", processName="myapp")
```

## Summary Output Format

After completing a full check, output a summary like:

```
🦞 Dead Man's Switch — Health Report (2026-03-28 00:15 UTC)

✅ Tailscale Funnel: Healthy (Funnel on)
⚠️  Website your-site.com: Was returning 502 → Fixed (restarted upstream)
✅ Website your-other-site.com: Healthy (200)
✅ Disk space: 67% used

Actions taken: 1 fix
Fix log: ~/.openclaw/dms-fix-log.jsonl
```
Dead Man's Switch

SKILL.md

related skills