---
name: cron-failure-runbook
version: 1.0.0
description: "Runbook for diagnosing failed cron jobs, LaunchAgents, heartbeats, and unattended automation by reproducing the scheduler context, preflighting dependencies, and closing with verified evidence."
author: nissan
tags:
- cron
- operations
- runbook
- automation
metadata:
openclaw:
emoji: "⏱️"
network:
outbound: false
---
# Cron Failure Runbook
Use when a scheduled job, LaunchAgent, cron task, heartbeat step, or nightly automation fails, silently no-ops, produces incomplete output, or repeatedly generates dream-cycle failure proposals.
## Goal
Turn unattended failures into reproducible evidence and one of three outcomes:
1. Fixed and verified.
2. Deferred with owner/date/reason.
3. Escalated with the exact missing credential, approval, service, or runtime condition.
## Procedure
1. Identify the scheduler context.
   - Job name, plist/cron entry, command, cwd, shell, user, and expected environment.
   - Last successful run and last failed/no-op run.
2. Reproduce in the same runtime lane.
   - Run the exact command manually with the same env source where practical.
   - Capture stdout, stderr, exit code, cwd, PATH, and relevant env variable presence without printing secret values.
   - If the job depends on OpenClaw model calls, verify it uses gateway/Codex routing rather than raw OPENAI_API_KEY.
3. Run preflights before the expensive or external step.
   - Auth: prove the running process can read the needed secret and make the smallest live API call.
   - Files: prove input paths exist and output directories are writable.
   - Network/service: prove target health endpoint or API is reachable.
   - Approval: prove an external write has approval or a preapproved workflow flag.
4. Classify the failure.
   - auth: missing/expired token, wrong vault, wrong runtime env, insufficient scope.
   - runtime: wrong shell, PATH, Python/Node version, cwd, launchd env, permissions.
   - input: missing/stale source files, empty queue, unexpected schema.
   - external: API outage, 401/403, rate limit, deploy provider issue.
   - logic: script exits zero but produces no expected artifact/action.
5. Close the loop.
   - Fix code/config if local and reversible.
   - Add a dry-run or preflight mode if the job cannot be safely tested live.
   - Update the relevant STATUS/runbook/memory with evidence.
   - If unresolved, record blocker, owner, next command, and alert threshold.
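The reproduction step can be sketched in Python as follows. This is a minimal illustration, not part of the original runbook: the function name `reproduce`, the `timeout_s` default, and the example secret name `EXAMPLE_API_TOKEN` are all assumptions. It runs a command with an explicitly supplied environment and records only whether each secret is set, never its value.

```python
import os
import subprocess
from pathlib import Path


def reproduce(command, cwd, env, secret_names=(), timeout_s=300):
    """Run a job command once and return evidence without leaking secret values.

    `env` should mirror the scheduler's environment, not the interactive shell's.
    `timeout_s` is an illustrative default, not a value from the runbook.
    """
    result = subprocess.run(
        command,
        cwd=cwd,
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return {
        "command": command,
        "cwd": str(Path(cwd).resolve()),
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
        "PATH": env.get("PATH", ""),
        # Record only whether each secret is set, never its value.
        "secrets_present": {name: bool(env.get(name)) for name in secret_names},
    }


if __name__ == "__main__":
    # Hypothetical minimal environment, similar in spirit to what cron provides.
    minimal_env = {"PATH": "/usr/bin:/bin", "HOME": os.path.expanduser("~")}
    evidence = reproduce(
        ["/bin/sh", "-c", "echo ok; exit 3"],
        cwd=".",
        env=minimal_env,
        secret_names=["EXAMPLE_API_TOKEN"],
    )
    print(evidence["exit_code"], evidence["secrets_present"])
```

Passing `env` explicitly is the point of the sketch: a command that succeeds in an interactive shell but fails with a minimal `PATH` usually points at the runtime category rather than at the script's logic.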
## Verification Evidence
Every cron fix needs at least one of:
- Manual reproduction command with exit code and expected output.
- Preflight-only or dry-run output proving dependencies are healthy.
- Scheduler log excerpt showing the next run succeeded.
- A deliberate deferred/blocked entry with owner, reason, and next check date.
## Dream-Cycle Specific Checks
For dream-cycle failures:
- `bash -n scripts/dream-cycle.sh`
- `python3 -m py_compile` for every Python script touched by the cycle.
- `scripts/task-quality-judge.py --since 7 --dry-run`
- `scripts/skill-evolver.py --since 7 --min-failures 2 --dry-run`
- `scripts/dream-recurring-issues.py --since 7 --min-count 3 --dry-run`
- `scripts/dream-cycle-action-summary.py --since-hours 26 --dry-run`
Do not mark dream-cycle work complete if proposal files are merely pending. There must be a lifecycle status, a summary, and a next action.
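The `python3 -m py_compile` check can also be run over several files at once from Python. The sketch below is an illustration only: the helper name `compile_check` is hypothetical, and the assumption that the touched scripts all live under `scripts/` is inferred from the paths listed above.

```python
import py_compile
from pathlib import Path


def compile_check(paths):
    """Return {path: None or error text} for each Python file in `paths`."""
    results = {}
    for path in paths:
        try:
            # doraise=True raises PyCompileError instead of printing to stderr.
            py_compile.compile(str(path), doraise=True)
            results[str(path)] = None
        except (py_compile.PyCompileError, OSError) as exc:
            # OSError covers files that are missing or unreadable.
            results[str(path)] = str(exc)
    return results


if __name__ == "__main__":
    # Assumption: the scripts touched by the cycle are under scripts/.
    for name, error in compile_check(sorted(Path("scripts").glob("*.py"))).items():
        print("FAIL" if error else "ok  ", name, error or "")
```

Like `python3 -m py_compile`, this only proves the files parse; it says nothing about whether the dry-run commands succeed.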
added explicit inputs section with secret and external connection guidance, expanded procedure steps with input/output pairs and edge cases for auth expiry and rate limits, extracted implicit decision logic into decision points section, documented dream-cycle verification as output contract requirements, and added outcome signal for scheduler success verification.
---
name: cron-failure-runbook
slug: cron-failure-runbook
description: runbook for diagnosing failed cron jobs, launchagents, heartbeats, and unattended automation by reproducing scheduler context, preflighting dependencies, and closing with verified evidence
author: nissan
tags:
- cron
- operations
- runbook
- automation
metadata:
openclaw:
emoji: "⏱️"
network:
outbound: false
---
# Cron Failure Runbook
## intent
use this skill when a scheduled job, launchagent, cron task, heartbeat step, or nightly automation fails, silently no-ops, produces incomplete output, or repeatedly generates dream-cycle failure proposals. the goal is to turn unattended failures into reproducible evidence and land on one of three outcomes: fixed and verified, deferred with owner/date/reason, or escalated with the exact missing credential, approval, service, or runtime condition.
## inputs
- scheduler context: job name, plist or cron entry, command string, working directory, shell, user account, and expected environment variables
- access to the machine or container where the job runs
- ability to execute commands with the same user/permissions as the scheduler
- (optional) vault or secret manager connection if job uses auth tokens
- (optional) openclaw gateway/codex routing config if job calls models
- (optional) access to scheduler logs (cron.log, system.log, launchd logs on macos)
- (optional) access to input files, output directories, and external api endpoints the job depends on
## procedure
1. identify the scheduler context.
   - input: job name or identifier
   - action: locate the job entry (crontab, launchd plist, systemd unit, github actions workflow, etc.), extract the command string, working directory, shell, user account, and source of environment variables
   - action: check scheduler logs for the last successful run timestamp and the last failed or no-op run timestamp
   - output: documented context (cwd, shell, user, env source, last-run and last-fail dates)
2. reproduce in the same runtime lane.
   - input: documented scheduler context from step 1
   - action: run the exact command manually, using the same shell, user, and cwd where practical
   - action: if the job sources environment from a profile or rc file, source that file first
   - action: capture stdout, stderr, exit code, the value of PATH, and the presence of any job-specific env variables, without printing secret values
   - action: if the job makes openclaw model calls, verify it routes through gateway/codex rather than using raw OPENAI_API_KEY
   - output: reproduction output with exit code, stdout/stderr excerpt, and env snapshot (redacted)
   - edge case: if the job runs as a different user, run the command as that user with `sudo -u <user>` or `su - <user>`; note that `su -` starts a login shell, so its environment can be richer than what cron or launchd provides
   - edge case: network or service latency may make manual runs time out or behave differently from scheduled runs; note timing differences between the two contexts
3. run preflights before expensive or external steps.
   - input: reproduction output and job command
   - action: auth preflight: prove the running process can read the needed secret without printing it (for example `test -r "$SECRET_PATH"`, or a `vault kv get` with its output discarded), then make the smallest live api call to verify token validity and scope
   - action: files preflight: prove input paths exist (`test -f` or `ls`), output directories are writable (`touch` a test file), and schemas match (sample first row of input files if applicable)
   - action: network/service preflight: curl the target health endpoint or api with a short timeout, confirm response code and latency
   - action: approval preflight: if the job writes to prod or makes external changes, confirm an approval workflow was executed or a preapproved flag is set
   - output: preflight checklist with pass/fail for auth, files, network, approval
   - edge case: api rate limits may block preflight calls; use backoff or skip if quota is low
   - edge case: auth tokens may have expired since last successful run; check token expiry time if readable
4. classify the failure.
   - input: reproduction output, preflights, and scheduler logs
   - action: map the failure to one or more categories:
     - auth: missing or expired token, wrong vault, wrong runtime env, insufficient scope
     - runtime: wrong shell, PATH, python/node version mismatch, working directory not found, permissions denied
     - input: missing or stale source files, empty queue, unexpected schema, size limits exceeded
     - external: api outage, 401/403, rate limit hit, deploy provider issue, network unreachable
     - logic: script exits zero but produces no expected artifact or downstream action
   - output: classified failure type with supporting evidence from preflights and logs
5. close the loop.
   - input: classified failure and reproduction steps
   - action: if failure is local and reversible, fix code or config and rerun the reproduction from step 2 to confirm
   - action: if failure cannot be safely tested live, add a dry-run or preflight-only mode and document the change
   - action: update status documents, runbook notes, or shared memory with evidence and the resolution (fixed, deferred, or escalated)
   - action: if unresolved, record the exact blocker (e.g., "waiting for api key from oncall"), owner, next diagnostic command to run, and alert threshold (how many consecutive failures before paging)
   - output: documented resolution with date, action taken, and next check date if deferred
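the files and secret parts of the step 3 preflight checklist can be sketched in python. this is an illustration under stated assumptions: the function names `files_preflight` and `secrets_preflight` and the check labels are hypothetical, and the network and approval preflights are left out because they depend on the job's own endpoints and workflow.

```python
import os
import tempfile
from pathlib import Path


def files_preflight(input_paths, output_dir):
    """Return {check label: passed} for input existence and output writability."""
    checks = {}
    for path in input_paths:
        checks[f"input exists: {path}"] = Path(path).is_file()
    try:
        # Prove writability by creating a temp file, which is removed on close.
        with tempfile.NamedTemporaryFile(dir=output_dir):
            pass
        checks[f"output writable: {output_dir}"] = True
    except OSError:
        checks[f"output writable: {output_dir}"] = False
    return checks


def secrets_preflight(names, env=None):
    """Report whether each secret is set, without ever returning its value."""
    env = os.environ if env is None else env
    return {f"secret set: {name}": bool(env.get(name)) for name in names}


def failed(checks):
    """List the labels of checks that did not pass."""
    return [label for label, ok in checks.items() if not ok]
```

a pass here only shows the secret is present; the smallest live api call from the auth preflight is still needed to prove the token is valid and has the right scope.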
## decision points
- if reproduction succeeds but scheduler fails: the issue is in the scheduler environment (env vars, user permissions, working directory). go back to step 1 and audit launchd/cron config for env overrides.
- if reproduction fails and preflights fail: the issue is auth, files, or network. fix the auth/file/network issue and retest step 2.
- if reproduction fails but preflights pass: the issue is logic or a race condition. add logging to the job script and rerun from a scheduler context (wait for next cron tick or trigger manually).
- if the job uses openclaw and makes raw api calls instead of gateway routing: reroute through gateway/codex to enable consistent logging and rate limiting.
- if auth token is expired: rotate or refresh the token, update the vault/secret store, and confirm the scheduler can read the new value.
- if input files are stale or missing: confirm the upstream job that produces them succeeded. add a dependency check to the job (e.g., "fail if input file is older than 24 hours").
- if rate limit is hit: add exponential backoff to the job, check quota usage in the api console, or request a higher limit.
- if the job is deferred pending an approval or external action: set a reminder to follow up by the target date, and add the blocker to an oncall handoff doc.
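the first three decision points form a small truth table over two observations: whether the manual reproduction succeeded and whether the preflights passed. a minimal sketch follows; the function name `next_action` and the returned strings are illustrative, not part of the skill.

```python
def next_action(reproduction_ok, preflights_ok, scheduler_ok=False):
    """Map reproduction/preflight results to the next step in the runbook."""
    if reproduction_ok and not scheduler_ok:
        # Works by hand but not under the scheduler: suspect env, user, or cwd.
        return "audit scheduler config for env overrides (step 1)"
    if not reproduction_ok and not preflights_ok:
        return "fix the failing auth/file/network preflight, then retest step 2"
    if not reproduction_ok and preflights_ok:
        # Dependencies are healthy, so the script itself is the suspect.
        return "add logging and rerun from a scheduler context (logic or race)"
    return "no failure reproduced; confirm with a scheduler log excerpt"


if __name__ == "__main__":
    print(next_action(reproduction_ok=True, preflights_ok=True))
```

the remaining decision points (expired token, stale inputs, rate limits, pending approval) are refinements within these branches rather than separate top-level outcomes.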
## output contract
successful resolution of a cron failure produces one of the following artifacts:
1. fixed: merged commit or deployed config change that restores the job; manual reproduction shows exit code 0 and expected output
2. deferred: documented entry in runbook or status board with owner name, reason (e.g., "waiting for oncall to rotate api key"), target resolution date, and the exact next diagnostic command
3. escalated: documented blocker with the missing resource (api key name, approval workflow, service name, permission), owner, and a link to a ticket or to the owner who was paged
any fix or deferral must include at least one of:
- manual reproduction command with exit code and expected output
- preflight-only or dry-run output proving all dependencies are healthy
- scheduler log excerpt (last 10 lines) showing the next run succeeded after the fix
- a deliberate deferred entry with owner, reason, next check date, and the unresolved command that will be rerun
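a deferred entry is only useful if none of its fields are blank. one way to enforce that is a small record type, sketched below; the class name `DeferredEntry`, its field names, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DeferredEntry:
    owner: str
    reason: str
    next_check: date
    next_command: str  # the unresolved command that will be rerun

    def missing_fields(self):
        """Return the names of text fields that are empty or whitespace only."""
        text_fields = ("owner", "reason", "next_command")
        return [name for name in text_fields if not getattr(self, name).strip()]


if __name__ == "__main__":
    # Hypothetical example values.
    entry = DeferredEntry(
        owner="nissan",
        reason="waiting for oncall to rotate api key",
        next_check=date(2030, 1, 15),
        next_command="",
    )
    print(entry.missing_fields())
```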
dream-cycle specific jobs must include output from:
- `bash -n scripts/dream-cycle.sh` (syntax check, no errors)
- `python3 -m py_compile` for every python script touched by the cycle
- `scripts/task-quality-judge.py --since 7 --dry-run` (quality score unchanged or improved)
- `scripts/skill-evolver.py --since 7 --min-failures 2 --dry-run` (no new regressions)
- `scripts/dream-recurring-issues.py --since 7 --min-count 3 --dry-run` (recurring issues stable or decreasing)
- `scripts/dream-cycle-action-summary.py --since-hours 26 --dry-run` (summary output valid)
do not mark dream-cycle work complete if proposal files are pending. there must be a lifecycle status, summary, and next action documented.
## outcome signal
you know the skill worked when:
- the cron job runs at the next scheduled time and produces the expected output or artifact
- scheduler logs show exit code 0 and no errors in stderr
- the manual reproduction command from step 2 succeeds with the same output
- a deferred or escalated entry is filed, acknowledged by owner, and has a target follow-up date
- no duplicate failure alerts fire in the next 24 hours for the same job
- downstream jobs or processes that depend on this job resume normal operation