LLM Cost Watchdog

Monitors real-time LLM API costs, detects runaway loops, enforces budgets, audits code risk, and reports usage across multiple providers and models.

SkillRank score ↗

8.3 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-06-29

llm-cost-watchdog tracks real-time api spend across 30+ providers, detects unbounded loops via ast analysis, enforces budget ceilings mid-execution, and surfaces silent failures. pricing routes through litellm and openrouter with static fallback.

structure: 9.0
trigger phrases: 9.0
procedure: 8.0
edge cases: 8.0
documentation: 8.0


---
name: cost-watchdog
description: Tracks LLM spend across providers live, detects runaway loops, enforces budgets. Triggers on: cost/budget/token mentions, LLM API calls, agent workflows, batch processing.
when_to_use: "TRIGGER when: cost/budget/token mentions, LLM API calls in code, agent loops, batch processing, or `/cost-watchdog` commands."
argument-hint: "[command] — session, tail, detect, audit, price, estimate, alternatives, report, errors, validate-tokens, reset"
metadata: {"openclaw": {"emoji": "💰"}}
---

# Cost Watchdog 💰

> Real-time cost tracking layer for LLM-based agents. Prices every call live,
> detects runaway loops in code, enforces budget ceilings mid-execution.

## 1. Identity

Observes LLM spend without disturbing the agent. Prevents `$2,400-overnight-loop`
disasters by making cost a first-class concern: priced at write time,
budgeted at check time, surfaced in reports.

## 2. Triggers

Activate when:

- User mentions cost, budget, tokens, billing.
- Code contains LLM API calls (Anthropic, OpenAI, OpenRouter, Google, Groq, ...).
- Agent loops or recursive workflows.
- Batch / streaming processing with unclear bounds.
- `/cost-watchdog [command]` is invoked.

## 3. Commands

Run via `python3 scripts/cost_watchdog.py <cmd>` (or hook into your own CLI).

| Command | What it does |
|---|---|
| `session` | Spend totals from `usage.jsonl` — calls, tokens, cost, top models. |
| `report` | 24h / 7d / 30d windows with top model per window. |
| `tail [--once]` | Watch OpenClaw session JSONL and log every assistant turn. |
| `detect [--json]` | Identify which model the agent is currently using (5 probes). |
| `audit <file.py>` | AST-based code risk scan: unbounded loops, recursion, missing `max_tokens`. |
| `price <model>` | Live pricing for one model, with source + cache age. |
| `estimate <model>` | Project cost for `n` iterations of a given call. |
| `alternatives <model>` | Cheaper same-unit models. |
| `errors [--limit N]` | Recent swallowed exceptions (silent failures made visible). |
| `validate-tokens <model>` | Compare our heuristic against provider's authoritative count. |
| `reset [--all]` | Clear current-day log (`--all` also clears rolled files). |

## 4. Pricing layer

### Source chain

```
openrouter/*            → OpenRouter API (live)           → static fallback
anything else           → LiteLLM JSON (live, cached 24h) → OpenRouter (permissive) → static fallback
```

- **2600+ models** indexed across chat, completion, embedding, image, audio,
  video, rerank, OCR, search modes.
- **30+ providers** in the static fallback: Anthropic, OpenAI, Google,
  Groq, Mistral, Cohere, DeepSeek, Perplexity, xAI, Bedrock, Azure, and more.
- Unit-aware: `token`, `image`, `second`, `query`, `page`, `character`,
  `pixel`. Alternatives never compare across units.
- **Circuit breaker** opens after 3 consecutive network failures for a host;
  falls through to cache/static until the cool-down ends (60s).
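
A minimal sketch of the circuit-breaker behaviour described in the last bullet, assuming a per-host counter of consecutive failures and an injectable clock. The class name `HostCircuitBreaker` and its methods are hypothetical; the code in `scripts/_sources.py` may be organised differently.

```python
import time


class HostCircuitBreaker:
    """Opens after `threshold` consecutive failures; closes after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock      # injectable so the breaker can be tested without sleeping
        self.failures = {}      # host -> consecutive failure count
        self.opened_at = {}     # host -> clock value when the breaker opened

    def allow(self, host):
        """Return True if a network call to `host` should be attempted."""
        opened = self.opened_at.get(host)
        if opened is None:
            return True
        if self.clock() - opened >= self.cooldown:
            # Cool-down is over: close the breaker and start counting again.
            del self.opened_at[host]
            self.failures[host] = 0
            return True
        return False            # caller should use cache or static pricing instead

    def record_success(self, host):
        self.failures[host] = 0

    def record_failure(self, host):
        self.failures[host] = self.failures.get(host, 0) + 1
        if self.failures[host] >= self.threshold:
            self.opened_at[host] = self.clock()
```

A caller would check `allow(host)` before each fetch and fall through to the cache or static table when it returns `False`, which is what avoids a timeout on every call during an outage.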

### Tuning

| Env var | Default | Effect |
|---|---|---|
| `CW_PRICE_TTL_SECONDS` | 86400 (24h) | Cache lifetime. `0` = hit network every call. |
| `CW_OFFLINE` | unset | If `1`, never touch the network. |
| `CW_STATIC_ONLY` | unset | If `1`, skip live sources entirely. Used by tests. |
| `CW_LOG_DIR` | `~/.cost-watchdog` | Where usage/errors/cache files live. |
| `CW_BUDGET_USD` | unset | Ceiling; wrappers raise `BudgetExceeded` when crossed. |

### Refresh static pricing

```
python3 scripts/refresh_pricing.py
```
Regenerates `references/pricing.md` from the live sources so the offline
fallback is fresh. Aborts if fewer than 100 rows came back (protects
against clobbering on a network outage).
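
The guard above can be sketched as follows. This is an illustration only: `write_pricing_table` and `MIN_ROWS` are hypothetical names, and the real `refresh_pricing.py` may validate and write differently. The sketch combines the row-count check with a write-then-rename so a failed refresh never leaves a half-written file.

```python
import os
import tempfile

MIN_ROWS = 100  # threshold quoted in the text above


def write_pricing_table(rows, path, min_rows=MIN_ROWS):
    """Write `rows` (a list of strings) to `path` only if enough rows came back."""
    if len(rows) < min_rows:
        raise RuntimeError(
            f"only {len(rows)} rows fetched (< {min_rows}); keeping existing file"
        )
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write("\n".join(rows) + "\n")
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```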

## 5. Tracking layer — how we know what was spent

Five independent paths, all write to `~/.cost-watchdog/usage.jsonl`:

| Path | When to use | Covers streams? |
|---|---|---|
| `openclaw_tailer.py --watch` | Running OpenClaw. Zero code changes. | yes (reads completed turns) |
| `track_openai(client)` | You call OpenAI-compatible SDK (covers OpenRouter, Groq, DeepSeek, Mistral, Together, Fireworks, Cerebras, Anyscale, ...). | yes (tee'd iterator, auto-injects `stream_options={"include_usage": True}`) |
| `track_anthropic(client)` | Direct Anthropic SDK. | yes (wraps `messages.stream()`) |
| `track_gemini(model)` / `track_cohere(client)` / `track_bedrock(client)` | Direct provider SDKs. | no (add wrappers if you need streams) |
| `install_global_capture()` (httpx) | Any modern Python SDK using httpx. | **no** — streams are flagged into `errors.jsonl` so the gap is visible. Use the SDK wrappers for stream coverage. |

Usage log rotates daily: `usage.YYYY-MM-DD.jsonl`. `session_total(since=...)`
skips files outside the window before scanning.
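
One way to skip out-of-window files without opening them is to parse the date from the rotated filename. The helper below is a sketch under that assumption; `files_in_window` is a hypothetical name, not the actual `session_total` internals.

```python
import re
from datetime import date

_ROTATED = re.compile(r"^usage\.(\d{4})-(\d{2})-(\d{2})\.jsonl$")


def files_in_window(filenames, since, until):
    """Return rotated log names whose date falls within [since, until]."""
    selected = []
    for name in filenames:
        match = _ROTATED.match(name)
        if not match:
            continue  # e.g. the live usage.jsonl, which would be handled separately
        file_day = date(*map(int, match.groups()))
        if since <= file_day <= until:
            selected.append(name)
    return sorted(selected)
```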

Aggregation uses `canonical_family()` so
`claude-haiku-4-5-20251001`, `claude-haiku-4-5`, and `claude-haiku-4.5`
are one row in reports.
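
A rough sketch of how such a collapse could work for the three names above, assuming only two normalisations: dropping a trailing eight-digit date and turning dotted versions into hyphens. `canonical_family_sketch` is a stand-in; the real `canonical_family()` in `scripts/model_canon.py` likely handles more cases.

```python
import re

_DATE_SUFFIX = re.compile(r"-\d{8}$")             # e.g. -20251001
_DOTTED_VERSION = re.compile(r"(?<=\d)\.(?=\d)")  # e.g. 4.5 -> 4-5


def canonical_family_sketch(model):
    """Collapse date-stamped and dotted variants of a model name into one key."""
    name = model.strip().lower()
    name = _DATE_SUFFIX.sub("", name)
    name = _DOTTED_VERSION.sub("-", name)
    return name
```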

## 6. Budget enforcement

Two mechanisms:

1. **Write-time check** (race-safe): `append_usage(entry, budget_ceiling=X)`
   takes an `fcntl.flock` on a sidecar, sums the current session, and
   refuses the write (raises `BudgetExceeded`) if the call would cross `X`.
2. **Post-write check**: wrappers compare cumulative spend to `CW_BUDGET_USD`
   after logging and raise if over. Used when the wrapper doesn't know the
   ceiling at call time.

Either path stops the agent mid-loop; the LLM call still returns to the
caller, but the next one blocks.
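
The write-time check can be sketched like this. It is an illustration, not the project's `append_usage`: the function name, the `.lock` sidecar name, and the `cost_usd` field are assumptions, and `fcntl` makes it Unix-only.

```python
import fcntl
import json
import os


class BudgetExceeded(RuntimeError):
    """Raised when a write would push cumulative spend past the ceiling."""


def append_usage_sketch(log_path, entry, budget_ceiling):
    """Append `entry` to a JSONL log unless it would cross `budget_ceiling`."""
    lock_path = log_path + ".lock"          # sidecar lock file (hypothetical name)
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)    # blocks until this process owns the lock
        try:
            spent = 0.0
            if os.path.exists(log_path):
                with open(log_path, encoding="utf-8") as log:
                    spent = sum(
                        json.loads(line)["cost_usd"] for line in log if line.strip()
                    )
            projected = spent + entry["cost_usd"]
            if projected > budget_ceiling:
                raise BudgetExceeded(f"{projected:.4f} > {budget_ceiling:.4f}")
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(json.dumps(entry) + "\n")
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Because the sum and the append happen under the same exclusive lock, two processes cannot both pass the check against a stale total.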

## 7. Code audit (AST)

```
python3 scripts/cost_watchdog.py audit path/to/agent.py
```

Walks the AST and reports:

- **CRITICAL** — `while True` with an LLM call and no `max_iterations`-style bound.
- **CRITICAL** — function that recurses and calls an LLM API with no depth argument.
- **HIGH** — plain `while` that calls an API with no retry/iteration counter.
- **MEDIUM** — LLM call missing `max_tokens` / `max_completion_tokens`.
- **MEDIUM** — function with ≥5 sequential LLM calls (batching candidate).

Every finding has a file line number. No more `count('def ') > 3 and
count('self.') > 5 → "recursion detected"` false positives.
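
To show what an AST-based check looks like in practice, here is a deliberately crude sketch of the first rule. It flags any `while True` that contains a call and no `break` or `return`. Unlike the real auditor it does not try to recognise LLM calls or `max_iterations`-style bounds, and the function name is hypothetical.

```python
import ast


def find_unbounded_while_true(source):
    """Return line numbers of `while True` loops with a call and no break/return."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.While):
            continue
        is_while_true = isinstance(node.test, ast.Constant) and node.test.value is True
        if not is_while_true:
            continue
        # Every node nested anywhere inside the loop body.
        body_nodes = [n for stmt in node.body for n in ast.walk(stmt)]
        has_call = any(isinstance(n, ast.Call) for n in body_nodes)
        has_exit = any(isinstance(n, (ast.Break, ast.Return)) for n in body_nodes)
        if has_call and not has_exit:
            findings.append(node.lineno)
    return sorted(findings)
```

Because the check works on parsed nodes, each finding carries the loop's `lineno`, which string counting cannot provide.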

## 8. Detection — "what model is the agent using?"

```
python3 scripts/cost_watchdog.py detect
```

Five probe layers, ranked by confidence:

| Probe | Confidence |
|---|---|
| OpenClaw session JSONL | high |
| Claude Code session JSONL | high |
| Most recent usage-log entry | high |
| Claude Code `settings.json` | medium |
| Env vars (`ANTHROPIC_MODEL`, `OPENAI_MODEL`, ...) | medium |

Emits a table or `--json`.
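
As an illustration of the lowest-ranked probe and of ordering results by confidence, here is a small sketch. The result shape and function names are assumptions, and only the two environment variables named in the table are checked.

```python
import os

ENV_VARS = ("ANTHROPIC_MODEL", "OPENAI_MODEL")  # the two names quoted in the table
CONFIDENCE_ORDER = {"high": 0, "medium": 1}


def probe_env(environ=None):
    """Return one medium-confidence result per model env var that is set."""
    environ = os.environ if environ is None else environ
    return [
        {"probe": f"env:{name}", "model": environ[name], "confidence": "medium"}
        for name in ENV_VARS
        if environ.get(name)
    ]


def rank(results):
    """Order probe results so high-confidence hits come first."""
    return sorted(results, key=lambda r: CONFIDENCE_ORDER[r["confidence"]])
```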

## 9. Files

| Path | Purpose |
|---|---|
| `scripts/_pricing.py` | Router: picks LiteLLM / OpenRouter / static per query. |
| `scripts/_sources.py` | Three `PricingSource` classes + disk cache + circuit breaker. |
| `scripts/tokenizer.py` | Provider-aware token counting (tiktoken for OpenAI; calibrated heuristics for others). |
| `scripts/model_canon.py` | `canonical_family()` — collapses model variants. |
| `scripts/code_audit.py` | AST cost-risk walker. |
| `scripts/usage_log.py` | JSONL writer + rotation + aggregation. |
| `scripts/tracker.py` | SDK wrappers + streaming + budget enforcement. |
| `scripts/http_capture.py` | `install_global_capture()` — httpx transport hook. |
| `scripts/openclaw_tailer.py` | Watches OpenClaw sessions. |
| `scripts/detect_model.py` | Multi-layer detector. |
| `scripts/errors.py` | `errors.jsonl` writer + reader. |
| `scripts/io_utils.py` | `write_json_atomic` / `read_json`. |
| `scripts/refresh_pricing.py` | Regenerates static `pricing.md` from live sources. |
| `scripts/cost_watchdog.py` | Unified CLI dispatcher. |
| `references/pricing.md` | Static fallback (regenerated; ~2600 models). |
| `tests/test_cost_watchdog.py` | 73 tests: router, cache, AST, tokenizer, rotation, cassettes, circuit breaker, canonicalization. |

## 10. Quality checklist

- [x] Live pricing from LiteLLM + OpenRouter, 24h-cached, with static fallback.
- [x] Exact-match model lookup (no substring conflation).
- [x] Multi-modal (token / image / second / query / page / character).
- [x] Unit-aware alternatives (never compares tokens to images).
- [x] AST-based code audit with line numbers.
- [x] Provider-aware tokenization (no more tiktoken-for-Claude).
- [x] Variance-based confidence (no `+= 0.05` theater).
- [x] Atomic writes to all shared state files.
- [x] `fcntl.flock`-guarded budget check-and-log (no race).
- [x] Circuit breaker on flaky networks (no 5s hang per call).
- [x] Streaming capture via SDK wrappers; streams flagged in `errors.jsonl` via HTTP capture.
- [x] Daily log rotation + date-scoped aggregation.
- [x] Canonical model families (variants collapse in reports).
- [x] `errors.jsonl` surfaces silent failures; `cost_watchdog errors` shows them.
- [x] Cassette tests for LiteLLM + OpenRouter parse paths (schema-drift safety net).
- [x] 73 logic tests passing.

## 11. Known limits (be honest)

- **Tokenizer heuristics** for Claude/Gemini/etc. are calibrated from docs,
  not measured. Run `cost_watchdog validate-tokens <model>` to check drift
  against the provider's authoritative count when you have an API key.
- **`install_global_capture()` can't see streaming responses** — httpx exposes
  an empty body until the user reads the stream. Use `track_openai` /
  `track_anthropic` for stream coverage; `http_capture` logs skipped streams
  to `errors.jsonl` so the gap is visible.
- **Non-httpx SDKs** (older Cohere, boto3 with custom transport) need the
  per-SDK wrappers — HTTP capture won't see them.
- **LiteLLM community data** can lag 24-48h on brand-new models. OpenRouter's
  API is truly live for anything it routes.

## 12. Testing

```
python3 -m unittest tests.test_cost_watchdog     # 73 tests
python3 scripts/code_audit.py test_risky_code.py # sample risks
python3 scripts/cost_watchdog.py report          # current spend summary
```


---
name: llm-cost-watchdog
description: monitors real-time llm api costs, detects runaway loops, enforces budgets, audits code risk, and reports usage across multiple providers and models.
---

llm cost watchdog 💰

intent

cost watchdog tracks spending across llm providers in real time, blocks runaway loops before they drain your budget, audits code for cost risks, and surfaces usage patterns across openai, anthropic, google, groq, openrouter, and 30+ other providers. use this when you're running agents, batch processing, or any workflow that talks to multiple llm apis and you need visibility into spend, assurance that you won't wake up to a $2400 bill, and a way to detect which model your agent is actually calling.

inputs

environment variables:

CW_PRICE_TTL_SECONDS (default: 86400) - how long to cache pricing data from live sources before hitting the network again. set to 0 to disable caching.
CW_OFFLINE (optional, set to 1) - if present, never make network calls for pricing. falls back to disk cache and static reference only.
CW_STATIC_ONLY (optional, set to 1) - skip live sources entirely and use only the bundled static pricing. used in tests and airgapped environments.
CW_LOG_DIR (default: ~/.cost-watchdog) - filesystem path where usage logs, error logs, and cache files are stored.
CW_BUDGET_USD (optional) - spending ceiling in usd. if set, the skill will block any llm call that would push the session total over this amount.

external connections:

litellm api - provides live pricing for 2600+ models across 30+ providers. cached for 24h. requires internet access.
openrouter api - authoritative source for openrouter/* models. also acts as a permissive fallback when litellm is unavailable.
provider sdks - anthropic, openai, google generativeai, cohere, groq, bedrock (boto3), mistral, deepseek, together, fireworks, cerebras, anyscale. the skill wraps these to intercept calls and log spend.
httpx - modern python http library. global capture hook watches all httpx traffic as a last-resort tracking mechanism.

filesystem:

~/.cost-watchdog/usage.jsonl - append-only log of every llm call. rotates daily to usage.YYYY-MM-DD.jsonl.
~/.cost-watchdog/errors.jsonl - swallowed exceptions, stream detection gaps, tokenization mismatches.
~/.cost-watchdog/pricing_cache.json - cached pricing from litellm and openrouter.
references/pricing.md - bundled static fallback for 2600 models. regenerated via refresh_pricing.py.

code context:

python 3.9+
ast module (stdlib)
fcntl (stdlib, unix/linux only; windows uses a polyfill)
tiktoken (for openai tokenization)
anthropic, openai, google-generativeai, litellm (optional imports; degrades gracefully if missing)

procedure

step 1: initialize tracking for your provider

input: sdk instance (openai client, anthropic client, etc.) or agent environment

output: wrapped client ready to log all calls to usage.jsonl

choose one:

openai-compatible (covers openai, openrouter, groq, deepseek, mistral, together, fireworks, cerebras, anyscale):
```
from cost_watchdog.tracker import track_openai
client = track_openai(client)  # wraps the client in place
```
this automatically injects stream_options={"include_usage": True} so streaming calls are metered too.
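
a minimal sketch of what that injection could look like, assuming the wrapper only touches calls made with stream=True. inject_stream_usage and fake_create are hypothetical names used for illustration; the real track_openai also tees the stream iterator and logs usage.

```python
import functools


def inject_stream_usage(create):
    """Wrap a `create`-style callable so streamed calls also report usage."""
    @functools.wraps(create)
    def wrapper(*args, **kwargs):
        if kwargs.get("stream"):
            # Keep any options the caller set; only add include_usage if absent.
            options = dict(kwargs.get("stream_options") or {})
            options.setdefault("include_usage", True)
            kwargs["stream_options"] = options
        return create(*args, **kwargs)
    return wrapper


def fake_create(**kwargs):
    # Stand-in for client.chat.completions.create; echoes what it received.
    return kwargs
```

non-streaming calls pass through unchanged, since their usage already arrives in the response body.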

anthropic:
```
from cost_watchdog.tracker import track_anthropic
client = track_anthropic(client)
```
wraps messages.stream() and messages.create().

google gemini:
```
from cost_watchdog.tracker import track_gemini
model = track_gemini(model_id)
```

cohere:
```
from cost_watchdog.tracker import track_cohere
client = track_cohere(client)
```

bedrock:
```
from cost_watchdog.tracker import track_bedrock
client = track_bedrock(client)
```

openclaw agent (zero code changes):
```
python3 scripts/openclaw_tailer.py --watch &
```
runs in background, reads completed turns from openclaw session jsonl, logs to usage.jsonl.

any modern python sdk via httpx (fallback):
```
from cost_watchdog.tracker import install_global_capture
install_global_capture()
```
hooks the httpx transport globally. note: does not capture streaming responses (body is empty until user reads). use per-sdk wrappers instead.

step 2: check pricing for a model

input: model identifier (string, e.g. gpt-4-turbo, claude-3-5-sonnet-20241022, gemini-2-0-flash)

output: pricing dict with input cost, output cost, cache cost, source, and age

```
python3 scripts/cost_watchdog.py price gpt-4-turbo
```

returns:

```
{
  "model": "gpt-4-turbo",
  "input_cost_per_1m_tokens": 10.0,
  "output_cost_per_1m_tokens": 30.0,
  "cache_read_cost_per_1m_tokens": 2.5,
  "cache_write_cost_per_1m_tokens": 30.0,
  "unit": "token",
  "source": "litellm",
  "cached_at": "2025-01-15T14:22:00Z",
  "cache_age_seconds": 3600
}
```

routing logic: openrouter/* models → openrouter api. anything else → litellm api (live, cached 24h) → openrouter (as permissive fallback) → bundled static pricing.

if network is unavailable and cache miss: uses static fallback. if CW_OFFLINE=1: skips network entirely.
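
the routing rule can be sketched as an ordered list of sources tried in turn. this is an illustration only: resolve_price and source_chain are hypothetical names, the fetchers are passed in as plain callables, and the disk cache layer is left out for brevity.

```python
def source_chain(model, openrouter, litellm):
    """Pick the live-source order from the routing rule described above."""
    if model.startswith("openrouter/"):
        return [("openrouter", openrouter)]
    return [("litellm", litellm), ("openrouter", openrouter)]


def resolve_price(model, live_sources, static_table, offline=False):
    """Try each live source in order, then fall back to the static table."""
    if not offline:
        for name, fetch in live_sources:
            try:
                price = fetch(model)
            except OSError:
                continue  # network trouble: fall through to the next source
            if price is not None:
                return {"source": name, **price}
    if model in static_table:
        return {"source": "static", **static_table[model]}
    raise KeyError(f"no pricing found for {model!r}")
```

passing offline=True mirrors CW_OFFLINE=1: the live sources are never called and only the static table is consulted.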

step 3: estimate cost for a workflow

input: model id, number of iterations or calls, rough token count per call

output: projected spend and breakdown

```
python3 scripts/cost_watchdog.py estimate gpt-4-turbo --iterations 100 --input-tokens 500 --output-tokens 1000
```

returns:

```
{
  "model": "gpt-4-turbo",
  "iterations": 100,
  "input_tokens_per_call": 500,
  "output_tokens_per_call": 1000,
  "total_input_tokens": 50000,
  "total_output_tokens": 100000,
  "estimated_cost_usd": 3.50,
  "breakdown": {
    "input": 0.50,
    "output": 3.00
  }
}
```
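
the arithmetic behind those numbers is tokens times the per-million price, summed over input and output. the sketch below reproduces the example using the gpt-4-turbo prices from step 2 (10.0 and 30.0 usd per 1m tokens); estimate_cost is a hypothetical helper, not the cli's internal function.

```python
def estimate_cost(iterations, input_tokens, output_tokens, input_per_1m, output_per_1m):
    """Project spend for `iterations` identical calls at per-1M-token prices."""
    total_in = iterations * input_tokens
    total_out = iterations * output_tokens
    input_cost = total_in * input_per_1m / 1_000_000
    output_cost = total_out * output_per_1m / 1_000_000
    return {
        "total_input_tokens": total_in,
        "total_output_tokens": total_out,
        "breakdown": {"input": input_cost, "output": output_cost},
        "estimated_cost_usd": input_cost + output_cost,
    }


result = estimate_cost(100, 500, 1000, input_per_1m=10.0, output_per_1m=30.0)
# result["estimated_cost_usd"] == 3.5  (0.5 input + 3.0 output)
```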

step 4: find cheaper alternatives

input: model id

output: list of models in the same unit with lower or comparable cost

```
python3 scripts/cost_watchdog.py alternatives claude-3-5-sonnet-20241022
```

returns alternatives ranked by output cost, with same unit (tokens to tokens, never tokens to images). includes cost delta and speed (if known).

step 5: audit agent code for cost risks

input: python file path

output: list of risk findings with severity, line number, and remediation hint

```
python3 scripts/cost_watchdog.py audit path/to/agent.py
```

scans the ast and detects:

critical: while True with an llm call and no iteration bound (max_iterations, max_retries).
critical: function that recurses and calls an llm api with no depth argument.
high: plain while loop calling an api with no counter or max iterations.
medium: llm call missing max_tokens or max_completion_tokens.
medium: function with 5+ sequential llm calls (batching candidate).

example output:

```
audit_results:
  - severity: CRITICAL
    line: 42
    rule: while_true_with_api_no_bound
    context: "while True: response = client.messages.create(...)"
    hint: add max_iterations or break condition
  - severity: MEDIUM
    line: 88
    rule: missing_max_tokens
    context: "client.messages.create(model=..., messages=...)"
    hint: add max_tokens to cap output length
```

step 6: detect which model the agent is using

input: none (reads from env, config files, session logs)

output: model identifier with confidence level

```
python3 scripts/cost_watchdog.py detect
python3 scripts/cost_watchdog.py detect --json
```

probes in order:

openclaw session jsonl (high confidence).
claude code session jsonl (high confidence).
most recent entry in ~/.cost-watchdog/usage.jsonl (high confidence).
claude code settings.json (medium confidence).
env vars like ANTHROPIC_MODEL, OPENAI_MODEL, GOOGLE_MODEL (medium confidence).

step 7: view session totals

input: optional time window (defaults to current day)

output: summary of calls, tokens, cost, and top models

```
python3 scripts/cost_watchdog.py session
python3 scripts/cost_watchdog.py session --since 2025-01-10 --until 2025-01-15
```

returns:

```
{
  "window": {
    "since": "2025-01-15T00:00:00Z",
    "until": "2025-01-15T23:59:59Z"
  },
  "totals": {
    "calls": 342,
    "input_tokens": 125000,
    "output_tokens": 45000,
    "cost_usd": 18.75
  },
  "top_models": [
    {
      "family": "gpt-4-turbo",
      "calls": 210,
      "cost_usd": 12.50
    },
    {
      "family": "claude-3-5-sonnet",
      "calls": 132,
      "cost_usd": 6.25
    }
  ]
}
```

model names are canonicalized (claude-haiku-4-5-20251001, claude-haiku-4-5, claude-haiku-4.5 appear as one row).
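
a sketch of how per-call log entries could roll up into that shape, assuming each entry carries model, input_tokens, output_tokens, and cost_usd. summarize is a hypothetical helper; the canonical argument stands in for canonical_family() and defaults to the identity function here.

```python
from collections import defaultdict


def summarize(entries, canonical=lambda name: name):
    """Aggregate usage entries into session totals and a per-family ranking."""
    totals = {"calls": 0, "input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
    per_family = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0})
    for entry in entries:
        totals["calls"] += 1
        totals["input_tokens"] += entry["input_tokens"]
        totals["output_tokens"] += entry["output_tokens"]
        totals["cost_usd"] += entry["cost_usd"]
        family = per_family[canonical(entry["model"])]
        family["calls"] += 1
        family["cost_usd"] += entry["cost_usd"]
    # Most expensive family first.
    top_models = sorted(
        ({"family": name, **stats} for name, stats in per_family.items()),
        key=lambda row: row["cost_usd"],
        reverse=True,
    )
    return {"totals": totals, "top_models": top_models}
```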

step 8: view time-windowed spend report

input: optional granularity (24h, 7d, 30d windows)

output: spend per window with top model per window

```
python3 scripts/cost_watchdog.py report
python3 scripts/cost_watchdog.py report --window 7d
```

returns:

```
{
  "windows": [
    {
      "start": "2025-01-15T00:00:00Z",
      "end": "2025-01-15T23:59:59Z",
      "cost_usd": 18.75,
      "calls": 342,
      "top_model": "gpt-4-turbo"
    },
    {
      "start": "2025-01-14T00:00:00Z",
      "end": "2025-01-14T23:59:59Z",
      "cost_usd": 22.10,
      "calls": 401,
      "top_model": "gpt-4-turbo"
    }
  ]
}
```

step 9: tail live usage in real time

input: none (watches session jsonl)

output: formatted log of each completed llm call

```
python3 scripts/cost_watchdog.py tail
python3 scripts/cost_watchdog.py tail --once
```

emits:

```
[2025-01-15T14:22:45Z] gpt-4-turbo: 1250 in, 340 out → $0.023
[2025-01-15T14:22:47Z] claude-3-5-sonnet-20241022: 890 in, 120 out → $0.004
```

use --once to read current state and exit (no tailing).

step 10: validate token counting

input: model id, optionally a sample message or file

output: comparison of our heuristic vs. provider's true count

```
python3 scripts/cost_watchdog.py validate-tokens gpt-4-turbo
python3 scripts/cost_watchdog.py validate-tokens claude-3-5-sonnet-20241022 --message "hello world"
```

sends a real api call (you must have credentials) and compares tokenization. reports drift if our heuristic is off by >5%. helps catch tokenizer calibration issues.
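
the drift check can be sketched as a relative difference against the provider's count. both function names are hypothetical, and measuring drift relative to the authoritative count (rather than the heuristic) is an assumption.

```python
def token_drift(heuristic_count, authoritative_count):
    """Relative error of the heuristic against the provider's token count."""
    if authoritative_count <= 0:
        raise ValueError("authoritative_count must be positive")
    return abs(heuristic_count - authoritative_count) / authoritative_count


def exceeds_threshold(heuristic_count, authoritative_count, threshold=0.05):
    """True when drift is strictly greater than the threshold (5% by default)."""
    return token_drift(heuristic_count, authoritative_count) > threshold
```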

step 11: check for swallowed errors

input: optional limit (defaults to 50)

output: recent silent failures and tracking gaps

```
python3 scripts/cost_watchdog.py errors
python3 scripts/cost_watchdog.py errors --limit 20
```

includes:

exceptions caught and not re-raised.
streaming responses that httpx could not capture.
tokenization mismatches.
api auth failures.

example:

```
{
  "timestamp": "2025-01-15T14:22:45Z",
  "type": "stream_not_captured",
  "context": "openai streaming response",
  "message": "httpx global capture cannot see streaming body. use track_openai wrapper."
}
```

step 12: reset logs

input: optional --all flag

output: cleared usage/error logs

```
python3 scripts/cost_watchdog.py reset              # clears today's log
python3 scripts/cost_watchdog.py reset --all        # clears all rolled logs too
```

idempotent. useful after testing or to start fresh tracking.

step 13: validate and refresh static pricing

input: none (fetches from litellm and openrouter)

output: regenerated references/pricing.md with 2600+ models

```
python3 scripts/refresh_pricing.py
```

queries litellm api and openrouter api, merges results, validates against schema, writes to disk. aborts if fewer than 100 rows returned (protects against clobbering during a network outage). run this weekly or after major provider price changes.

decision points

if you're running openclaw agents: use openclaw_tailer.py --watch in background. zero code changes. covers all model calls and streaming automatically.

else if you're using openai-compatible sdk (openai, openrouter, groq, deepseek, mistral, together, fireworks, cerebras, anyscale): call track_openai(client) once at startup. covers streaming via auto-injected stream_options.

else if you're using anthropic sdk directly: call track_anthropic(client) at startup. wraps both messages.create() and messages.stream().

else if you're using google gemini, cohere, or bedrock: use the corresponding track_* wrapper for your provider. note: these wrappers do not yet cover streaming; add your own stream wrapper if you need stream metering.

else if you're using a modern python sdk via httpx (fallback): call install_global_capture() once. warning: it does not capture streaming responses; skipped streams are logged to errors.jsonl so the gap is visible. if your sdk streams, use the per-provider wrapper instead.

if litellm and openrouter apis are unavailable (network down, rate limited, or CW_OFFLINE=1): fall back to the disk cache. if the cache is stale or has no entry for the model, use the bundled static pricing in references/pricing.md.

if you set CW_BUDGET_USD: cost watchdog will raise BudgetExceeded exception and block the next llm call once session spend crosses the ceiling. the current call still returns to the caller; only subsequent calls are blocked. check logs to see which call triggered the limit.

if a model is brand new (hours old): litellm may not have it yet, since its community data can lag 24-48h. openrouter's api is the fastest source for new models. if both miss, the static fallback is likely stale too: rerun refresh_pricing.py once the live sources list the model, or check the provider's pricing page manually.

if you're auditing code and see high/medium findings: not all are blockers. missing max_tokens is a best practice, not a guarantee of runaway cost. while loops with api calls need human judgment: a retry loop with exponential backoff and max_retries is safe; a loop that recurses infinitely is not. use findings as signal, not law.

if token count validation shows drift >5%: the tokenizer heuristic for that provider is off. either recalibrate it in tokenizer.py, or set max_tokens lower as a safety margin.

if errors.jsonl shows stream_not_captured: your code is using streaming and httpx is the fallback. switch to the per-sdk wrapper (e.g. track_openai, track_anthropic) to get accurate stream metering.

output contract

all commands write structured json or plaintext to stdout. all persistent state goes to ~/.cost-watchdog/ unless CW_LOG_DIR is set.

usage.jsonl (append-only, rotates daily):

{
  "timestamp": "2025-01-15T14:22:45.123Z",
  "model": "gpt-4-turbo",
  "provider": "openai",
  "input_tokens": 1250,
  "output_tokens": 340,
  "cache_read_tokens": 0,
  "cache_write_tokens