clawhub

Clean Log Toolkit

Local log file inspection and analysis toolkit. Parse common log formats (apache-common, apache-combined, nginx-access, syslog, JSON-line) or custom regex wi...

SkillRank score: 8.3 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

clean-log-toolkit parses common log formats (apache, nginx, syslog, json-line) into structured csv/tsv/jsonl, aggregates errors by level and time bucket with message fingerprinting, and provides log-aware grep with time windows and context filters. pure python 3 stdlib, no dependencies.

| criterion | score |
|---|---|
| structure | 9.0 |
| trigger phrases | 9.0 |
| procedure | 9.0 |
| edge cases | 7.0 |
| documentation | 8.0 |


---
name: clean-log-toolkit
description: Local log file inspection and analysis toolkit. Parse common log formats (apache-common, apache-combined, nginx-access, syslog, JSON-line) or custom regex with named groups into structured TSV/CSV/JSONL. Aggregate errors by level and time bucket (minute/hour/day), surface the most common error groups via fingerprint normalization, and produce JSON/Markdown/CSV reports. Grep log lines with optional time-window (--since/--until), level filter, named-group regex, and -B/-A/-C context lines. Pure Python 3 standard library, no third-party dependencies, no remote calls.
license: MIT
metadata: {"openclaw":{"requires":{"bins":["python3"]},"primaryEnv":null,"homepage":"https://clawhub.ai/gopendrasharma89-tech/clean-log-toolkit"}}
---

# clean-log-toolkit

v0.2.0

A small, honest local toolkit for the work agents end up doing constantly: read a log someone sent you, figure out the format, find the actual problems, and produce a summary you can paste into a ticket. Built on Python 3 standard library only. No `awk`/`sed`/`jq` wrappers, no pip installs, no remote calls.

This skill is the third of the "clean-*" trio:
- [`clean-csv-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-csv-toolkit) — structured tabular data
- [`clean-text-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-text-toolkit) — unstructured text
- **`clean-log-toolkit`** — semi-structured timestamped logs

## What this skill does

- `scripts/parse.py` — parse a log file into structured rows. Auto-detects `apache-common`, `apache-combined`, `nginx-access`, `syslog`, and `json-line` formats by sniffing the first ~50 lines. Falls back to a generic timestamp + level + message extractor when nothing matches. Pass `--regex PATTERN` with named groups to define a custom format. Output as `.csv`, `.tsv`, or `.jsonl`.
- `scripts/errors.py` — aggregate the errors in a log file. Counts by level (WARN / ERROR / FATAL by default), buckets the timeline by minute / hour / day, normalizes each message into a "fingerprint" (replaces numbers, UUIDs, hex tokens, file:line pairs, and embedded timestamps with placeholders) and surfaces the top-N most frequent error groups. Writes a JSON / Markdown / CSV report or prints a one-screen summary.
- `scripts/grep.py` — grep, but log-aware. Combine `--pattern REGEX`, `--not-pattern REGEX`, `--level LVL[,LVL2...]`, `--since TIMESTAMP`, `--until TIMESTAMP`, and `-B / -A / -C` context lines into one filter pass.
- `scripts/follow.py` (NEW in v0.2.0) — `tail -F` equivalent, log-aware. Streams new lines as they arrive with the same `--pattern` / `--not-pattern` / `--level` / `--since` filters as `grep.py`. Detects log rotation automatically (inode change or file truncation reopens the file). `--max-events N` exits cleanly after N matched events (CI-friendly); `--timeout SECONDS` exits on inactivity; `--json` emits per-line envelopes with extracted timestamp + level.
- `scripts/check_deps.sh` — verify `python3` is available.

## What this skill does not do

- Live tail-and-follow is no longer on this list: `scripts/follow.py` (added in v0.2.0) covers it.
- It does not call any LLM, web service, or remote API.
- It does not write outside the input/output paths the caller provides.

## Quick start

### 1. Parse an unknown log file

```bash
# Auto-detect the format
python3 scripts/parse.py app.log app.csv

# Or be explicit
python3 scripts/parse.py access.log out.jsonl --format apache-combined
python3 scripts/parse.py syslog.txt out.csv --format syslog
python3 scripts/parse.py events.log out.csv --format json-line --fields ts,level,msg
```

### 2. Custom format via named-group regex

```bash
python3 scripts/parse.py app.log structured.csv \
    --regex '^(?P<ts>\S+)\s+(?P<level>\S+)\s+(?P<message>.*)$'
```
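
To make the mapping from named groups to output columns concrete, here is a minimal Python sketch of what such a regex extracts from a single line. The `parse_line` helper and the sample log line are illustrative assumptions, not code taken from `parse.py`.

```python
import re

# Same pattern as the --regex example above.
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>\S+)\s+(?P<message>.*)$")


def parse_line(line):
    """Return a dict of named groups, or None when the line does not match.

    Hypothetical helper for illustration only.
    """
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


row = parse_line("2026-05-10T10:00:00Z ERROR Database connection lost\n")
print(row)
```

Each named group becomes one key, which is presumably how the group names end up as column headers in the CSV / TSV / JSONL output.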

### 3. Aggregate errors and produce a report

```bash
# One-screen summary
python3 scripts/errors.py app.log

# Bucket by minute, top 20 message groups
python3 scripts/errors.py app.log --bucket minute --top 20

# Only count specific levels
python3 scripts/errors.py app.log --level ERROR,FATAL

# Write a Markdown report ready to paste into a ticket
python3 scripts/errors.py app.log --output report.md

# Or a JSON report for downstream tooling
python3 scripts/errors.py app.log --output report.json --bucket hour

# Or a CSV of the timeline only
python3 scripts/errors.py app.log --output timeline.csv --bucket minute
```

`errors.py` fingerprints messages so that repeated errors differing only in numbers / UUIDs / file-line refs collapse to one group with a count. Example: 50 occurrences in total of `Connection timeout to 10.0.0.5 after 1234ms` and `Connection timeout to 10.0.0.7 after 567ms` collapse into one group, `Connection timeout to <N>.<N>.<N>.<N> after <N>ms`, with count 50.
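
The fingerprinting idea can be sketched in a few lines. This is a simplified approximation, not the implementation in `errors.py`: the regexes, the `<UUID>` and `<HEX>` placeholder names, and the `fingerprint` function are assumptions, and file:line pairs and embedded timestamps are left out for brevity. Only the `<N>` placeholder is taken from the example above.

```python
import re
from collections import Counter

# Assumed patterns; the real normalizer may differ.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
HEX_RE = re.compile(r"\b0x[0-9a-fA-F]+\b")
NUM_RE = re.compile(r"\d+")


def fingerprint(message):
    """Replace variable tokens with placeholders (hypothetical helper)."""
    # UUIDs and hex tokens go first so their digits are not split into <N> runs.
    message = UUID_RE.sub("<UUID>", message)
    message = HEX_RE.sub("<HEX>", message)
    return NUM_RE.sub("<N>", message)


messages = [
    "Connection timeout to 10.0.0.5 after 1234ms",
    "Connection timeout to 10.0.0.7 after 567ms",
]
groups = Counter(fingerprint(m) for m in messages)
print(groups.most_common(1))
```

Both sample messages land in the single group `Connection timeout to <N>.<N>.<N>.<N> after <N>ms`, here with a count of 2.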

### 4. Log-aware grep

```bash
# Pattern + level filter
python3 scripts/grep.py app.log --pattern "Database" --level ERROR,FATAL

# Time window
python3 scripts/grep.py app.log \
    --since "2026-05-10T10:00:00Z" \
    --until "2026-05-10T11:00:00Z"

# Context lines (1 before + 1 after each match)
python3 scripts/grep.py app.log --pattern "FATAL" -C 1 --with-line

# Exclude noisy lines while keeping the rest
python3 scripts/grep.py app.log --level ERROR --not-pattern "heartbeat"

# Invert: keep everything that does NOT match
python3 scripts/grep.py app.log --pattern "INFO" --invert
```

`--since` and `--until` accept the same timestamp formats `parse.py` understands: ISO 8601 (`2026-05-10T10:00:00Z`, `2026-05-10 10:00:00`, with or without microseconds / timezone), apache-style (`10/May/2026:10:00:00 +0000`), and syslog (`May 10 10:00:00`, current year assumed). Since v0.1.1 they also accept date-only values such as `2026-05-09`, `2026/05/09`, and `09/05/2026` (European day-first).

## Exit codes

| Code | Meaning |
|---|---|
| 0 | success / one or more rows / one or more matches |
| 1 | parse produced zero rows / grep found zero matches / errors found zero matching log entries |
| 2 | bad arguments / unsafe path / missing input / bad regex / unknown format / unsupported output extension |

This 0 / 1 / 2 split is consistent across `parse.py`, `errors.py`, and `grep.py`, so they slot into shell pipelines cleanly:

```bash
# Parse to JSONL, then summarize errors, then print the ticket-ready report
python3 scripts/parse.py raw.log structured.jsonl \
  && python3 scripts/errors.py raw.log --output ticket.md \
  && cat ticket.md
```

## Safety properties

- Pure Python 3 standard library. No third-party dependencies.
- No `subprocess` calls. No shell invocation.
- All file paths are validated against a strict allowlist regex that rejects shell metacharacters. The same `safe_path()` helper used in `clean-csv-toolkit` and `clean-text-toolkit`.
- Scripts only read the input paths the caller provides and write to the output paths the caller provides.
- All inputs default to UTF-8; reads fall back through `utf-8-sig`, `cp1252`, `latin-1` if needed. Writes are always UTF-8.
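
The path allowlist in the list above can be pictured with a short sketch. The source does not show the actual regex used by `safe_path()`, so the character set below (letters, digits, `_`, `.`, `/`, `-`) is an assumption, as is the `is_safe_path` helper.

```python
import re

# Assumed allowlist: anything outside this set, including every shell
# metacharacter and whitespace, is rejected.
SAFE_PATH_RE = re.compile(r"[A-Za-z0-9_./-]+")


def is_safe_path(path):
    """True when the whole path consists of allowlisted characters."""
    return SAFE_PATH_RE.fullmatch(path) is not None


def safe_path(path):
    """Return the path unchanged, or raise ValueError (sketch only)."""
    if not is_safe_path(path):
        raise ValueError(f"unsafe path: {path!r}")
    return path


for candidate in ["logs/app.log", "app.log; rm -rf /", "$(whoami).log"]:
    print(candidate, is_safe_path(candidate))
```

An allowlist rejects anything not explicitly permitted, so there is no need to enumerate every dangerous character. `fullmatch` is used rather than `match` with `$` so a trailing newline cannot slip through.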

## Timestamp + level detection

`_common.py` ships a pragmatic timestamp parser that tries the following formats in order, picking the first that matches:

```
2026-05-10T10:00:00.123456+00:00     (ISO 8601 with TZ + microseconds)
2026-05-10T10:00:00+00:00            (ISO 8601 with TZ)
2026-05-10T10:00:00.123Z             (ISO 8601 UTC Zulu, fractional seconds)
2026-05-10T10:00:00Z                 (ISO 8601 UTC Zulu)
2026-05-10T10:00:00                  (ISO 8601 no TZ)
2026-05-10 10:00:00                  (space-separated)
2026/05/10 10:00:00                  (slash-separated)
10/May/2026:10:00:00 +0000           (apache common log)
May 10 10:00:00                      (syslog, no year)
```
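
A "try each format in order" parser can be sketched with `datetime.strptime`. This is a rough approximation of the behavior described above, not the code in `_common.py`. The `FORMATS` list, the `now` parameter, and the way the syslog year is filled in are assumptions. It relies on Python 3.7+, where `%z` accepts both `+00:00` and `Z`.

```python
from datetime import datetime

# Tried top to bottom; the first format that parses wins.
FORMATS = [
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d %H:%M:%S",
    "%d/%b/%Y:%H:%M:%S %z",
]


def parse_timestamp(text, now=None):
    """Return a datetime, or None when no known format matches (sketch)."""
    text = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    # Syslog timestamps carry no year, so prepend the current one.
    year = (now or datetime.now()).year
    try:
        return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


print(parse_timestamp("2026-05-10T10:00:00.123Z"))
print(parse_timestamp("10/May/2026:10:00:00 +0000"))
print(parse_timestamp("not a timestamp"))
```

Month names parsed by `%b` depend on the process locale, so a production parser may want its own month table.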

Levels are detected case-insensitively from these tokens and folded to canonical names: `TRACE`, `DEBUG`, `INFO`, `NOTICE`, `WARN` (from WARN/WARNING), `ERROR` (from ERROR/ERR), `FATAL` (from FATAL/CRITICAL/CRIT/EMERG/EMERGENCY).
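
The folding table above translates directly into a lookup. The sketch below is an assumption about how such detection could work, not the code in `_common.py`: it takes the first level token found anywhere in the line, and the `extract_level` signature is illustrative.

```python
import re

# Alias -> canonical name, as listed in the paragraph above.
CANONICAL = {
    "TRACE": "TRACE", "DEBUG": "DEBUG", "INFO": "INFO", "NOTICE": "NOTICE",
    "WARN": "WARN", "WARNING": "WARN",
    "ERROR": "ERROR", "ERR": "ERROR",
    "FATAL": "FATAL", "CRITICAL": "FATAL", "CRIT": "FATAL",
    "EMERG": "FATAL", "EMERGENCY": "FATAL",
}
# Word boundaries keep WARN from matching inside WARNING, ERR inside ERROR, etc.
LEVEL_RE = re.compile(r"\b(" + "|".join(CANONICAL) + r")\b", re.IGNORECASE)


def extract_level(line):
    """Return the canonical level of the first token found, else None."""
    match = LEVEL_RE.search(line)
    return CANONICAL[match.group(1).upper()] if match else None


print(extract_level("2026-05-10 10:00:00 warning: disk at 91%"))
print(extract_level("[crit] out of memory"))
print(extract_level("GET /index.html 200"))
```

A token-anywhere search is deliberately loose: a message that merely mentions the word "error" would be classified as `ERROR` under this sketch.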

## Known limitations

- The regex-based parsers are pragmatic, not strict — they accept slightly malformed Apache / nginx / syslog lines as long as the structure is close enough.
- `errors.py` fingerprint normalization is a best-effort heuristic. Two semantically different errors that happen to differ only in numbers / hashes will be collapsed; if that matters, use `--top` with a larger N and inspect the samples.
- `parse.py` does not follow a live log file. For tail-follow, use `scripts/follow.py` (added in v0.2.0), or pipe `tail -F file | ...` into your own tool.

## Pairs well with

- [`clean-csv-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-csv-toolkit) — pipe `parse.py` output (CSV / JSONL) into `inspect`, `validate`, `pivot`, or `transform` to turn raw logs into reportable tables.
- [`clean-text-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-text-toolkit) — pair `parse.py` with `text-toolkit/redact.py` to scrub PII before sharing log dumps.

## v0.2.0 changes

- Added `scripts/follow.py`: live tail-and-follow with log-aware filtering. Same `--pattern` / `--not-pattern` / `--level` / `--since` filters as `grep.py`. Automatic log-rotation detection (inode change or truncation triggers a transparent reopen). `--max-events N` and `--timeout SECONDS` make it CI-friendly; `--json` emits one envelope per matched line with extracted timestamp + level. Closes the live-follow limitation documented in v0.1.x.

## v0.1.1 changes

- Fixed timestamp parser: `--since` and `--until` on `grep.py` now accept date-only values like `2026-05-09`, `2026/05/09`, and `09/05/2026` (European). Previously only full ISO 8601 timestamps were accepted, so users trying to filter by a calendar date got a `could not parse --since` error.

## v0.1.0 changes

- First public release of clean-log-toolkit.
- Three scripts: `parse.py`, `errors.py`, `grep.py`.
- Shared `_common.py` with `safe_path`, `iter_lines`, `parse_timestamp`, `extract_timestamp`, `extract_level` helpers (mirrors the design of `clean-csv-toolkit/scripts/_common.py` and `clean-text-toolkit/scripts/_common.py`).
- Auto-detects 5 log formats by sniffing the first 50 lines.
- Zero third-party dependencies; works on any system that ships Python 3.

## License

MIT


clean-log-toolkit

v0.2.0

intent

clean-log-toolkit reads a local log file, auto-detects the format (apache-common, apache-combined, nginx-access, syslog, json-line, or custom regex), and outputs structured data (CSV/TSV/JSONL) or error summaries. use it when you need to parse unknown log formats, aggregate errors by level and time bucket, grep with log-aware filters (timestamp + level + pattern), or tail a live log file with rotation detection. built on Python 3 standard library only, no external dependencies or remote calls.

inputs

log file path (required)

local file, readable by the calling user. assumed UTF-8; falls back through utf-8-sig, cp1252, latin-1 if needed.
edge case: empty files produce exit code 1 (zero rows / zero matches).
edge case: very large files (>1 GB) are read line-by-line to avoid memory spike.

output file path (required for parse.py; optional for errors.py, where it is passed as --output; optional for grep.py and follow.py)

must be a valid filesystem path without shell metacharacters. validated against strict allowlist regex.
extension determines output format: .csv, .tsv, .jsonl, .md, .json. if not provided, output goes to stdout.

--format (optional, parse.py only)

one of: apache-common, apache-combined, nginx-access, syslog, json-line. if omitted, auto-detected by sniffing first 50 lines. if auto-detection fails, falls back to generic timestamp + level + message extraction.

--regex PATTERN (optional, parse.py only)

custom regex with named groups (e.g., (?P<ts>\S+), (?P<level>\S+), (?P<message>.*)). takes precedence over --format. if provided, auto-detection is skipped.

--fields (optional, parse.py only)

comma-separated list of named group names to include in output (e.g., ts,level,msg). if omitted, all groups are included.

--level (optional, errors.py, grep.py, follow.py)

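comma-separated list of log levels to include (case-insensitive). canonical names: TRACE, DEBUG, INFO, NOTICE, WARN, ERROR, FATAL. default for errors.py: WARN,ERROR,FATAL; grep.py and follow.py apply no level filter when --level is omitted. while a level filter is active, a line with no detected level is skipped.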

--bucket (optional, errors.py only)

one of: minute, hour, day. groups the error timeline into buckets. default: hour.

--top (optional, errors.py only)

integer, number of top-N error fingerprints to report. default: 10.

--pattern REGEX (optional, grep.py, follow.py)

only include lines matching this regex (case-sensitive). applied after level filter.

--not-pattern REGEX (optional, grep.py, follow.py)

exclude lines matching this regex (case-sensitive). applied after --pattern.

--since TIMESTAMP (optional, grep.py, follow.py)

include only lines on or after this timestamp. accepts iso 8601 (2026-05-10T10:00:00Z), space-separated (2026-05-10 10:00:00), apache-style (10/May/2026:10:00:00 +0000), syslog-style (May 10 10:00:00), or date-only (2026-05-09, 2026/05/09, 09/05/2026). if --since is date-only, it starts at 00:00:00 of that day.

--until TIMESTAMP (optional, grep.py, follow.py)

include only lines on or before this timestamp. same formats as --since. if --until is date-only, it includes through 23:59:59 of that day.

-B / -A / -C (optional, grep.py only)

context lines. -B N prints N lines before each match, -A N prints N lines after, -C N prints N before + N after. default: 0 (no context).
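
the context options amount to widening each match into a window of line indices. the sketch below shows one way to do that while printing each line only once when windows overlap; the `context_indices` helper and its 0-based indexing are assumptions, not code from grep.py.

```python
def context_indices(match_indices, total_lines, before=0, after=0):
    """Return the sorted, de-duplicated line indices to print (sketch).

    match_indices: 0-based indices of matching lines.
    before / after: the -B and -A values (-C N sets both to N).
    """
    selected = set()
    for i in match_indices:
        start = max(0, i - before)               # clamp at start of file
        end = min(total_lines - 1, i + after)    # clamp at end of file
        selected.update(range(start, end + 1))
    return sorted(selected)


# Matches on lines 3 and 5 with -C 1: the windows 2..4 and 4..6 share line 4.
print(context_indices([3, 5], total_lines=10, before=1, after=1))
```

collecting indices in a set before sorting is what prevents the shared line from appearing twice.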

--with-line (optional, grep.py only)

prepend line number to each output line.

--invert (optional, grep.py only)

invert the match: include lines that do NOT match --pattern. if --not-pattern is also provided, that still applies.

--max-events N (optional, follow.py only)

exit cleanly after N matched events. useful for CI pipelines.

--timeout SECONDS (optional, follow.py only)

exit on inactivity. if no new lines arrive for SECONDS, the process exits with code 0. useful for log rotation or end-of-stream detection.

--json (optional, follow.py only)

emit one JSON envelope per matched line, with extracted timestamp and level. default: plain text.

--output PATH (optional, errors.py only)

write report to file instead of stdout. format inferred from extension: .md, .json, .csv.

python3 binary

must be available in PATH. version 3.7+. verify with python3 --version.

procedure

parse.py: parse a log file into structured rows

input: log file path, optionally --format, --regex, --fields, output path
validate: check input path is readable, output path is safe (no shell metacharacters).
sniff format: if --regex is provided, skip sniffing and use custom pattern. otherwise, if --format is provided, use it directly. otherwise, read first 50 lines and try to match apache-common, apache-combined, nginx-access, syslog, json-line patterns in that order.
extract rows: iterate through all lines in the log file. for each line, apply the detected or custom regex and extract named groups into a dict. skip lines that do not match the pattern.
filter fields: if --fields is provided, keep only those columns. otherwise, include all matched named groups.
serialize: write all rows to output file (or stdout if no output path) in the format determined by output extension (.csv, .tsv, or .jsonl).
output: success returns exit code 0 if >=1 row extracted. zero rows extracted returns exit code 1. invalid args / bad regex / unsafe path returns exit code 2.
output location: output file (e.g., structured.csv) or stdout.

errors.py: aggregate errors by level, time bucket, and fingerprint

input: log file path, optionally --level, --bucket, --top, --output
validate: check input path is readable, output path is safe (if provided).
parse all lines: iterate through the log file. for each line, extract timestamp and level using the same heuristic as grep.py (does not need auto-format detection; generic extraction is sufficient).
filter by level: if --level is provided, keep only lines with those levels. default: WARN,ERROR,FATAL.
bucketize: group each line by its timestamp bucketed into the requested granularity (minute, hour, or day). within each bucket, track count of lines.
fingerprint normalization: for each message, replace numbers (including floats, negatives), uuids (8-4-4-4-12 hex), hex tokens (runs of hex digits), file:line pairs (e.g., foo.py:123), and embedded timestamps with placeholders. two messages that differ only in these values are collapsed into one fingerprint group.
rank: sort all fingerprints by frequency (descending). keep top --top (default 10).
format report: if --output is provided, write in the format of the extension (.md for markdown, .json for structured json, .csv for timeline). if no --output, print a one-screen summary to stdout.
output: success returns exit code 0 if >=1 matching log entry found. zero matches returns exit code 1. invalid args / unsafe path returns exit code 2.
output location: output file (e.g., report.md) or stdout.
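
the bucketize step above boils down to truncating each timestamp to the chosen granularity and counting. a minimal sketch, assuming string bucket keys; the `bucket_key` and `timeline` names and the key format are illustrative, not taken from errors.py.

```python
from collections import Counter
from datetime import datetime

# One strftime pattern per granularity; the key format is an assumption.
BUCKET_FORMATS = {
    "minute": "%Y-%m-%d %H:%M",
    "hour": "%Y-%m-%d %H:00",
    "day": "%Y-%m-%d",
}


def bucket_key(ts, unit="hour"):
    """Truncate a datetime to the start of its minute / hour / day bucket."""
    return ts.strftime(BUCKET_FORMATS[unit])


def timeline(timestamps, unit="hour"):
    """Count entries per bucket."""
    return Counter(bucket_key(ts, unit) for ts in timestamps)


stamps = [
    datetime(2026, 5, 10, 10, 0, 5),
    datetime(2026, 5, 10, 10, 59, 0),
    datetime(2026, 5, 10, 11, 1, 0),
]
print(sorted(timeline(stamps, "hour").items()))
```

with the default hour bucket, the first two entries share `2026-05-10 10:00` and the third falls into `2026-05-10 11:00`.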

grep.py: grep with log-aware filters

input: log file path, optionally --pattern, --not-pattern, --level, --since, --until, -B/-A/-C, --with-line, --invert, output path
validate: check input path is readable. validate regex patterns compile. validate --since and --until parse as valid timestamps.
parse timestamps: extract timestamp from each line using the same generic heuristic as errors.py. if timestamp cannot be parsed, skip the line (no match).
filter by level: if --level is provided, keep only lines with those levels. if --level is omitted, no level filtering.
filter by time window: if --since is provided, skip lines with timestamp < --since. if --until is provided, skip lines with timestamp > --until.
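filter by pattern: if --pattern is provided, keep only lines in which the pattern matches (case-sensitive; quick-start examples such as --pattern "FATAL" imply a search anywhere in the line rather than a whole-line match). if --invert is set, invert this filter (keep lines that do NOT match).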
filter by not-pattern: if --not-pattern is provided, skip lines matching this pattern (case-sensitive).
add context: if -B/-A/-C is provided, for each matching line, include N surrounding lines before / after (or both). context lines are printed even if they do not match --pattern.
format output: if --with-line is set, prepend line number to each line. output to stdout or file (if output path provided).
output: success returns exit code 0 if >=1 match found. zero matches returns exit code 1. invalid args / bad regex / bad timestamp / unsafe path returns exit code 2.
output location: stdout or output file.
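
the filter steps above can be combined into a single predicate. this is a sketch under assumptions, not grep.py itself: `keep_line` is a hypothetical name, the timestamp and level are passed in already extracted, and patterns are applied with `re.search` (a match anywhere in the line).

```python
import re
from datetime import datetime


def keep_line(line, ts, level, *, levels=None, since=None, until=None,
              pattern=None, not_pattern=None, invert=False):
    """Apply the documented filters in order; True means the line is kept."""
    if ts is None:                      # no parseable timestamp: never a match
        return False
    if levels is not None and level not in levels:
        return False
    if since is not None and ts < since:
        return False
    if until is not None and ts > until:
        return False
    if pattern is not None:
        matched = re.search(pattern, line) is not None
        if matched == invert:           # drop non-matches, or matches when inverted
            return False
    if not_pattern is not None and re.search(not_pattern, line):
        return False
    return True


ts = datetime(2026, 5, 10, 10, 30)
print(keep_line("ERROR Database down", ts, "ERROR",
                levels={"ERROR", "FATAL"}, pattern="Database"))
print(keep_line("ERROR heartbeat missed", ts, "ERROR", not_pattern="heartbeat"))
```

the first call keeps the line and the second drops it. returning early on the first failing check keeps the whole filter to a single pass per line.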

follow.py: tail -F equivalent with log-aware filtering

input: log file path, optionally --pattern, --not-pattern, --level, --since, --max-events, --timeout, --json
validate: check input path is readable. validate regex patterns compile.
open file: open the log file for reading. record the inode and file size.
tail from end (or --since): if --since is not provided, seek to end of file and start reading new lines. if --since is provided, scan from beginning and skip lines with timestamp < --since, then resume tailing.
loop: block on new data (read loop). for each new line, apply filters (--level, --pattern, --not-pattern) as in grep.py.
emit matched lines: for each match, either print plain text or emit a json envelope (if --json is set) with fields: timestamp, level, message, line_number.
detect rotation: after each read, check if the inode has changed or file size has decreased (truncation). if so, close the file, reopen it, and resume tailing.
exit conditions: exit cleanly (code 0) when --max-events is reached, or --timeout seconds elapse without new data, or EOF is reached (unlikely in tail mode).
output: all matches printed to stdout or file.
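
the rotation check described above compares what is on disk now with what was recorded when the file was opened. a minimal sketch using `os.stat`; the `has_rotated` helper and its handling of a briefly missing path are assumptions, not the logic in follow.py.

```python
import os
import tempfile


def has_rotated(path, open_inode, read_offset):
    """True when `path` no longer refers to the file being read (sketch)."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        # Mid-rotation the path can be briefly absent; keep the old handle
        # and check again on the next poll (an assumption of this sketch).
        return False
    # A new inode means the file was replaced; a smaller size means truncation.
    return st.st_ino != open_inode or st.st_size < read_offset


# Demo: truncating the file in place makes its size drop below our offset.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "app.log")
    with open(log_path, "w", encoding="utf-8") as fh:
        fh.write("line one\nline two\n")
    st = os.stat(log_path)
    unchanged = has_rotated(log_path, st.st_ino, st.st_size)
    with open(log_path, "w", encoding="utf-8") as fh:
        fh.write("x\n")
    truncated = has_rotated(log_path, st.st_ino, st.st_size)

print(unchanged, truncated)
```

the demo prints `False True`: nothing changed on the first check, and the rewrite in place left the file shorter than the recorded offset on the second.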

check_deps.sh: verify python3 is available

input: none
check: run which python3 or python3 --version. if successful, exit 0. if not found, print diagnostic and exit 2.

decision points

Auto-format detection vs. explicit --format vs. custom --regex

if --regex is provided, skip sniffing and use the custom pattern (parse.py).
else if --format is provided, use it directly without sniffing.
else sniff the first 50 lines and try apache-common, apache-combined, nginx-access, syslog, json-line in order. use the first match.
if sniffing finds no match, fall back to generic timestamp + level + message extraction.
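
the sniff-then-fall-back flow can be sketched as an ordered list of detectors run over a sample. this is a simplified illustration, not parse.py: only two of the five formats are shown, the syslog regex is an approximation, the rule that every sampled line must match is an assumption, and "generic" is a made-up label for the fallback extractor.

```python
import json
import re

# Approximate syslog prefix: "May 10 10:00:00 host ..."
SYSLOG_RE = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \S+ ")


def looks_like_json(line):
    """True when the line is a single JSON object."""
    try:
        return isinstance(json.loads(line), dict)
    except ValueError:
        return False


# Checked in order; the first detector accepting the whole sample wins.
DETECTORS = [
    ("syslog", lambda line: SYSLOG_RE.match(line) is not None),
    ("json-line", looks_like_json),
]


def sniff_format(lines, sample_size=50):
    """Return a format name, or "generic" when nothing matches."""
    sample = [ln for ln in lines[:sample_size] if ln.strip()]
    for name, accepts in DETECTORS:
        if sample and all(accepts(ln) for ln in sample):
            return name
    return "generic"


print(sniff_format(['{"ts": "2026-05-10T10:00:00Z", "level": "INFO"}']))
print(sniff_format(["May 10 10:00:00 host sshd[1]: session opened"]))
print(sniff_format(["free-form text with no structure"]))
```

a stricter or looser acceptance threshold (for example, a majority of sampled lines) would be a reasonable alternative; the source only says the first match is used.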

Empty or malformed input

if the input file is empty or contains zero lines, return exit code 1 (no rows / no matches).
if a line is malformed and does not match the detected format, skip it (do not halt).

Level detection

if a line has no detected level keyword (TRACE, DEBUG, INFO, NOTICE, WARN, WARNING, ERROR, ERR, FATAL, CRITICAL, CRIT, EMERG, EMERGENCY), treat it as level-less. while a level filter is active (an explicit --level, or the errors.py default of WARN,ERROR,FATAL), level-less lines are skipped.
if --level is omitted in grep.py or follow.py, include all lines regardless of level.

Timestamp parsing in grep.py and follow.py

if a line cannot be parsed for a timestamp, skip it (no match in grep.py; do not tail it in follow.py).
if --since or --until is a date-only value (e.g., 2026-05-09), expand it to a time range: --since becomes 2026-05-09T00:00:00, --until becomes 2026-05-09T23:59:59.
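
the date-only expansion above can be sketched as follows. the three formats come from the v0.1.1 notes; the `expand_date_only` helper and its `bound` argument are illustrative assumptions.

```python
from datetime import datetime, time

# Date-only inputs listed in the v0.1.1 changes; the last one is day-first.
DATE_ONLY_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y"]


def expand_date_only(text, bound):
    """Expand a date-only value to a datetime, or return None.

    bound: "since" -> 00:00:00 of that day, "until" -> 23:59:59.
    """
    for fmt in DATE_ONLY_FORMATS:
        try:
            day = datetime.strptime(text, fmt).date()
        except ValueError:
            continue
        clock = time(0, 0, 0) if bound == "since" else time(23, 59, 59)
        return datetime.combine(day, clock)
    return None  # not date-only; a full timestamp parser would take over


print(expand_date_only("2026-05-09", "since"))
print(expand_date_only("09/05/2026", "until"))
```

a full timestamp such as `2026-05-09T10:00:00Z` returns None here, because `strptime` rejects the trailing characters.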

Fingerprint collisions in errors.py

two semantically different errors that differ only in numbers / uuids / hashes will be collapsed into the same fingerprint. if this is a problem, increase --top to inspect more groups or manually grep for specific patterns.

Log rotation in follow.py

if inode changes or file size decreases (truncation), the file has rotated. close the current handle and reopen the file by path.
if the --max-events limit or the --timeout has already been reached, exit cleanly instead of reopening.

Character encoding

all files are assumed UTF-8. if a read fails, fall back through utf-8-sig, cp1252, latin-1 in order. if all three fail, skip the line.
all output is written as UTF-8.
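
the fallback chain above can be sketched as a loop over codecs. this is an assumption about the mechanics, not the toolkit's code: the `decode_line` helper works on the raw bytes of one line, whereas the real scripts may decode per file.

```python
# Documented order: UTF-8 first, then the three fallbacks.
FALLBACK_ENCODINGS = ("utf-8", "utf-8-sig", "cp1252", "latin-1")


def decode_line(raw):
    """Decode bytes with the first codec that accepts them (sketch)."""
    for encoding in FALLBACK_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so in this sketch the line below is
    # only a safeguard and is not expected to run.
    return None


print(decode_line("café".encode("utf-8")))   # valid UTF-8
print(decode_line(b"caf\xe9"))               # invalid UTF-8, valid cp1252
```

both calls yield `café`. the second gets there only after the two UTF-8 variants reject the lone `0xE9` byte.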

Context lines in grep.py

context lines are printed in order (before matches, then the match, then after matches). if two matches overlap in their context windows, do not duplicate lines.
context lines that precede the first match or follow the last match in a file are not printed (only context between and within matches).

Path safety

all input and output paths are validated against a strict allowlist regex that rejects shell metacharacters (backticks, $, |, &, ;, <, >, etc.). unsafe paths cause exit code 2.

output contract

parse.py output

CSV: comma-separated, RFC 4180 compliant. header row with named group names.
TSV: tab-separated. header row with named group names.
JSONL: one JSON object per line, keys are named group names, values are strings.
on zero rows: exit code 1 (no output file written).
on success: exit code 0, output file created with >=1 row.

errors.py output

markdown report (.md): human-readable summary with timestamp, section for each top-N fingerprint, count, and example lines.
json report (.json): structured object with keys: total_entries, total_buckets, bucket_unit (minute/hour/day), top_errors (array of {fingerprint, count, examples, bucket_timeline}).
csv report (.csv): timeline only. columns: bucket, count, fingerprint.
stdout summary (no --output): one-screen text dump with top-N fingerprints and counts.
on zero matches: exit code 1 (no output file written).
on success: exit code 0, output file created or summary printed.

grep.py output

plain text, one matched line per output line.
if --with-line is set, prepend line number (1-indexed) before each line.
if -B/-A/-C is set, context lines are printed in order without modification.
on zero matches: exit code 1 (no output).
on success: exit code 0, one or more lines printed to stdout or file.

follow.py output

plain text: one matched line per output line as it arrives.
if --json is set: one json envelope per matched line, with keys: timestamp, level, message, line_number.
on --timeout or --max-events reached: exit code 0.
on error (invalid regex, bad path): exit code 2.

error cases

invalid arguments (unrecognized flag, bad regex, unparseable timestamp): exit code 2, print diagnostic to stderr.
unsafe file path (shell metacharacters detected): exit code 2, print diagnostic to stderr.
input file not found or not readable: exit code 2, print diagnostic to stderr.
zero rows / zero matches found: exit code 1.
success: exit code 0.

outcome signal

parse.py: the output file (CSV/TSV/JSONL) is created and contains one header row plus N data rows. you can open it in a spreadsheet or pipe it to another tool (clean-csv-toolkit, etc.).
errors.py: the report file (Markdown/JSON/CSV) is created and contains >=1 error group summary. you can paste the Markdown into a ticket, or parse the JSON for downstream tooling, or load the CSV into a spreadsheet.
grep.py: matched lines are printed to stdout or file. each line includes the original log text, context lines if requested, and line number if --with-line is set.
follow.py: new matching lines appear in stdout or file in real-time as the log is written. if --json is set, you see structured envelopes with parsed timestamp and level. when --max-events is reached or --timeout fires, the process exits cleanly (exit code 0).
check_deps.sh: if successful, python3 path and version are printed. if failed, a diagnostic is printed and exit code is