---
name: clean-log-toolkit
description: Local log file inspection and analysis toolkit. Parse common log formats (apache-common, apache-combined, nginx-access, syslog, JSON-line) or custom regex with named groups into structured TSV/CSV/JSONL. Aggregate errors by level and time bucket (minute/hour/day), surface the most common error groups via fingerprint normalization, and produce JSON/Markdown/CSV reports. Grep log lines with optional time-window (--since/--until), level filter, named-group regex, and -B/-A/-C context lines. Pure Python 3 standard library, no third-party dependencies, no remote calls.
license: MIT
metadata: {"openclaw":{"requires":{"bins":["python3"]},"primaryEnv":null,"homepage":"https://clawhub.ai/gopendrasharma89-tech/clean-log-toolkit"}}
---
# clean-log-toolkit
v0.1.1
A small, honest local toolkit for the work agents end up doing constantly: read a log someone sent you, figure out the format, find the actual problems, and produce a summary you can paste into a ticket. Built on Python 3 standard library only. No `awk`/`sed`/`jq` wrappers, no pip installs, no remote calls.
This skill is the third of the "clean-*" trio:
- [`clean-csv-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-csv-toolkit) — structured tabular data
- [`clean-text-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-text-toolkit) — unstructured text
- **`clean-log-toolkit`** — semi-structured timestamped logs
## What this skill does
- `scripts/parse.py` — parse a log file into structured rows. Auto-detects `apache-common`, `apache-combined`, `nginx-access`, `syslog`, and `json-line` formats by sniffing the first ~50 lines. Falls back to a generic timestamp + level + message extractor when nothing matches. Pass `--regex PATTERN` with named groups to define a custom format. Output as `.csv`, `.tsv`, or `.jsonl`.
- `scripts/errors.py` — aggregate the errors in a log file. Counts by level (WARN / ERROR / FATAL by default), buckets the timeline by minute / hour / day, normalizes each message into a "fingerprint" (replaces numbers, UUIDs, hex tokens, file:line pairs, and embedded timestamps with placeholders) and surfaces the top-N most frequent error groups. Writes a JSON / Markdown / CSV report or prints a one-screen summary.
- `scripts/grep.py` — grep, but log-aware. Combine `--pattern REGEX`, `--not-pattern REGEX`, `--level LVL[,LVL2...]`, `--since TIMESTAMP`, `--until TIMESTAMP`, and `-B / -A / -C` context lines into one filter pass. Output goes to stdout or to a file. Returns exit 0 on at least one match, 1 on zero matches.
- `scripts/check_deps.sh` — verify `python3` is available.
## What this skill does not do
- It does not tail/follow live log files (yet — possible v0.2 feature if there's demand).
- It does not call any LLM, web service, or remote API.
- It does not write outside the input/output paths the caller provides.
## Quick start
### 1. Parse an unknown log file
```bash
# Auto-detect the format
python3 scripts/parse.py app.log app.csv
# Or be explicit
python3 scripts/parse.py access.log out.jsonl --format apache-combined
python3 scripts/parse.py syslog.txt out.csv --format syslog
python3 scripts/parse.py events.log out.csv --format json-line --fields ts,level,msg
```
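As an illustration of what "sniffing" the first lines could look like, here is a minimal sketch. The regexes, the `sniff_format` name, and the "more than half of the sample must match" rule are assumptions made for this example; the toolkit's actual patterns and thresholds are not shown in this document, and `nginx-access` is omitted for brevity.

```python
import json
import re

# Hypothetical patterns for illustration; the toolkit's real regexes may differ.
APACHE_COMBINED = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?:\d+|-) "[^"]*" "[^"]*"'
)
APACHE_COMMON = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?:\d+|-)$')
SYSLOG = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \S+ ")


def is_json_object(line):
    """Return True if the line parses as a single JSON object."""
    try:
        return isinstance(json.loads(line), dict)
    except ValueError:
        return False


# Checked in this order; the first format that wins the vote is returned.
CHECKS = [
    ("json-line", is_json_object),
    ("apache-combined", lambda ln: APACHE_COMBINED.match(ln) is not None),
    ("apache-common", lambda ln: APACHE_COMMON.match(ln) is not None),
    ("syslog", lambda ln: SYSLOG.match(ln) is not None),
]


def sniff_format(lines, sample_size=50):
    """Guess a format name from the first non-blank lines, else 'generic'."""
    sample = [ln.strip() for ln in lines[:sample_size] if ln.strip()]
    for name, check in CHECKS:
        hits = sum(1 for ln in sample if check(ln))
        # Assumed rule: more than half of the sample must match.
        if sample and hits * 2 > len(sample):
            return name
    return "generic"
```

A majority vote rather than "every line must match" keeps the guess stable when a file has a few stray or truncated lines near the top.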
### 2. Custom format via named-group regex
```bash
python3 scripts/parse.py app.log structured.csv \
--regex '^(?P<ts>\S+)\s+(?P<level>\S+)\s+(?P<message>.*)$'
```
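To show how named groups map onto output columns, the following sketch applies the same pattern with Python's `re` module. The `parse_lines` helper is hypothetical and only illustrates the idea; it is not the code inside `parse.py`.

```python
import re

# Same pattern as the command above.
PATTERN = re.compile(r"^(?P<ts>\S+)\s+(?P<level>\S+)\s+(?P<message>.*)$")


def parse_lines(lines, pattern=PATTERN):
    """Return (columns, rows); each named group becomes one column."""
    # groupindex maps group name -> group number; sort to get definition order.
    columns = sorted(pattern.groupindex, key=pattern.groupindex.get)
    rows = []
    for line in lines:
        match = pattern.match(line.rstrip("\n"))
        if match is None:
            continue  # lines that do not fit the custom format are skipped here
        rows.append({name: match.group(name) or "" for name in columns})
    return columns, rows
```

In this sketch, non-matching lines are dropped silently; whether the real script skips, counts, or reports them is not stated in this document.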
### 3. Aggregate errors and produce a report
```bash
# One-screen summary
python3 scripts/errors.py app.log
# Bucket by minute, top 20 message groups
python3 scripts/errors.py app.log --bucket minute --top 20
# Only count specific levels
python3 scripts/errors.py app.log --level ERROR,FATAL
# Write a Markdown report ready to paste into a ticket
python3 scripts/errors.py app.log --output report.md
# Or a JSON report for downstream tooling
python3 scripts/errors.py app.log --output report.json --bucket hour
# Or a CSV of the timeline only
python3 scripts/errors.py app.log --output timeline.csv --bucket minute
```
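Conceptually, `--bucket` truncates each timestamp to the chosen granularity and counts entries per truncated value. A minimal sketch of that idea follows; the bucket label formats and the `bucket_counts` name are assumptions, not the script's actual output format.

```python
from collections import Counter
from datetime import datetime

# Assumed label formats; coarser buckets simply drop the finer fields.
BUCKET_FORMATS = {
    "minute": "%Y-%m-%dT%H:%M",
    "hour": "%Y-%m-%dT%H:00",
    "day": "%Y-%m-%d",
}


def bucket_counts(timestamps, bucket="hour"):
    """Count datetimes per minute/hour/day bucket."""
    fmt = BUCKET_FORMATS[bucket]
    return Counter(ts.strftime(fmt) for ts in timestamps)


if __name__ == "__main__":
    stamps = [
        datetime(2026, 5, 10, 10, 1, 5),
        datetime(2026, 5, 10, 10, 1, 40),
        datetime(2026, 5, 10, 11, 30, 0),
    ]
    for label, count in sorted(bucket_counts(stamps, "hour").items()):
        print(label, count)
```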
`errors.py` fingerprints messages so that repeated errors differing only in numbers, UUIDs, or file:line references collapse into one group with a count. For example, if `Connection timeout to 10.0.0.5 after 1234ms`, `Connection timeout to 10.0.0.7 after 567ms`, and similar lines occur 50 times in total, they collapse into a single group `Connection timeout to <N>.<N>.<N>.<N> after <N>ms` with count 50.
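The fingerprinting idea can be sketched with a few ordered regex substitutions. The rules below are assumptions chosen to reproduce the example above; they cover only UUIDs, `0x` hex tokens, and digit runs, and leave out the file:line and embedded-timestamp rules that `errors.py` also applies.

```python
import re
from collections import Counter

# Order matters: replace the most specific tokens first, plain digit runs last.
RULES = [
    (
        re.compile(
            r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
            r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
        ),
        "<UUID>",
    ),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\d+"), "<N>"),
]


def fingerprint(message):
    """Replace variable tokens with placeholders so similar messages group."""
    for pattern, placeholder in RULES:
        message = pattern.sub(placeholder, message)
    return message


def top_groups(messages, n=10):
    """Return the n most frequent fingerprints with their counts."""
    return Counter(fingerprint(m) for m in messages).most_common(n)
```

Because every digit run becomes `<N>`, two genuinely different errors that differ only in a number end up in the same group, which is the collision trade-off that any normalization of this kind makes.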
### 4. Log-aware grep
```bash
# Pattern + level filter
python3 scripts/grep.py app.log --pattern "Database" --level ERROR,FATAL
# Time window
python3 scripts/grep.py app.log \
--since "2026-05-10T10:00:00Z" \
--until "2026-05-10T11:00:00Z"
# Context lines (1 before + 1 after each match)
python3 scripts/grep.py app.log --pattern "FATAL" -C 1 --with-line
# Exclude noisy lines while keeping the rest
python3 scripts/grep.py app.log --level ERROR --not-pattern "heartbeat"
# Invert: keep everything that does NOT match
python3 scripts/grep.py app.log --pattern "INFO" --invert
```
`--since` and `--until` accept the same timestamp formats `parse.py` understands: ISO 8601 (`2026-05-10T10:00:00Z`, `2026-05-10 10:00:00`, with or without microseconds / timezone), apache-style (`10/May/2026:10:00:00 +0000`), and syslog (`May 10 10:00:00`, current year assumed). As of v0.1.1 they also accept date-only values such as `2026-05-09`, `2026/05/09`, and `09/05/2026` (European, day first).
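A time-window filter of this kind boils down to parsing both bounds and comparing each line's timestamp against them. The sketch below handles ISO 8601 input only and treats timezone-less values as UTC, which is an assumption made here to keep comparisons valid; `parse_iso` and `in_window` are hypothetical names.

```python
from datetime import datetime, timezone


def parse_iso(value):
    """Parse an ISO 8601 string; naive values are assumed to be UTC."""
    if value.endswith("Z"):
        # Older Python versions of fromisoformat() do not accept a trailing Z.
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def in_window(ts, since=None, until=None):
    """True if ts lies within [since, until]; a missing bound is open."""
    if since is not None and ts < since:
        return False
    if until is not None and ts > until:
        return False
    return True
```

Both bounds are inclusive in this sketch, so a line stamped exactly at the `--until` value is kept.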
## Exit codes
| Code | Meaning |
|---|---|
| 0 | success / one or more rows / one or more matches |
| 1 | parse produced zero rows / grep found zero matches / errors found zero matching log entries |
| 2 | bad arguments / unsafe path / missing input / bad regex / unknown format / unsupported output extension |
This 0 / 1 / 2 split is consistent across all three scripts so they slot into shell pipelines cleanly:
```bash
# Parse to JSONL, then summarize errors, then print the ticket-ready report
python3 scripts/parse.py raw.log structured.jsonl \
&& python3 scripts/errors.py raw.log --output ticket.md \
&& cat ticket.md
```
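A shell `&&` chain treats exit codes 1 and 2 alike, since both are non-zero. A caller that wants to distinguish "nothing found" from "bad invocation" can inspect the code directly. The sketch below is caller-side code, not part of the toolkit (which makes no `subprocess` calls itself), and the labels in `MEANINGS` are names invented for this example.

```python
import subprocess
import sys

# Labels are illustrative; the numeric meanings come from the table above.
MEANINGS = {0: "ok", 1: "empty", 2: "usage-error"}


def classify(returncode):
    """Map an exit code to a short label using the 0 / 1 / 2 convention."""
    return MEANINGS.get(returncode, "unknown")


def run_and_classify(argv):
    """Run one command without raising on failure and classify its exit code."""
    completed = subprocess.run(argv, check=False)
    return classify(completed.returncode)


# Example call (paths are placeholders):
# run_and_classify([sys.executable, "scripts/grep.py", "app.log", "--pattern", "FATAL"])
```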
## Safety properties
- Pure Python 3 standard library. No third-party dependencies.
- No `subprocess` calls. No shell invocation.
- All file paths are validated against a strict allowlist regex that rejects shell metacharacters. This is the same `safe_path()` helper used in `clean-csv-toolkit` and `clean-text-toolkit`.
- Scripts only read the input paths the caller provides and write to the output paths the caller provides.
- All inputs default to UTF-8; reads fall back through `utf-8-sig`, `cp1252`, `latin-1` if needed. Writes are always UTF-8.
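The encoding fallback in the last bullet can be sketched as a loop over candidate codecs. `decode_with_fallback` is a hypothetical helper, and the explicit BOM stripping is an addition made for this example: a BOM-prefixed file already decodes as plain UTF-8 (keeping U+FEFF), so the sketch removes that character afterwards.

```python
ENCODINGS = ("utf-8", "utf-8-sig", "cp1252", "latin-1")


def decode_with_fallback(data):
    """Decode bytes with the first codec that succeeds; return (text, codec)."""
    for encoding in ENCODINGS:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Drop a leading byte-order mark if one survived decoding.
        return text.lstrip("\ufeff"), encoding
    # Not reachable in practice: latin-1 maps every byte to a character.
    raise ValueError("could not decode input")
```

Since `latin-1` never fails, the chain always yields text, at the cost of possibly mis-rendering bytes from some other legacy encoding.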
## Timestamp + level detection
`_common.py` ships a pragmatic timestamp parser that tries the following formats in order, picking the first that matches:
```
2026-05-10T10:00:00.123456+00:00 (ISO 8601 with TZ + microseconds)
2026-05-10T10:00:00+00:00 (ISO 8601 with TZ)
2026-05-10T10:00:00.123Z (ISO 8601 UTC Zulu)
2026-05-10T10:00:00Z (ISO 8601 UTC Zulu)
2026-05-10T10:00:00 (ISO 8601 no TZ)
2026-05-10 10:00:00 (space-separated)
2026/05/10 10:00:00 (slash-separated)
10/May/2026:10:00:00 +0000 (apache common log)
May 10 10:00:00 (syslog, no year)
```
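A "try each format in order" parser can be sketched with `datetime.strptime`. The format strings below are one possible mapping of the list above and are assumptions, as is the `default_year` parameter; the actual `parse_timestamp` in `_common.py` may be implemented differently. Month names (`%b`) are matched according to the active locale.

```python
from datetime import datetime

# One strptime pattern per documented shape, most specific first.
# %z accepts "+00:00", "+0000", and "Z" on Python 3.7+.
FORMATS = [
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d %H:%M:%S",
    "%d/%b/%Y:%H:%M:%S %z",
]


def parse_timestamp(text, default_year=None):
    """Return a datetime for the first matching format, or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    # Syslog timestamps carry no year, so prepend one before parsing.
    year = default_year if default_year is not None else datetime.now().year
    try:
        return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None
```

Returning `None` instead of raising lets a caller decide whether an unparseable timestamp means "skip the line" or "keep it with an empty timestamp".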
Levels are detected case-insensitively from these tokens and folded to canonical names: `TRACE`, `DEBUG`, `INFO`, `NOTICE`, `WARN` (from WARN/WARNING), `ERROR` (from ERROR/ERR), `FATAL` (from FATAL/CRITICAL/CRIT/EMERG/EMERGENCY).
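The folding described above amounts to an alias table plus a case-insensitive whole-word search. The following sketch uses the aliases listed in the paragraph; the `extract_level` body is an illustration, not the helper shipped in `_common.py`.

```python
import re

# Alias -> canonical level, following the mapping described above.
ALIASES = {
    "TRACE": "TRACE", "DEBUG": "DEBUG", "INFO": "INFO", "NOTICE": "NOTICE",
    "WARN": "WARN", "WARNING": "WARN",
    "ERROR": "ERROR", "ERR": "ERROR",
    "FATAL": "FATAL", "CRITICAL": "FATAL", "CRIT": "FATAL",
    "EMERG": "FATAL", "EMERGENCY": "FATAL",
}

# Longest aliases first so "WARNING" is preferred over "WARN" at the same spot.
LEVEL_RE = re.compile(
    r"\b(" + "|".join(sorted(ALIASES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def extract_level(line):
    """Return the canonical level of the first level token in the line, or None."""
    match = LEVEL_RE.search(line)
    return ALIASES[match.group(1).upper()] if match else None
```

The word boundaries keep words such as "terror" or "information" from being mistaken for `ERROR` or `INFO`.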
## Known limitations
- The regex-based parsers are pragmatic, not strict — they accept slightly malformed Apache / nginx / syslog lines as long as the structure is close enough.
- `errors.py` fingerprint normalization is a best-effort heuristic. Two semantically different errors that happen to differ only in numbers / hashes will be collapsed; if that matters, use `--top` with a larger N and inspect the samples.
- `parse.py` does not follow a live log file. For tail-follow, pipe `tail -F file | ...` into your own tool. If there's enough demand for a built-in follower, it will land in v0.2.
## Pairs well with
- [`clean-csv-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-csv-toolkit) — pipe `parse.py` output (CSV / JSONL) into `inspect`, `validate`, `pivot`, or `transform` to turn raw logs into reportable tables.
- [`clean-text-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-text-toolkit) — pair `parse.py` with `text-toolkit/redact.py` to scrub PII before sharing log dumps.
## v0.1.1 changes
- Fixed timestamp parser: `--since` and `--until` on `grep.py` now accept date-only values like `2026-05-09`, `2026/05/09`, and `09/05/2026` (European). Previously only full ISO 8601 timestamps were accepted, so users trying to filter by a calendar date got a `could not parse --since` error.
## v0.1.0 changes
- First public release of clean-log-toolkit.
- Three scripts: `parse.py`, `errors.py`, `grep.py`.
- Shared `_common.py` with `safe_path`, `iter_lines`, `parse_timestamp`, `extract_timestamp`, `extract_level` helpers (mirrors the design of `clean-csv-toolkit/scripts/_common.py` and `clean-text-toolkit/scripts/_common.py`).
- Auto-detects 5 log formats by sniffing the first 50 lines.
- Zero third-party dependencies; works on any system that ships Python 3.
## License
MIT
by @clawhub
formalized intent, inputs, and three distinct procedures (parse.py, errors.py, grep.py) with explicit decision points covering format auto-detection, encoding fallback, timestamp parsing, filtering, and edge cases like empty results and fingerprint collisions; added output contract with specific file formats, location, and success criteria; added outcome signal showing how users verify execution success via file existence, stdout messages, and exit codes.
local toolkit for reading log files, detecting format, finding actual problems, and producing summaries you can paste into tickets. built on python 3 standard library only. no awk/sed/jq wrappers, no pip installs, no remote calls.
this skill handles the repetitive work of log inspection: auto-detect format (apache-common, apache-combined, nginx-access, syslog, json-line, or custom regex), parse lines into structured rows (csv/tsv/jsonl), aggregate errors by level and time bucket, normalize repeated errors via fingerprint hashing to surface root causes, and grep logs with time-window and level filters. use this when you need to turn an unstructured log dump into actionable summary data for a ticket or incident response without leaving the command line.
- `input_logfile` (required): path to a local log file. must be readable and under 1 GB (no hard limit enforced, but memory use scales linearly with file size). encoding auto-detected: tries utf-8-sig, then cp1252, then latin-1.
- `--format` (optional, parse.py only): one of apache-common, apache-combined, nginx-access, syslog, json-line. if omitted, auto-detects by sniffing first 50 lines. falls back to generic timestamp + level + message extraction if no match.
- `--regex PATTERN` (optional, parse.py only): python regex with named groups (`(?P<name>...)`) to define custom format. named groups must include at least one of: ts (timestamp), level (log level), message (message text). overrides `--format`.
- `--fields FIELD1,FIELD2,...` (optional, parse.py only): comma-separated list of output columns. only applies with `--format json-line`. defaults to all fields found.
- `--bucket {minute,hour,day}` (optional, errors.py only): time granularity for error timeline aggregation. defaults to hour.
- `--level LEVEL1,LEVEL2,...` (optional, errors.py and grep.py): comma-separated log levels to include. case-insensitive. recognized levels: trace, debug, info, notice, warn, warning, error, err, fatal, critical, crit, emerg, emergency. defaults to warn,error,fatal for errors.py; no default for grep.py (matches all levels).
- `--top N` (optional, errors.py only): return top N most frequent error fingerprints. defaults to 10.
- `--output FILE` (optional, errors.py only): write report to file. extension determines format: .json, .md (markdown), .csv. if omitted, prints one-screen summary to stdout.
- `--pattern REGEX` (optional, grep.py only): include lines matching this python regex. combined with `--not-pattern` in same pass (both filters apply).
- `--not-pattern REGEX` (optional, grep.py only): exclude lines matching this regex.
- `--since TIMESTAMP` (optional, grep.py only): include lines on or after this timestamp. accepts iso 8601 (with or without tz/microseconds), space-separated date-time, slash-separated date-time, apache-style (`10/May/2026:10:00:00 +0000`), syslog-style (`May 10 10:00:00`, year assumed current), or date-only (`2026-05-10`, `2026/05/10`, `10/05/2026`).
- `--until TIMESTAMP` (optional, grep.py only): include lines before or at this timestamp. same format as `--since`.
- `-B N` (optional, grep.py only): print N lines before each match.
- `-A N` (optional, grep.py only): print N lines after each match.
- `-C N` (optional, grep.py only): shorthand for `-B N -A N`.
- `--invert` (optional, grep.py only): invert match logic; keep lines that do not match `--pattern`.
- `--with-line` (optional, grep.py only): prepend line number to each output line.
- external connections: none. all operations are local file i/o.
- [...] `.csv`, `.tsv`, `.jsonl`. fail with exit code 2 if unsupported.
- if `--regex` is provided, compile it as a python regex. fail with exit code 2 if invalid regex or no named groups.
- if `--regex` is not provided and `--format` is given, validate against supported formats. fail with exit code 2 if unknown.
- if neither `--regex` nor `--format` is given, open input file and sniff first 50 lines to auto-detect format (test against apache-common, apache-combined, nginx-access, syslog, json-line patterns in that order). if no match, use fallback generic parser.
- if `--fields` is set (json-line only), filter to named columns.
- [...] `,` or `\t` as delimiter, include header row.
- [...] `--level` (default warn,error,fatal).
- [...] `--bucket` (minute/hour/day). truncate timestamp to bucket granularity.
- [...] `<N>`, `<UUID>`, `<HEX>`, `<FILE:LINE>`, `<TS>`).
- if `--output` is omitted, print one-screen summary (level, top 3 fingerprints, min/max timestamp). exit 0 if errors found, exit 1 if none.
- if `--output FILE` is set:
  - `.json`: write dict with structure `{"level": [...], "bucket": "...", "fingerprints": [{"fingerprint": "...", "count": N, "sample": "...", "buckets": {timestamp: count, ...}}, ...], "generated_at": "..."}`.
  - `.md`: write markdown table with columns fingerprint, count, sample, buckets. include header "Error Report" and generation timestamp.
  - `.csv`: write csv with columns level, bucket, fingerprint, count, sample. include header.
- if `--output FILE` is set, validate output path similarly.
- [...] `--pattern` regex if provided. fail with exit code 2 if invalid.
- [...] `--not-pattern` regex if provided. fail with exit code 2 if invalid.
- [...] `--since` timestamp using same logic as parse.py. fail with exit code 2 if unparseable.
- [...] `--until` timestamp similarly. fail with exit code 2 if unparseable.
- [...] `-B`/`-A` context). iterate:
- [...] `--since` or `--until` is set).
- [...] `--level` is set and level not in set).
- [...] `--pattern` (skip if not provided or doesn't match). test against `--not-pattern` (skip if provided and matches).
- [...] `--since` is set and timestamp < since, or `--until` is set and timestamp > until.
- [...] `--output FILE`.
- if `--with-line` is set, prepend 1-indexed line number to each output line.
- if `--regex` is provided, use custom regex and skip format detection. else if `--format` is provided, use named format (validate it exists). else sniff first 50 lines against apache-common, apache-combined, nginx-access, syslog, json-line in that order; use first match. if no match, fall back to generic timestamp + level + message extractor.
- [...] `--since`), treat line as missing timestamp and skip (or for parse.py, use empty string).
- if `--level` is not provided in errors.py, default to warn,error,fatal. if `--level` is not provided in grep.py, no filtering (match all levels). comparison is case-insensitive and folds aliases (warn/warning, error/err, fatal/critical/crit/emerg/emergency).
- [...] `--top` with a larger N and inspect samples manually, or re-run with `--regex` to extract a more precise discriminator.
- [...] `parse.py ... && errors.py ...` stops on zero rows.
- if `--since` or `--until` lacks timezone and matches a format that should have one (iso 8601 with tz, apache-style with tz), assume utc. for syslog-style (month day hh:mm:ss, no year or tz), assume current year and local timezone. if ambiguous (e.g., `2026-05-10` with no tz), assume utc.
- [...] `^[a-z0-9_./:-]+$` equivalent) to reject shell metacharacters, symlink traversal (`../`, `..`), and suspicious patterns. all file i/o uses standard python `open()`, no subprocess calls.
- [...] `--fields` specified. minimum: timestamp, level, message. apache-combined adds remote_addr, http_method, request_path, http_status, response_bytes, referer, user_agent. nginx-access similar. json-line retains all fields.
- [...] `""` in csv/tsv, `null` in jsonl).
- [...] `--output FILE`.
- [...] `--output`), or json/markdown/csv file (if `--output` specified).
- [...] `--output`): human-readable one-screen output. example: "ERROR: 523 total, top 3 fingerprints: [1] Connection timeout... (234x) [2] Disk full... (145x) [3] Auth failed... (92x). Time range: 2026-05-10T08:00:00Z to 2026-05-10T15:30:00Z".
- [...] `--output report.json`): dict with keys levels (list of level strings found), bucket (granularity), fingerprints (list of dicts with fingerprint, count, sample, buckets sub-dict), time_range (dict with min, max), generated_at (iso 8601 timestamp).
- [...] `--output report.md`): formatted as markdown table. example: `"# Error Report\n\n|Fingerprint|Count|Sample|Time Buckets|\n|---|---|---|---|\n|Connection timeout...|234|Connection timeout to 10.0.0.5...|{2026-05-10T10:00:00Z: 5, ...}|\n..."`.
- [...] `--output report.csv`): header `level,bucket,fingerprint,count,sample`. one row per (level, bucket, fingerprint, count) tuple. sample is truncated to ~200 chars to fit.
- [...] `-B`/`-A`/`-C`) interleaved with matches.
- [...] `--with-line`, format is "NNNN: [...]
- [...] `--output FILE`.
- if `--output` file is set, file exists and contains structured report (json/markdown/csv). exit code is 0.