Local CSV / TSV / JSONL inspection and cleanup toolkit. Profile, validate, deduplicate, diff, preview (head/tail/random sample), filter with a safe predicate...
---
name: clean-csv-toolkit
description: Local CSV / TSV / JSONL inspection and cleanup toolkit. Profile, validate, deduplicate, diff, preview (head/tail/random sample), filter with a safe predicate language, sort, concat, merge (inner/left/right/outer joins), pivot (group-by aggregations + wide cross-tabs), transform (derived columns with safe expression evaluator: profit = revenue - cost, name = upper(first) + last, year(date), coalesce()), and convert between csv/tsv/jsonl/json/markdown. Pure Python 3 standard library, no pandas, no remote calls. Handles 100k+ rows in under 1 second.
license: MIT
metadata: {"openclaw":{"requires":{"bins":["python3"]},"primaryEnv":null,"homepage":"https://clawhub.ai/gopendrasharma89-tech/clean-csv-toolkit"}}
---
# clean-csv-toolkit
v0.5.0
A small honest toolkit for the work agents end up doing constantly: read a CSV someone sent you, work out what's in it, clean it up, and forward only the safe rows downstream. Built on Python 3 standard library only. No `pandas`, no `numpy`, no pip installs, no remote calls.
## What this skill does
- `scripts/inspect.py` — profile a `.csv` / `.tsv` / `.jsonl` file: row count, auto-detected column types (`int`, `float`, `bool`, `date`, `datetime`, `string`, `empty`), null counts per column, distinct value counts (capped), three sample values per column, file size, and detected encoding.
- `scripts/validate.py` — check the file against a small JSON schema (required columns, per-column type, min/max, enum, regex, unique). Exits 0/1 so it slots into CI.
- `scripts/dedupe.py` — remove duplicate rows by full-row match or by key columns. Optional `--keep first|last`, `--case-insensitive`, `--trim`, and a JSONL report of every removed row.
- `scripts/diff.py` — compare two files by key column(s) and classify every row as added / removed / changed / unchanged, with a per-column before/after diff for changed rows.
- `scripts/convert.py` — convert between CSV, TSV, JSON Lines, JSON array, and GitHub-flavored Markdown table.
- `scripts/head.py` (NEW in v0.2.0) — print the first N rows in csv / tsv / jsonl / md / aligned format, with optional column subset.
- `scripts/tail.py` (NEW in v0.2.0) — print the last N rows using a streaming ring buffer (works on multi-gigabyte files without loading them).
- `scripts/sample.py` (NEW in v0.2.0) — pick a uniformly random sample of N rows via reservoir sampling. Single-pass, O(N) memory, optional `--seed` for reproducibility, optional `--preserve-order` to keep original row order.
- `scripts/merge.py` (NEW in v0.3.0) — join two files on one or more key columns. Supports `inner` / `left` / `right` / `outer` joins, separate key names per side via `--left-on` and `--right-on`, and duplicate-column disambiguation via `--suffix-left` / `--suffix-right`. Streams the LEFT side, indexes the RIGHT side (peak memory ≈ size of right file).
- `scripts/pivot.py` (NEW in v0.3.0) — group-by aggregations and wide pivot tables. Aggregations: `count`, `sum`, `avg`/`mean`, `min`, `max`, `first`, `last`, `nunique`. Set `--pivot-on COL` to produce a wide cross-tab (e.g. region × product, sum of revenue). Numeric-aware `--sort-by` so `--sort-by revenue_sum --desc` orders correctly.
- `scripts/filter.py` (NEW in v0.4.0) — keep rows that match a safe predicate (`amount > 100`, `status in pending,approved`, `email =~ @example\.com$`, `name is_not_empty`). Supports `==`, `!=`, `<`, `<=`, `>`, `>=`, `=~` (regex), `in`, `contains`, `is_empty` / `is_not_empty` / `is_number` / `is_not_number`. `and` / `or` / parentheses / `not`. NO Python eval — a hand-rolled tokenizer + recursive-descent parser. Optional `--invert`, `--limit`, `--columns`.
- `scripts/sort.py` (NEW in v0.4.0) — stable, type-aware sort. Auto-detects which columns are numeric and sorts them numerically (so `1200 > 899 > 100 > 50`, not `"50" > "1200"`). Per-column direction with `--by amount:desc,region:asc`. Optional `--case-insensitive`, `--limit`, `--numeric` (force numeric on all sort cols).
- `scripts/concat.py` (NEW in v0.4.0) — stack files vertically (UNION ALL). Default mode unions the headers of all inputs; `--strict` requires identical headers; `--add-source COL` tags each row with its source filename; `--dedupe` drops exact-duplicate rows across inputs. Streams one file at a time.
- `scripts/transform.py` (NEW in v0.5.0) — add, modify, rename, drop, cast, or keep columns. Derived columns via a safe expression language (no `eval`): `--add 'profit = revenue - cost'`, `--add 'full_name = upper(first) + " " + upper(last)'`, `--add 'year = year(signup_date)'`, `--add 'safe = coalesce(value, default)'`. Built-in functions: upper, lower, strip, len, abs, round, int, float, str, replace, split, join, coalesce, year, month, day. Chainable with `--cast COL:int|float|bool|string`, `--rename OLD=NEW`, `--drop COL[,...]`, `--keep COL[,...]`.
- `scripts/check_deps.sh` — verify `python3` is available.
## What this skill does not do
- It does not call any LLM, web service, or remote API.
- It does not load a full dataframe into memory just to do simple structural work; the helpers stream rows where possible.
- It does not write outside the input/output paths the caller provides.
- It does not do statistical analysis (mean, percentile, correlation). For that, use a dataframe library.
- It does not parse Excel files (.xls / .xlsx). Export to CSV first.
## Required dependencies
```bash
bash scripts/check_deps.sh
```
Only `python3` is required. The skill uses `csv`, `json`, `re`, `pathlib`, `argparse`, `datetime`, `collections` — all stdlib.
## Workflows
### 0. Quickly preview an unknown CSV (NEW in v0.2.0)
```bash
# First 10 rows in a clean aligned table
python3 scripts/head.py mystery.csv
# Last 5 rows of a multi-GB log
python3 scripts/tail.py huge.csv -n 5
# A reproducible random sample for spot-checking
python3 scripts/sample.py customers.csv -n 20 --seed 42
# Preview only specific columns
python3 scripts/head.py customers.csv --columns id,email,status
# Emit a previewable Markdown table for an agent's reply
python3 scripts/head.py customers.csv -n 5 --as md
```
All three scripts accept `-n N`, `--as csv|tsv|jsonl|md|aligned`, `--output file`, and `--columns col1,col2,...`. `sample.py` additionally accepts `--seed INT` and `--preserve-order`. Default output format is `aligned` — a fixed-width text table sized to the actual data, which is what an agent usually wants to show inline. Default `N` is 10.
Streaming guarantees:
- `head.py` reads at most N+1 rows from the file.
- `tail.py` keeps a bounded `deque(maxlen=N)` and emits only the last N rows.
- `sample.py` uses reservoir sampling (algorithm R): single pass, O(N) memory regardless of file size.
On a 100,000-row / 1.6 MB CSV: `head -n 3` runs in ~50 ms, `tail -n 3` in ~180 ms, `sample -n 5` in ~260 ms.
### 1. Profile an unknown CSV
```bash
python3 scripts/inspect.py customers.csv
```
Output:
```
file: /path/customers.csv
size: 284 B (284 bytes)
encoding: utf-8
kind: csv
rows: 5
columns: 6
# name type nulls null% distinct sample
----------------------------------------------------------------------------------------------------
1 id int 0 0.00 5 '1', '2', '3'
2 email string 0 0.00 5 'alice@example.com', ...
3 name string 0 0.00 5 'Alice', 'Bob', 'Carol'
4 amount float 1 20.00 4 '42.50', '100.00', '7.25'
5 status string 0 0.00 3 'approved', 'pending', ...
6 signup_date date 0 0.00 5 '2025-01-15', ...
```
Pass `--json` for machine-readable output that pipes into other tools.
The script auto-detects the dialect (CSV vs TSV vs JSON Lines) and a sensible encoding (`utf-8`, `utf-8-sig`, `cp1252`, `latin-1`). Type inference takes up to 1000 non-empty values per column and picks the most specific type that fits all of them.
### 2. Validate against a schema
Write a `schema.json`:
```json
{
"required_columns": ["id", "email", "amount", "status"],
"columns": {
"id": {"type": "int", "required": true, "unique": true, "min": 1},
"email": {"type": "string", "required": true, "regex": ".+@.+\\..+"},
"amount": {"type": "float", "min": 0, "max": 100000},
"status": {"type": "string", "enum": ["pending", "approved", "rejected"]},
"signup_date": {"type": "date"}
}
}
```
Then:
```bash
python3 scripts/validate.py customers.csv --schema schema.json
```
A clean file exits 0 with `verdict: pass`. A bad file exits 1 with a detailed error table:
```
row column kind detail
------------------------------------------------------------------------------------------------
2 email regex_mismatch value did not match regex | value='not-an-email'
2 amount bad_type value does not match type 'float' | value='abc'
3 amount below_min value -50.0 < min 0 | value='-50.00'
3 status not_in_enum value not in allowed set | value='unknown_status'
4 id duplicate_unique value already seen earlier in this column | value='1'
```
Pass `--json` for a structured report and `--max-errors N` to cap collection on huge files.
### 3. Remove duplicates
By full-row match (any two rows identical in every column):
```bash
python3 scripts/dedupe.py messy.csv clean.csv
```
By a key column (only one canonical row per `id`):
```bash
python3 scripts/dedupe.py messy.csv clean.csv --key id \
--removed-report removed.jsonl
```
`--keep first` (default) keeps the earlier-occurring row; `--keep last` keeps the later one — useful when later rows are corrections. `--case-insensitive` and `--trim` normalise key values before comparison so `" alice@example.com"` and `"ALICE@example.com"` collapse to one row.
The `--removed-report` writes one JSON object per removed row, with the original 1-based row index, the key tuple that was duplicated, and the full row, so the dedup decision is auditable.
### 4. Diff two files
```bash
python3 scripts/diff.py customers_old.csv customers_new.csv --key id
```
Output:
```
added: 1
removed: 1
changed: 1
--- ADDED (1) ---
+ 6
--- REMOVED (1) ---
- 4
--- CHANGED (1) ---
~ 2
amount: '100.00' -> '150.00'
status: 'pending' -> 'approved'
```
Multi-column keys are supported: `--key customer_id,date`. Exit codes are 0 if the files are identical on the key columns, 1 if they differ — so this also works as a CI guard ("fail the build if the snapshot file changed").
### 5. Convert between formats
```bash
python3 scripts/convert.py data.csv data.jsonl # row -> JSON Lines
python3 scripts/convert.py data.jsonl data.csv # back
python3 scripts/convert.py data.csv data.json --pretty
python3 scripts/convert.py data.csv data.md # GitHub-flavored table
python3 scripts/convert.py data.tsv data.csv # delimiter change
```
Output format is picked from the extension. Allowed extensions: `.csv`, `.tsv`, `.jsonl`, `.json`, `.md`. The Markdown writer escapes `|` and `\n` in cell values so the table stays well-formed.
### 6. Join two files (NEW in v0.3.0)
```bash
# Inner join: only users with at least one order
python3 scripts/merge.py users.csv orders.csv joined.csv \
--left-on id --right-on user_id
# Left join: keep every user, fill unmatched with empty strings
python3 scripts/merge.py users.csv orders.csv left.csv \
--left-on id --right-on user_id --how left
# Same key name on both sides: --on shorthand
python3 scripts/merge.py users.csv orders.csv out.csv --on user_id
# Outer join into JSON Lines, machine-readable summary on stdout
python3 scripts/merge.py a.csv b.csv full.jsonl --on key --how outer --json
```
Duplicate non-key columns are auto-renamed with `--suffix-left` / `--suffix-right` (defaults `_x` / `_y`).
### 7. Group-by aggregations and wide pivots (NEW in v0.3.0)
```bash
# Sum revenue per region
python3 scripts/pivot.py sales.csv by_region.csv \
--group-by region --agg revenue:sum --sort-by revenue_sum --desc
# Multiple aggregations per group
python3 scripts/pivot.py sales.csv detail.csv \
--group-by region,product \
--agg "units:sum,revenue:sum,revenue:avg,product:nunique"
# Wide cross-tab: region × product matrix of revenue
python3 scripts/pivot.py sales.csv crosstab.csv \
--group-by region --pivot-on product --agg revenue:sum --fill 0
# Same wide pivot rendered as Markdown for a report
python3 scripts/pivot.py sales.csv crosstab.md \
--group-by region --pivot-on product --agg revenue:sum --fill "-"
```
Aggregation functions: `count`, `sum`, `avg`/`mean`, `min`, `max`, `first`, `last`, `nunique`. Output column names follow `<col>_<func>` (e.g. `revenue_sum`). `--sort-by` is numeric-aware: numeric columns are ordered numerically, string columns lexicographically.
### 8. Filter rows (NEW in v0.4.0)
```bash
# Numeric comparison
python3 scripts/filter.py orders.csv big.csv --where "amount > 100"
# Combine boolean conditions
python3 scripts/filter.py orders.csv top.csv \
--where "status == approved and amount >= 50"
# Set membership (commas are part of the value, not separators)
python3 scripts/filter.py users.csv targeted.csv \
--where "country in IN,US,UK and signup_year >= 2024"
# Regex match
python3 scripts/filter.py users.csv company.csv \
--where 'email =~ @example\.com$'
# Null / type checks (no right-hand side)
python3 scripts/filter.py users.csv missing.csv --where "phone is_empty"
# Invert the predicate, write only specific columns, cap at N matches
python3 scripts/filter.py log.csv non_errors.csv \
--where "level == ERROR" --invert --columns ts,msg --limit 1000
```
The expression language is deliberately small and is parsed by a hand-rolled tokenizer + recursive-descent parser. There is no `eval`, no shell, no subprocess.
### 9. Sort by one or more columns (NEW in v0.4.0)
```bash
# Auto-numeric sort, descending
python3 scripts/sort.py sales.csv s.csv --by amount:desc
# Multi-key: country ascending, signup_date descending (stable)
python3 scripts/sort.py users.csv s.csv --by country:asc,signup_date:desc
# Top 10 by revenue
python3 scripts/sort.py sales.csv top10.csv --by revenue:desc --limit 10
# Case-insensitive string sort
python3 scripts/sort.py contacts.csv s.csv --by name --case-insensitive
```
Each `--by` column is treated numerically when every value parses as a number, otherwise string. `--numeric` forces numeric on all sort columns (non-numeric rows sort last).
### 10. Concatenate multiple files (NEW in v0.4.0)
```bash
# Stack monthly shards into one CSV (header union)
python3 scripts/concat.py all_quarter.csv jan.csv feb.csv mar.csv
# Strict mode: require every input to have an identical header
python3 scripts/concat.py all.csv jan.csv feb.csv mar.csv --strict
# Tag each row with the source filename (without extension)
python3 scripts/concat.py tagged.csv shard_*.csv --add-source origin --source-stem
# Stack + drop duplicate rows across files
python3 scripts/concat.py all.csv jan.csv feb.csv apr.csv --dedupe
```
### 11. Transform columns (NEW in v0.5.0)
```bash
# Add a derived column
python3 scripts/transform.py orders.csv with_profit.csv \
--add 'profit = revenue - cost'
# Multiple --add operations + cast + final column selection
python3 scripts/transform.py sales.csv clean.csv \
--add 'profit = revenue - cost' \
--add 'margin_pct = round(profit / revenue * 100, 1)' \
--add 'name = lower(strip(first_name)) + "_" + lower(strip(last_name))' \
--add 'signup_year = year(signup)' \
--cast revenue:float --cast cost:float \
--keep id,name,country,revenue,profit,margin_pct,signup_year
# Rename and drop
python3 scripts/transform.py users.csv clean.csv \
--rename 'signup=joined_date' --drop password_hash
# Boolean comparisons produce 0/1 columns
python3 scripts/transform.py orders.csv flagged.csv \
--add 'is_high_value = amount > 1000'
# Fallback for missing values
python3 scripts/transform.py users.csv filled.csv \
--add 'safe_email = coalesce(email, "unknown@example.com")'
```
The expression language is intentionally small: arithmetic (`+ - * / %`), string concat (`+`), comparisons (`== != < <= > >=` → yield 0/1), parentheses, identifiers (column references), string and number literals, and function calls. **No `eval`, no `subprocess`, no shell.** Empty cells that propagate into arithmetic leave the derived value empty for that row instead of crashing the pipeline.
## Exit codes
| Code | Meaning |
|---|---|
| 0 | success / validation pass / files identical |
| 1 | validation fail / files differ / no rows in input |
| 2 | bad arguments / unsafe path / missing input / unsupported extension / schema malformed |
This 0/1/2 split is consistent across all five scripts, so they slot into shell pipelines cleanly:
```bash
python3 scripts/validate.py incoming.csv --schema schema.json \
&& python3 scripts/dedupe.py incoming.csv clean.csv --key id \
&& python3 scripts/inspect.py clean.csv
```
## Safety properties
- Pure Python 3 standard library. No third-party dependencies.
- No `subprocess` calls. No shell invocation.
- All file paths are validated against a strict allowlist regex that rejects shell metacharacters (`;`, `|`, `&`, `>`, `<`, `$`, `` ` ``, backslash-newline, etc.).
- Scripts only read the input paths the caller provides and write to the output paths the caller provides. No temp files outside the system's tempdir.
- All inputs and outputs use UTF-8 by default; CSV reads auto-fall-back through utf-8-sig, cp1252, and latin-1 when the file's encoding is non-UTF-8.
- Deterministic: the same input produces the same output every time.
## Performance
- `inspect.py` profiles 10,000 rows in well under one second on a single core (single-pass streaming read).
- All scripts stream rows; they do not load the entire file into memory for processing. The exception is `dedupe.py` and `diff.py`, which build an in-memory dict keyed by row identity — fine for hundreds of thousands of rows on a typical laptop.
- No background threads, no process pool, no caching.
## Known limitations
- Type inference uses regex-shape matching, not locale-aware parsing. `"1,234.56"` is detected as `string`, not `float`. Re-export with a different number format if you need different inference.
- The Markdown writer flattens multi-line cells to single lines (newlines become spaces).
- JSON Lines input must have one JSON object per line. Multi-line JSON arrays are not supported; use the regular CSV/JSONL pipeline.
## v0.5.0 changes
- Added `scripts/transform.py`: derived columns + schema operations. Hand-rolled tokenizer + recursive-descent parser (no `eval`, no subprocess) supports arithmetic (+/-/* / / %), string concat (+), parentheses, function calls, and boolean comparisons that yield 0/1. Built-in functions: upper, lower, strip, len, abs, round, int, float, str, replace, split, join, coalesce, year, month, day. Six op kinds: `--add`, `--set`, `--drop`, `--rename`, `--cast`, `--keep`. Schema is computed symbolically before the streaming pass, so empty cells in arithmetic columns don't crash the whole pipeline — they just leave the derived value empty for that single row.
- Bug fixed during testing: the schema-detection pass was running expressions against an empty-string row, which broke arithmetic. Replaced with a purely-structural schema walk.
## v0.4.0 changes
- Added `scripts/filter.py`: safe-predicate row filter. Hand-rolled tokenizer + recursive-descent parser, NO `eval` and no `subprocess`. Numeric and string compare, regex (`=~`), `in COMMA,LIST`, `contains`, `is_empty` / `is_not_empty` / `is_number` / `is_not_number`. Boolean `and` / `or` / `not` with parentheses. `--invert`, `--limit`, `--columns`.
- Added `scripts/sort.py`: type-aware stable sort with per-column direction (`--by amount:desc,region:asc`). Auto-detects numeric columns. Optional `--case-insensitive`, `--limit`, `--numeric`. 100k rows sorted in ~0.3 s.
- Added `scripts/concat.py`: vertical UNION ALL of multiple CSV / TSV / JSONL files. Header union by default, `--strict` for exact-match check, `--add-source` to tag rows, `--dedupe` to drop exact-duplicate rows. Streams one input at a time, memory does not grow with the number of inputs (unless `--dedupe` is set).
- All three scripts honor the existing safe-path policy and the 0 / 1 / 2 exit-code contract.
## v0.3.0 changes
- Added `scripts/merge.py`: join two CSV / TSV / JSONL files on one or more key columns. Supports inner, left, right, outer joins; separate key names per side; duplicate-column suffixing; CSV / TSV / JSONL output. Single-pass over LEFT, peak memory ≈ size of RIGHT. Merged 50k × 200k rows in ~1.3 s.
- Added `scripts/pivot.py`: group-by aggregations and wide pivot tables. Functions: count, sum, avg, min, max, first, last, nunique. Wide mode produces region × product cross-tabs. Numeric-aware sort. Streamed 100k rows in ~0.7 s.
- All new scripts honor the existing safe-path policy (no shell metacharacters), use exit code 2 for bad arguments / missing files / missing columns, exit 1 for empty results, exit 0 for success. Output extension is validated against an explicit allow-list per script.
## v0.2.0 changes
**Three new preview helpers** (`head.py`, `tail.py`, `sample.py`):
- `head.py` and `tail.py` give shell-style preview that is format-aware and never mangles quoting the way a naive `head` / `tail` would. They auto-detect dialect (csv/tsv/jsonl), let you pick the output format with `--as`, and can re-emit any subset of columns with `--columns`.
- `sample.py` runs reservoir sampling (algorithm R): a single streaming pass, O(N) memory regardless of file size. `--seed INT` makes the sample reproducible so it slots into test suites and CI; `--preserve-order` re-sorts the reservoir back into original row order.
- All three share the same `--as csv|tsv|jsonl|md|aligned`, `--output`, and `--columns` flags, mirroring the convention already used by `convert.py`.
- Default output format is `aligned`, a fixed-width text table that an agent can paste straight into a reply.
**Performance**: on a 100,000-row / 1.6 MB CSV, `head` runs in ~50 ms, `tail` in ~180 ms, `sample` in ~260 ms.
**No breaking changes**: every v0.1.0 CLI flag, output format, and exit-code contract is preserved.
## License
MIT. See `LICENSE`.
don't have the plugin yet? install it then click "run inline in claude" again.