---
name: clean-text-toolkit
description: Local text cleanup and inspection toolkit. Extract structured items (URLs, emails, phones, IPs, dates, hashtags, money), redact PII (email/phone/credit-card-with-Luhn/SSN/JWT/AWS keys/UUIDs), normalize (BOM/CRLF/smart-quotes/whitespace/tabs/case/Unicode NFC), line utilities (count/dedupe/sort/shuffle/head/tail), word-frequency stats with stopwords, three-mode text diffs (unified/side/HTML), no-Jinja2 template renderer with filters and defaults, URL-safe slug generator, and Markdown converter (strip-to-text / minimal HTML / extract headings/links/images/code/lists). Pure Python 3 standard library, no third-party dependencies, no remote calls.
license: MIT
metadata: {"openclaw":{"requires":{"bins":["python3"]},"primaryEnv":null,"homepage":"https://clawhub.ai/gopendrasharma89-tech/clean-text-toolkit"}}
---
# clean-text-toolkit
v0.3.0
A small, honest local toolkit for the work agents end up doing constantly: read some text someone sent you, find the structured bits, clean it up, redact the secrets, and forward it downstream. Built on Python 3 standard library only. No `pandas`, no `nltk`, no pip installs, no remote calls.
This skill is the companion to [`clean-csv-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-csv-toolkit): that one handles structured tabular data, this one handles unstructured text.
## What this skill does
- `scripts/extract.py` — pull structured items out of any text file. Kinds: `url`, `email`, `phone`, `ipv4`, `ipv6`, `hashtag`, `mention`, `hex-color`, `money`, `iso-date`. Output to stdout (one-per-line or JSON), or to a `.txt` / `.json` / `.jsonl` file. Optional `--unique`, `--sort`, `--with-line` (prefix with the source line number).
- `scripts/normalize.py` — clean up messy text. Chainable transforms applied in command-line order: `--trim`, `--collapse-spaces`, `--strip-blank`, `--to-unix`, `--to-crlf`, `--dehyphenate` (rejoin OCR/PDF hyphenated line-breaks), `--unsmart` (smart quotes / em-dashes → ASCII), `--strip-bom`, `--strip-zwsp` (zero-width spaces and joiners), `--tabs-to-spaces N`, `--spaces-to-tabs N`, `--lower` / `--upper` / `--title`, `--normalize-unicode NFC|NFD|NFKC|NFKD`.
- `scripts/redact.py` — anonymize text by replacing PII-like patterns with placeholder tokens. Kinds: `email`, `phone`, `ipv4`, `ipv6`, `url`, `credit-card` (with Luhn validation to suppress false positives), `ssn-us`, `uuid`, `hex-token` (32+ hex chars, typical for tokens / hashes), `aws-access-key` (AKIA…), `jwt` (three base64url segments with the `eyJ` header). `--keep-counts` makes the same value always get the same placeholder; `--preserve-length` pads/truncates the placeholder to the original length.
- `scripts/lines.py` — line-oriented utilities. `--op count | dedupe | sort | shuffle | head | tail`. Streams `count`, `head`, `tail`. `dedupe` and `sort` are O(N) memory in the number of lines, but each line is small so 1 M lines is fine on a laptop. `--case-insensitive`, `--keep first|last`, `--numeric`, `--reverse`, `--seed` for deterministic shuffles.
- `scripts/wordcount.py` — word / character / line / sentence statistics. Optional `--top N` for most-frequent words, `--stopwords PATH`, `--min-length N`, `--ignore-case`, `--regex PATTERN` (default `[A-Za-z']+`).
- `scripts/diff_text.py` — three-mode text diff using stdlib `difflib`. `--mode unified` (default), `--mode side` (custom two-column layout), `--mode html` (writes a full HTML file with red/green coloring). `--ignore-case`, `--ignore-whitespace`, `--context N`.
- `scripts/template.py` (NEW in v0.2.0) — substitute placeholders in a text file with values from a JSON object or inline `--set key=value` overrides. Mustache (`{{name}}`), dollar (`${name}`), or percent (`%(name)s`) syntax. Filters: upper, lower, title, strip, capitalize, reverse, len, escape-html, escape-json, urlencode. Default values: `{{name ?Unknown}}`. Strict mode (`--strict`) exits 1 if any placeholder is unresolved. **No Jinja2, no `eval`.**
- `scripts/slug.py` (NEW in v0.2.0) — turn strings into URL-safe slugs. Single string mode (`--text "Hello World"`) or batch mode (line-in-file -> line-out-file). Options: `--separator`, `--max-length`, `--no-lower`, `--ascii` (Unicode -> ASCII transliteration via NFKD), `--keep-dots` (useful for filenames), `--dedupe`.
- `scripts/markdown.py` (NEW in v0.2.0) — strip Markdown to plain text, render a minimal HTML approximation, or extract structured items (headings, links, images, code blocks, list items) as JSON / JSONL / TSV. For text mode, `--link-style anchor|url|both` controls how `[text](url)` is rendered.
- `scripts/replace.py` (NEW in v0.3.0) — find-and-replace with regex / literal / word-boundary modes, capture-group back-references (`\1`, `\2`), multiple `--find/--replace` pairs in a single pass, or a JSON `--rules` file with per-rule settings. `--dry-run` previews matches with line:col and context; `--max N` caps replacements per rule. Returns exit 1 when zero replacements happen so it slots into CI.
- `scripts/check_deps.sh` — verify `python3` is available.
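To make the mustache-style substitution described for `scripts/template.py` more concrete, here is a minimal sketch of placeholder rendering with pipe filters, `?` defaults, and a strict mode. It is not the toolkit's code: the `render` function, the `PLACEHOLDER` regex, and the `FILTERS` table are hypothetical names, only a handful of the listed filters are covered, and raising `KeyError` stands in for the script's exit code 1.

```python
import html
import re
from urllib.parse import quote

# Hypothetical filter table; the real script lists more filters than this.
FILTERS = {
    "upper": str.upper,
    "lower": str.lower,
    "strip": str.strip,
    "escape-html": html.escape,
    "urlencode": lambda s: quote(s, safe=""),
}

# Matches {{ name }}, {{ name|filter }}, and {{ name ?default }}.
PLACEHOLDER = re.compile(
    r"\{\{\s*(?P<name>[A-Za-z_]\w*)"
    r"(?P<filters>(?:\s*\|\s*[\w-]+)*)"
    r"(?:\s+\?(?P<default>[^}]*))?\s*\}\}"
)


def render(template, values, strict=False):
    """Substitute placeholders without eval; unknown names stay untouched."""
    unresolved = []

    def substitute(match):
        name = match.group("name")
        if name in values:
            text = str(values[name])
        elif match.group("default") is not None:
            text = match.group("default").strip()
        else:
            unresolved.append(name)
            return match.group(0)  # leave the placeholder visible
        for filter_name in re.findall(r"[\w-]+", match.group("filters")):
            text = FILTERS[filter_name](text)
        return text

    rendered = PLACEHOLDER.sub(substitute, template)
    if strict and unresolved:
        # Assumption: stands in for the script's "exit 1" in --strict mode.
        raise KeyError(f"unresolved placeholders: {unresolved}")
    return rendered
```

With `{"name": "Alice"}`, the template `Hi {{name|upper}}, role: {{role ?Unknown}}` falls back to the default for the missing `role` key. A plain regex substitution keeps the renderer free of `eval` and of any template engine dependency.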
## What this skill does not do
- It does not call any LLM, web service, or remote API.
- It does not load entire files into memory unless an operation truly needs the whole file (full-content normalization, sort-and-write, diff). Streaming-friendly operations (`extract`, `lines --op count|head|tail`, and the character and line counters in `wordcount`) read one line at a time.
- It does not write outside the input/output paths the caller provides.
## Quick start
### 1. Pull every email out of a log file
```bash
python3 scripts/extract.py app.log --kind email --unique --sort
python3 scripts/extract.py app.log --kind email --output emails.txt --unique
```
### 2. Find every URL and tag it with the source line
```bash
python3 scripts/extract.py article.md --kind url --with-line
```
### 3. Clean up a messy OCR dump
```bash
python3 scripts/normalize.py scanned.txt clean.txt \
--strip-bom --to-unix --dehyphenate --collapse-spaces \
--unsmart --strip-blank --normalize-unicode NFC
```
The transforms run in the order you list them on the command line.
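Why the order matters can be seen in a small sketch. The functions below are simplified stand-ins written for this example, not the toolkit's implementations; in particular, this `dehyphenate` only recognises LF line breaks, which is an assumption made to show the effect.

```python
import re


def strip_bom(text):
    # Drop a leading byte-order mark if present.
    return text[1:] if text.startswith("\ufeff") else text


def to_unix(text):
    # CRLF first, then any lone CR.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def dehyphenate(text):
    # Rejoin a word split as "exam-\nple"; assumes LF line endings.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def collapse_spaces(text):
    # Squeeze runs of spaces or tabs into one space.
    return re.sub(r"[ \t]{2,}", " ", text)


def apply_in_order(text, transforms):
    # Apply each transform to the output of the previous one.
    for transform in transforms:
        text = transform(text)
    return text


raw = "\ufeffexam-\r\nple   text"
good = apply_in_order(raw, [strip_bom, to_unix, dehyphenate, collapse_spaces])
bad = apply_in_order(raw, [strip_bom, dehyphenate, to_unix, collapse_spaces])
print(repr(good))  # 'example text'
print(repr(bad))   # 'exam-\nple text'
```

In the second ordering the hyphenated break is still CRLF when `dehyphenate` runs, so the word is never rejoined. Under the same assumption, listing `--to-unix` before `--dehyphenate` is the safer habit.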
### 4. Redact PII before sharing a transcript
```bash
python3 scripts/redact.py transcript.txt safe.txt
# default kinds = all
# default placeholder = [REDACTED_{kind}_{i}]
```
```bash
# Only redact emails and phones, give the same email the same placeholder
python3 scripts/redact.py transcript.txt safe.txt \
--kinds email,phone --keep-counts
```
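One way to picture `--keep-counts` is a lookup table from each matched value to the placeholder it first received. The sketch below handles emails only and is not the toolkit's code: the `EMAIL` regex and the `redact_emails` function are hypothetical, and numbering every occurrence separately when the flag is off is an assumption.

```python
import re

# Deliberately loose email pattern, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text, template="[REDACTED_{kind}_{i}]", keep_counts=False):
    seen = {}    # value -> placeholder, consulted only when keep_counts is on
    counter = 0

    def replace(match):
        nonlocal counter
        value = match.group(0)
        if keep_counts and value in seen:
            return seen[value]  # same value, same placeholder
        counter += 1
        placeholder = template.format(kind="email", i=counter)
        seen[value] = placeholder
        return placeholder

    return EMAIL.sub(replace, text)
```

With `keep_counts=True`, the text `a@x.com b@y.org a@x.com` becomes `[REDACTED_email_1] [REDACTED_email_2] [REDACTED_email_1]`, so a reader can still tell that the first and third addresses were the same person.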
```bash
# Custom template
python3 scripts/redact.py log.txt safe.txt \
--token-template "<<{kind}#{i}>>"
```
```bash
# Pad placeholder to match original length (for fixed-width layouts)
python3 scripts/redact.py log.txt safe.txt --preserve-length
```
Credit-card matches are validated against the Luhn checksum so 16 random digits in a row don't trigger a false positive.
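For reference, the Luhn checksum itself is short. The sketch below shows the standard algorithm; the function name `luhn_valid` and the 13-digit minimum length are assumptions for this example, not details taken from `redact.py`.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # assumed lower bound for card-like numbers
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True (a common test number)
print(luhn_valid("1234 5678 1234 5678"))  # False
```

Roughly one in ten random digit strings still passes the checksum, so the check reduces false positives rather than eliminating them.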
### 5. Line utilities
```bash
# Quick file stats
python3 scripts/lines.py haystack.txt --op count
# Drop duplicates, case-insensitive
python3 scripts/lines.py users.txt --op dedupe --case-insensitive --output unique.txt
# Numeric sort (so "100" > "23" > "7")
python3 scripts/lines.py scores.txt --op sort --numeric --reverse
# Deterministic shuffle
python3 scripts/lines.py prompts.txt --op shuffle --seed 42
# Look at the head and tail of a multi-gig log
python3 scripts/lines.py huge.log --op head -n 20
python3 scripts/lines.py huge.log --op tail -n 20
```
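The `dedupe` and `shuffle` operations can be sketched in a few lines. This is an illustration under assumptions, not the code in `lines.py`: the function names are hypothetical, `casefold()` is one possible way to compare case-insensitively, and the exact order produced by the toolkit's `--seed 42` may differ from this sketch.

```python
import random


def dedupe(lines, case_insensitive=False, keep="first"):
    """Drop duplicate lines, keeping the first or the last copy of each."""
    if keep == "last":
        # Walk backwards so the last copy wins, then restore the order.
        kept = dedupe(list(reversed(lines)), case_insensitive, keep="first")
        return list(reversed(kept))
    seen = set()
    result = []
    for line in lines:
        key = line.casefold() if case_insensitive else line
        if key not in seen:
            seen.add(key)
            result.append(line)
    return result


def seeded_shuffle(lines, seed):
    """Shuffle a copy of `lines`; the same seed gives the same order."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Using a private `random.Random(seed)` instance rather than the module-level generator keeps the shuffle reproducible even if other code also draws random numbers.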
### 6. Word counts
```bash
# Basic stats
python3 scripts/wordcount.py essay.txt
# Top words with stopwords filter
python3 scripts/wordcount.py essay.txt --top 20 --ignore-case --stopwords stop.txt
# Machine-readable output
python3 scripts/wordcount.py essay.txt --top 10 --json > stats.json
```
### 7. Text diff
```bash
# Standard unified diff
python3 scripts/diff_text.py before.txt after.txt
# Side-by-side
python3 scripts/diff_text.py before.txt after.txt --mode side
# HTML report (colorized) for sharing
python3 scripts/diff_text.py before.txt after.txt --mode html --output diff.html
# Whitespace-insensitive compare
python3 scripts/diff_text.py before.txt after.txt --ignore-whitespace
```
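The unified and HTML modes map onto two standard `difflib` calls. The sketch below shows those calls directly; the wrapper names `unified` and `html_report` and the file labels are made up for the example, and how `diff_text.py` wires its flags to them is not shown in this document.

```python
import difflib

before = ["alpha", "beta", "gamma"]
after = ["alpha", "BETA", "gamma", "delta"]


def unified(a, b, context=3):
    # lineterm="" because the input lines carry no trailing newline.
    return list(
        difflib.unified_diff(a, b, "before.txt", "after.txt", n=context, lineterm="")
    )


def html_report(a, b, context=3):
    # make_file returns a complete HTML document as a string.
    return difflib.HtmlDiff().make_file(
        a, b, "before.txt", "after.txt", context=True, numlines=context
    )


for line in unified(before, after):
    print(line)
```

The `context` argument plays the role that `--context N` plays on the command line: it sets how many unchanged lines surround each change.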
## Exit codes
| Code | Meaning |
|---|---|
| 0 | success / one or more matches / files identical |
| 1 | zero matches / zero redactions / files differ / empty input |
| 2 | bad arguments / unsafe path / missing input / unknown kind / bad regex / unsupported output extension |
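This 0 / 1 / 2 split is consistent across all ten scripts, so they slot into shell pipelines cleanly. Because `&&` stops at the first non-zero exit, a step that finds nothing (exit 1) halts the chain just like an error (exit 2):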
```bash
# Normalize, then redact, then count words in one shot
python3 scripts/normalize.py raw.txt clean.txt --to-unix --dehyphenate \
&& python3 scripts/redact.py clean.txt safe.txt \
&& python3 scripts/wordcount.py safe.txt --top 10
```
## Safety properties
- Pure Python 3 standard library. No third-party dependencies, no `pip install`.
- No `subprocess` calls. No shell invocation.
- All file paths are validated against a strict allowlist regex that rejects shell metacharacters (`;`, `|`, `&`, `>`, `<`, `$`, `` ` ``, etc.). This is the same `safe_path()` helper used by `clean-csv-toolkit`.
- Scripts only read the input paths the caller provides and write to the output paths the caller provides.
- All inputs and outputs default to UTF-8; reads fall back through `utf-8-sig`, `cp1252`, `latin-1` if needed. Writes are always UTF-8.
- Deterministic where it matters: `shuffle --seed N` is reproducible; `extract` and `wordcount` always emit results in the same order for a given input.
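An allowlist check of the kind described for `safe_path()` might look like the sketch below. The actual pattern used by the toolkit is not shown in this document, so the character set here is an assumption, and raising `ValueError` stands in for the scripts' exit code 2.

```python
import re

# Assumed allowlist: letters, digits, underscore, dot, slash, space, hyphen.
SAFE_PATH = re.compile(r"[A-Za-z0-9_./ -]+")


def safe_path(path):
    """Return `path` unchanged if every character is on the allowlist."""
    if not SAFE_PATH.fullmatch(path):
        # Assumption: the real scripts report this as exit code 2.
        raise ValueError(f"unsafe path: {path!r}")
    return path


print(safe_path("logs/app.log"))  # logs/app.log
```

An allowlist rejects anything not explicitly permitted, so characters such as `;`, `|`, `$`, and the backtick fail without each having to be named.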
## Performance
- `lines.py --op dedupe` processes 100,000 short lines (500 distinct) in ~0.06 s.
- `lines.py --op sort` processes 100,000 lines in ~0.10 s.
- `extract.py` scans the file in a single streaming pass — memory does not grow with file size.
## Known limitations
- The PII patterns are pragmatic heuristics, not strict RFC validators. The `email` regex accepts `user@host.tld` shapes but does not validate that `host.tld` resolves. `phone` accepts three telltale formats (`+<digits>`, `(XXX) XXX-XXXX`, `XXX-XXX-XXXX` / `XXX XXX XXXX`) so it doesn't grab IPs, dates, or credit-card numbers — but it will miss exotic local formats.
- `credit-card` uses the Luhn checksum, but `hex-token` (and similar high-recall patterns) intentionally over-match; review the count before sharing redacted output publicly.
- `diff_text.py --mode html` produces the standard `difflib.HtmlDiff` markup, which embeds inline styles. The file is portable but the styling is not customizable.
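The three phone shapes listed above can be written as one alternation. This is a sketch of the idea, not the regex from `extract.py` or `redact.py`: the 7 to 15 digit range for the `+<digits>` form and the lookaround guards are assumptions.

```python
import re

PHONE = re.compile(
    r"(?<![\w.])"                 # not glued to a preceding word, digit, or dot
    r"(?:\+\d{7,15}"              # +<digits>   (assumed 7-15 digits)
    r"|\(\d{3}\) \d{3}-\d{4}"     # (XXX) XXX-XXXX
    r"|\d{3}-\d{3}-\d{4}"         # XXX-XXX-XXXX
    r"|\d{3} \d{3} \d{4})"        # XXX XXX XXXX
    r"(?![\w.])"                  # not followed by a word char, digit, or dot
)


def find_phones(text):
    return PHONE.findall(text)


sample = (
    "Call +14155550123 or (415) 555-0123 or 415-555-0123; "
    "ip 192.168.1.10, date 2024-01-15, card 4111 1111 1111 1111."
)
print(find_phones(sample))
# ['+14155550123', '(415) 555-0123', '415-555-0123']
```

Spelling out explicit shapes trades recall for precision: the IP address, ISO date, and spaced card number in the sample are left alone, and a number written as `+1 415 555 0123` would be missed.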
## v0.3.0 changes
- Added `scripts/replace.py`: sed-like find-and-replace with optional regex, capture-group back-references, multiple find/replace pairs in one pass, JSON `--rules` file, `--dry-run` preview with line:col context, `--max N` cap per rule, `--word` boundaries for literal mode.
- Fixed `extract.py`: `--kind url` was grabbing trailing sentence-punctuation (`.`, `)`, `,`, etc.) as part of the URL. Now strips a single trailing punctuation char so `Visit https://example.com.` correctly extracts `https://example.com` instead of `https://example.com.`.
- Fixed `slug.py`: `--text` mode with input that slugifies to an empty string (e.g. `"!!! @@@"`) now exits 1, matching the existing batch-mode behaviour. Previously it returned 0 silently.
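The `extract.py` fix for trailing punctuation can be sketched as a post-processing step on each match. The URL regex, the `TRAILING` character set, and the function name below are assumptions for illustration; only the "strip a single trailing punctuation char" rule comes from the changelog entry.

```python
import re

# Loose URL pattern, for illustration only.
URL = re.compile(r"https?://[^\s<>\"']+")
# Assumed set of sentence punctuation that may trail a URL.
TRAILING = ".,;:!?)]}"


def extract_urls(text):
    urls = []
    for match in URL.finditer(text):
        url = match.group(0)
        if url[-1] in TRAILING:
            url = url[:-1]  # strip exactly one trailing punctuation char
        urls.append(url)
    return urls


print(extract_urls("Visit https://example.com."))  # ['https://example.com']
```

Because only one character is removed in this sketch, stacked punctuation such as `),` would still leave a `)` on the extracted URL.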
## v0.2.0 changes
- Added `scripts/template.py`: no-Jinja2 template renderer. Three placeholder syntaxes (mustache `{{x}}`, dollar `${x}`, percent `%(x)s`), pipe filters, fallback defaults, and an optional `--strict` mode for CI. **Hand-rolled regex tokenizer, no `eval`, no `subprocess`.**
- Added `scripts/slug.py`: URL-safe slug generator. Single-string mode (prints to stdout) or batch mode (one slug per input line). Unicode-aware with optional ASCII transliteration via NFKD; `--keep-dots` for filename use; `--dedupe` for batch outputs.
- Added `scripts/markdown.py`: three-mode Markdown processor. `text` strips all markup; `html` renders a minimal HTML approximation (headings, paragraphs, lists, blockquotes, fenced code, links, images, bold/italic/code); `extract` pulls structured items (headings, links, images, code blocks, list items) as JSON / JSONL / TSV.
- All three new scripts share the same safe-path policy and 0 / 1 / 2 exit-code contract as the rest of the toolkit.
## v0.1.0 changes
- First public release of clean-text-toolkit.
- Six scripts: `extract.py`, `normalize.py`, `redact.py`, `lines.py`, `wordcount.py`, `diff_text.py`.
- Shared `_common.py` with `safe_path`, `read_text`, `iter_lines`, and `write_text` helpers (mirrors the design of `clean-csv-toolkit/scripts/_common.py`).
- Bug fixed during development: initial `phone` regex was too greedy and matched IPs / ISO dates / credit-card-with-spaces; tightened to three explicit shapes (international, parenthesized, 3-3-4 dashed) that don't collide with those other patterns. Tested against a mixed-content fixture with 5 valid phones and 3 confusable non-phones.
- Zero third-party dependencies; works on any system that ships Python 3.
## Pairs well with
- [`clean-csv-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-csv-toolkit) — same author, same design philosophy (pure stdlib, exit-code contract, safe-path policy), for structured tabular data.
- [`openclaw-prompt-shield`](https://clawhub.ai/gopendrasharma89-tech/openclaw-prompt-shield) — pair `extract.py --kind email,url` with prompt-shield's redaction pipeline to scrub user-supplied text before passing it to an LLM.
## License
MIT
restructured original into implexa's six-component format (intent, inputs, procedure with 10 numbered subsections, decision points, output contract, outcome signal), added edge cases and known limits, clarified exit codes and chaining behavior, preserved all ten scripts and original safety properties, removed marketing language and added punchy tech voice.
v0.3.0
local toolkit for the grind: read text, find structured bits, clean it up, redact secrets, pass it downstream. python 3 standard library only. no pandas, no nltk, no pip, no remote calls.
companion to clean-csv-toolkit: that handles tabular data, this handles unstructured text.
use clean-text-toolkit when you need to extract patterns from plain text (urls, emails, phones, ips, credit cards, dates, hashtags, color codes, money amounts), redact personally identifiable info (email, phone, ssn, credit cards, aws keys, jwts, uuids), normalize messy encoding and whitespace (bom, crlf, smart quotes, unicode normalization), or perform line-level operations (dedupe, sort, count, shuffle, head, tail). also handles word-frequency stats, text diffs in three modes (unified, side-by-side, html), template substitution without jinja2, url-safe slug generation, and markdown conversion to plain text or minimal html. run locally on your machine with zero external dependencies or network calls.
- `python3` binary (required, verified by `scripts/check_deps.sh`)
- `--stopwords PATH`: newline-delimited word list for `wordcount.py`
- `--rules PATH`: json file of find-and-replace rules for `replace.py`
- `--set key=value` for `template.py`

inputs: text file, pattern kind (url, email, phone, ipv4, ipv6, hashtag, mention, hex-color, money, iso-date), optional flags (`--unique`, `--sort`, `--with-line`)

`python3 scripts/extract.py <input-file> --kind <kind>` (stdout) or `--output <file>` (write to file)

- `--with-line`: prefix each match with its source line number
- `--unique`: dedupe matches before output
- `--sort`: alphabetically sort matches

outputs: extracted items to stdout or file
inputs: text file, chain of transforms in command-line order

`python3 scripts/normalize.py <input-file> <output-file> [transforms...]`

- `--trim`: strip leading/trailing whitespace from each line
- `--collapse-spaces`: condense runs of spaces to single space
- `--strip-blank`: drop empty lines
- `--to-unix`: convert crlf and cr to lf
- `--to-crlf`: convert lf and cr to crlf
- `--dehyphenate`: rejoin hyphenated line-breaks (ocr/pdf artifact)
- `--unsmart`: convert smart quotes and em-dashes to ascii equivalents
- `--strip-bom`: remove byte-order mark
- `--strip-zwsp`: remove zero-width spaces and joiners
- `--tabs-to-spaces N`: replace tabs with N spaces
- `--spaces-to-tabs N`: replace N spaces with tabs
- `--lower` / `--upper` / `--title`: case conversion
- `--normalize-unicode NFC|NFD|NFKC|NFKD`: unicode normalization form

outputs: normalized text file
inputs: text file, pii kinds (email, phone, ipv4, ipv6, url, credit-card with luhn validation, ssn-us, uuid, hex-token 32+ hex chars, aws-access-key AKIA*, jwt with eyJ header), optional flags (`--keep-counts`, `--preserve-length`, `--token-template`)

`python3 scripts/redact.py <input-file> <output-file> [--kinds email,phone,...]`

- placeholder: `[REDACTED_{kind}_{i}]` by default, or custom `--token-template`
- `--keep-counts`: identical pii values get identical placeholders (deterministic)
- `--preserve-length`: pad or truncate placeholder to match original length

outputs: redacted text file with pii replaced
inputs: text file, operation (count, dedupe, sort, shuffle, head, tail), optional flags (`--case-insensitive`, `--keep first|last`, `--numeric`, `--reverse`, `--seed` for shuffle, `-n` for head/tail count)

`python3 scripts/lines.py <input-file> --op <operation> [flags]` (stdout) or `--output <file>` (write to file)

- `count`: return total line count (streams, no buffering)
- `head -n N`: output first N lines (streams)
- `tail -n N`: output last N lines (streams)
- `dedupe`: drop duplicate lines (buffered in memory, case-insensitive if flag set, `--keep first|last` chooses which copy to keep)
- `sort`: sort lines alphabetically (buffered, `--numeric` sorts numerically, `--reverse` reverses order, `--case-insensitive` for case-insensitive sort)
- `shuffle`: randomize line order (buffered, `--seed N` for deterministic output)

outputs: processed lines to stdout or file
inputs: text file, optional flags (`--top N` for most frequent words, `--stopwords` file, `--min-length N`, `--ignore-case`, `--regex PATTERN` for word boundaries, `--json` for machine-readable output)

`python3 scripts/wordcount.py <input-file> [flags]`

- default regex `[A-Za-z']+` extracts words, override with `--regex`
- `--stopwords`: load file (one word per line) and exclude from top-word results
- `--top N`: output N most frequent words with counts
- `--json`: output stats as json object (word_count, char_count, line_count, sentence_count, top_words array)

outputs: human-readable stats or json to stdout
inputs: before file, after file, mode (unified default, side, html), optional flags (`--ignore-case`, `--ignore-whitespace`, `--context N`)

`python3 scripts/diff_text.py <before-file> <after-file> --mode <mode> [flags]`

- `--mode unified` (default): output unified diff (compatible with patch tools)
- `--mode side`: output two-column side-by-side diff with context
- `--mode html`: output full html file with red/green coloring (write to `--output file.html`)
- `--ignore-case`: perform case-insensitive comparison
- `--ignore-whitespace`: treat all whitespace as equal
- `--context N`: include N lines of unchanged context around each change

outputs: diff to stdout or html file
inputs: template file with placeholders, json object or inline `--set key=value` overrides

`python3 scripts/template.py <template-file> [--set key=value...] [--json-file path]`

- placeholder syntaxes: mustache `{{name}}`, dollar `${name}`, percent `%(name)s`
- filters: `{{name|upper}}`, `{{url|urlencode}}`, etc. supported filters: upper, lower, title, strip, capitalize, reverse, len, escape-html, escape-json, urlencode
- default values: `{{name ?Unknown}}`
- inline overrides: `--set` args
- `--strict`: exit 1 if any placeholder remains unresolved after substitution

outputs: rendered text to stdout

no jinja2, no eval, hand-rolled regex tokenizer only
inputs: text string or file, optional flags (`--separator` char, `--max-length N`, `--no-lower`, `--ascii` for unicode transliteration, `--keep-dots` for filenames, `--dedupe`)

`python3 scripts/slug.py --text "string"` (single string mode, stdout) or `python3 scripts/slug.py <input-file> --output <output-file>` (batch mode)

- lowercases by default (disable with `--no-lower`)
- `--ascii`: transliterate unicode to ascii via nfkd decomposition
- `--separator`: separator character (default `-`)
- `--keep-dots`: preserve dots in output (useful for filenames)
- `--dedupe`: drop duplicate separators
- truncates to `--max-length` if set

outputs: slug(s) to stdout or file
inputs: markdown file, mode (text, html, extract), optional flags (`--link-style anchor|url|both` for text mode)

`python3 scripts/markdown.py <input-file> --mode <mode> [flags]`

- `--mode text`: strip markdown to plain text (`--link-style` controls `[text](url)` rendering: anchor = text, url = url, both = text (url))
- `--mode html`: render minimal html approximation (headings, paragraphs, lists, blockquotes, fenced code, links, images, bold/italic/code)
- `--mode extract`: output structured items (headings, links, images, code blocks, list items) as json (`--output file.json`), jsonl (`--output file.jsonl`), or tsv (`--output file.tsv`)

outputs: plain text, html file, or json/jsonl/tsv to stdout or file
inputs: text file, find pattern(s), replace value(s), optional flags (`--dry-run`, `--max N`, `--word` for word boundaries in literal mode, `--rules` json file)

`python3 scripts/replace.py <input-file> <output-file> --find pattern --replace value [--find ... --replace ...]`

- `--rules path.json` with array of `{find, replace, mode, ...}` objects
- modes: regex (default, supports capture groups `\1`, `\2`), literal (exact string match), word (literal with word boundaries)
- `--dry-run`: preview matches with line:col and context, don't write
- `--max N`: cap replacements per rule

outputs: text file with replacements applied
api key / remote auth not needed: all operations are local and stateless. no decision branching required.
output format choice: extract.py, lines.py, wordcount.py, markdown.py support multiple output formats (txt, json, jsonl, tsv, html). caller specifies via --output file.ext or defaults to stdout.
if input file is empty: most scripts exit 1 (no matches, no lines, no words). normalize.py and template.py exit 0 even on empty input (they produce empty output, which is valid).
if no matches for extraction or redaction: exit code 1, so scripts chain cleanly in shell conditionals. extract.py --kind email app.log && echo "found emails" will not echo if zero emails exist.
if redaction produces zero changes: exit 1, signaling caller that no pii was found. useful for ci pipelines that require proof redaction happened.
if replace.py produces zero replacements: exit 1. use in ci to fail if a pattern doesn't match expected input.
if pii validation fails (e.g., luhn checksum for credit cards): pattern is skipped, not redacted. false negatives are preferred over false positives.
if placeholder resolves to empty string (e.g., template.py with missing required key in non-strict mode): output empty string. in --strict mode, exit 1 instead.
if file encoding detection fails: attempt utf-8, utf-8-sig, cp1252, latin-1 in order. if all fail, script exits 2 (bad args / unreadable file).
if output path is unsafe (contains shell metacharacters, directory traversal, etc.): exit 2 and refuse to write. validated by strict allowlist regex that rejects ;, |, &, >, <, $, backtick, and relative paths outside cwd.
if stopwords file does not exist (wordcount.py): exit 2 and report missing file.
if rules json is malformed (replace.py): exit 2 and report parse error.
dedupe and sort return lines in stable order; shuffle is randomized (unless `--seed` provided); head and tail stream without buffering.

all outputs are utf-8. all file paths are validated against safe-path policy (no shell metacharacters, no directory traversal).
- […] `diff` or `diff_text.py` if unsure.
- […] `[REDACTED_email_1]` in place of email addresses. if no redactions occur, exit code 1 warns that no pii matched (verify pattern assumptions).
- […] `+` (added), `-` (removed), context lines. side-by-side shows two columns with pipes marking diffs. html file opens in browser with red/green coloring. exit 0 = files identical, exit 1 = files differ.
- […] `{{name}}` or `${name}` patterns in output (indicate missing keys). in `--strict` mode, missing keys cause exit 1 (fail fast).
- […] `curl` or paste into browser address bar if in doubt.
- […] `#`, `*`, `[link](url)`). in extract mode, json output lists headings, links, images as objects. compare with markdown source visually.
- […] `+<digits>`, `(XXX) XXX-XXXX`, `XXX-XXX-XXXX`, `XXX XXX XXXX` but misses exotic local formats. tune with custom regex in `replace.py` if needed.
- […] `{{` without intent to substitute, escape as `{{` or rewrite template to avoid collision. strict mode (`--strict` flag) forces all placeholders to resolve.
- […] `;`, `|`, `&`, `>`, `<`, `$`, backtick, and relative paths (e.g., `../../../etc/passwd`). all output paths must be within current working directory or absolute paths.

```bash
# extract all emails, dedupe, sort
python3 scripts/extract.py app.log --kind email --unique --sort
# clean up ocr garbage
python3 scripts/normalize.py scanned.txt clean.txt --strip-bom --to-unix --dehyphenate --collapse-spaces
# redact everything, keep counts deterministic
python3 scripts/redact.py transcript.txt safe.txt --keep-counts
# drop duplicate lines, case-insensitive
python3 scripts/lines.py users.txt --op dedupe --case-insensitive --output unique.txt
# top 20 words, ignore case, load stopwords
python3 scripts/wordcount.py essay.txt --top 20 --ignore-case --stopwords stop.txt
# html diff for sharing
python3 scripts/diff_text.py before.txt after.txt --mode html --output diff.html
# template with mustache syntax
python3 scripts/template.py msg.txt --set name=Alice --set role=Engineer
# generate url-safe slug from filename
python3 scripts/slug.py --text "Hello World (2024)" --keep-dots
# strip markdown to plain text
python3 scripts/markdown.py readme.md --mode text
# find and replace with regex capture groups
python3 scripts/replace.py log.txt clean.log --find '(\d{4})-(\d{2})-(\d{2})' --replace '\3/\2/\1'
```
| code | meaning |
|---|---|
| 0 | success, matches found, operation completed, files identical |
| 1 | zero results, empty input, files differ, unresolved placeholders (strict mode) |
| 2 | bad arguments, unsafe path, missing input, unknown kind, bad regex, parse error, unsupported output extension |
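chaining works, with one catch: in `normalize.py && redact.py && wordcount.py`, `&&` stops at the first non-zero exit, so an empty result (exit 1) halts the chain just like a failure (exit 2). check `$?` explicitly, or use `;` between steps, when exit 1 should not be fatal.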
- […] `;`, `|`, `&`, `>`, `<`, `$`, backtick, `'`, `"`), directory traversal (`..`), and relative paths outside cwd.
- `lines.py --op dedupe` on 100,000 short lines (500 distinct): ~60ms.
- `lines.py --op sort` on 100,000 lines: ~100ms.
- `extract.py` scans file in single streaming pass; memory constant regardless of file size (streams one line at a time).
- `wordcount.py` streams line-by-line for char/line/sentence counts; buffers only unique words for frequency ranking.
- `redact.py` single-pass scan with regex matching; memory grows with number of unique pii values (for `--keep-counts` dedup table).
- `clean-csv-toolkit`: same author, same stdlib-only design, handles tabular data.
- `openclaw-prompt-shield`: chain `extract.py --kind email,url` output into prompt-shield to scrub user input before llm calls.

MIT
original author: gopendrasharma89-tech