Clean Text Toolkit

Local text cleanup and inspection toolkit. Extract structured items (URLs, emails, phones, IPs, dates, hashtags, money), redact PII (email/phone/credit-card-...

SkillRank score: 8.4 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

clean-text-toolkit extracts structured patterns (urls, emails, phones, ips, dates, money), redacts pii with luhn validation, normalizes text (unicode, whitespace, quotes), and provides line utilities, word stats, diffs, templating, and markdown processing. pure python 3 stdlib, no dependencies, no remote calls.

structure: 9.0
trigger phrases: 8.0
procedure: 9.0
edge cases: 8.0
documentation: 8.0


---
name: clean-text-toolkit
description: Local text cleanup and inspection toolkit. Extract structured items (URLs, emails, phones, IPs, dates, hashtags, money), redact PII (email/phone/credit-card-with-Luhn/SSN/JWT/AWS keys/UUIDs), normalize (BOM/CRLF/smart-quotes/whitespace/tabs/case/Unicode NFC), line utilities (count/dedupe/sort/shuffle/head/tail), word-frequency stats with stopwords, three-mode text diffs (unified/side/HTML), no-Jinja2 template renderer with filters and defaults, URL-safe slug generator, and Markdown converter (strip-to-text / minimal HTML / extract headings/links/images/code/lists). Pure Python 3 standard library, no third-party dependencies, no remote calls.
license: MIT
metadata: {"openclaw":{"requires":{"bins":["python3"]},"primaryEnv":null,"homepage":"https://clawhub.ai/gopendrasharma89-tech/clean-text-toolkit"}}
---

# clean-text-toolkit

v0.4.0

A small, honest local toolkit for the work agents end up doing constantly: read some text someone sent you, find the structured bits, clean it up, redact the secrets, and forward it downstream. Built on Python 3 standard library only. No `pandas`, no `nltk`, no pip installs, no remote calls.

This skill is the companion to [`clean-csv-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-csv-toolkit): that one handles structured tabular data, this one handles unstructured text.

## What this skill does

- `scripts/extract.py` — pull structured items out of any text file. Kinds: `url`, `email`, `phone`, `ipv4`, `ipv6`, `hashtag`, `mention`, `hex-color`, `money`, `iso-date`. Output to stdout (one-per-line or JSON), or to a `.txt` / `.json` / `.jsonl` file. Optional `--unique`, `--sort`, `--with-line` (prefix with the source line number).
- `scripts/normalize.py` — clean up messy text. Chainable transforms applied in command-line order: `--trim`, `--collapse-spaces`, `--strip-blank`, `--to-unix`, `--to-crlf`, `--dehyphenate` (rejoin OCR/PDF hyphenated line-breaks), `--unsmart` (smart quotes / em-dashes → ASCII), `--strip-bom`, `--strip-zwsp` (zero-width spaces and joiners), `--tabs-to-spaces N`, `--spaces-to-tabs N`, `--lower` / `--upper` / `--title`, `--normalize-unicode NFC|NFD|NFKC|NFKD`.
- `scripts/redact.py` — anonymize text by replacing PII-like patterns with placeholder tokens. Kinds: `email`, `phone`, `ipv4`, `ipv6`, `url`, `credit-card` (with Luhn validation to suppress false positives), `ssn-us`, `uuid`, `hex-token` (32+ hex chars, typical for tokens / hashes), `aws-access-key` (AKIA…), `jwt` (three base64url segments with the `eyJ` header). `--keep-counts` makes the same value always get the same placeholder; `--preserve-length` pads/truncates the placeholder to the original length.
- `scripts/lines.py` — line-oriented utilities. `--op count | dedupe | sort | shuffle | head | tail`. Streams `count`, `head`, `tail`. `dedupe` and `sort` are O(N) memory in the number of lines, but each line is small so 1 M lines is fine on a laptop. `--case-insensitive`, `--keep first|last`, `--numeric`, `--reverse`, `--seed` for deterministic shuffles.
- `scripts/wordcount.py` — word / character / line / sentence statistics. Optional `--top N` for most-frequent words, `--stopwords PATH`, `--min-length N`, `--ignore-case`, `--regex PATTERN` (default `[A-Za-z']+`).
- `scripts/diff_text.py` — three-mode text diff using stdlib `difflib`. `--mode unified` (default), `--mode side` (custom two-column layout), `--mode html` (writes a full HTML file with red/green coloring). `--ignore-case`, `--ignore-whitespace`, `--context N`.
- `scripts/template.py` (NEW in v0.2.0) — substitute placeholders in a text file with values from a JSON object or inline `--set key=value` overrides. Mustache (`{{name}}`), dollar (`${name}`), or percent (`%(name)s`) syntax. Filters: upper, lower, title, strip, capitalize, reverse, len, escape-html, escape-json, urlencode. Default values: `{{name ?Unknown}}`. Strict mode (`--strict`) exits 1 if any placeholder is unresolved. **No Jinja2, no `eval`.**
- `scripts/slug.py` (NEW in v0.2.0) — turn strings into URL-safe slugs. Single string mode (`--text "Hello World"`) or batch mode (line-in-file -> line-out-file). Options: `--separator`, `--max-length`, `--no-lower`, `--ascii` (Unicode -> ASCII transliteration via NFKD), `--keep-dots` (useful for filenames), `--dedupe`.
- `scripts/markdown.py` (NEW in v0.2.0) — strip Markdown to plain text, render a minimal HTML approximation, or extract structured items (headings, links, images, code blocks, list items) as JSON / JSONL / TSV. For text mode, `--link-style anchor|url|both` controls how `[text](url)` is rendered.
- `scripts/replace.py` (NEW in v0.3.0) — find-and-replace with regex / literal / word-boundary modes, capture-group back-references (`\1`, `\2`), multiple `--find/--replace` pairs in a single pass, or a JSON `--rules` file with per-rule settings. `--dry-run` previews matches with line:col and context; `--max N` caps replacements per rule. Returns exit 1 when zero replacements happen so it slots into CI.
- `scripts/htmlstrip.py` (NEW in v0.4.0) — strip HTML tags from scraped pages. Three modes: `text` (collapse to plain readable text, drop `<script>`/`<style>` content, preserve line breaks at block tags), `html` (sanitize — remove `script,style,iframe,object,embed,form,input` tags + all `on*` event-handler attributes + inline `style=`, keep the rest intact), `extract` (pull links/images/headings/tables as JSON/JSONL/TSV). Built on Python stdlib `html.parser`. The single most-asked-for agent capability: turn scraped HTML into something useful in one command.
- `scripts/check_deps.sh` — verify `python3` is available.
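
The `--keep-counts` behaviour listed for `redact.py` above can be pictured with a small sketch. This is not the toolkit's code: `EMAIL_RE`, `redact_emails`, and the choice to start numbering at 1 are assumptions made for illustration; only the default placeholder shape `[REDACTED_{kind}_{i}]` comes from this document.

```python
import re

# Hypothetical, simplified email pattern; the toolkit's real regex is not shown here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text, keep_counts=False, template="[REDACTED_{kind}_{i}]"):
    """Replace each email address with a numbered placeholder.

    With keep_counts=True a repeated value reuses the index it was first given;
    otherwise every occurrence gets a fresh index.
    """
    seen = {}    # value -> index already assigned to it
    counter = 0  # assumption: numbering starts at 1

    def substitute(match):
        nonlocal counter
        value = match.group(0)
        if keep_counts and value in seen:
            index = seen[value]
        else:
            counter += 1
            index = counter
            seen[value] = index
        return template.format(kind="email", i=index)

    return EMAIL_RE.sub(substitute, text)
```

Stable placeholders keep a redacted transcript readable, because a reader can still tell that two messages came from the same (hidden) address.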

## What this skill does not do

- It does not call any LLM, web service, or remote API.
- It does not load entire files into memory unless an operation truly needs the whole file (full-content normalization, sort-and-write, diff). Streaming-friendly operations (`extract`, `lines --op count|head|tail`, and the character and line counters in `wordcount`) read one line at a time.
- It does not write outside the input/output paths the caller provides.

## Quick start

### 1. Pull every email out of a log file

```bash
python3 scripts/extract.py app.log --kind email --unique --sort
python3 scripts/extract.py app.log --kind email --output emails.txt --unique
```

### 2. Find every URL and tag it with the source line

```bash
python3 scripts/extract.py article.md --kind url --with-line
```

### 3. Clean up a messy OCR dump

```bash
python3 scripts/normalize.py scanned.txt clean.txt \
    --strip-bom --to-unix --dehyphenate --collapse-spaces \
    --unsmart --strip-blank --normalize-unicode NFC
```

The transforms run in the order you list them on the command line.

### 4. Redact PII before sharing a transcript

```bash
python3 scripts/redact.py transcript.txt safe.txt
# default kinds = all
# default placeholder = [REDACTED_{kind}_{i}]
```

```bash
# Only redact emails and phones, give the same email the same placeholder
python3 scripts/redact.py transcript.txt safe.txt \
    --kinds email,phone --keep-counts
```

```bash
# Custom template
python3 scripts/redact.py log.txt safe.txt \
    --token-template "<<{kind}#{i}>>"
```

```bash
# Pad placeholder to match original length (for fixed-width layouts)
python3 scripts/redact.py log.txt safe.txt --preserve-length
```

Credit-card matches are validated against the Luhn checksum so 16 random digits in a row don't trigger a false positive.
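
For readers unfamiliar with the Luhn checksum, here is a minimal sketch of the check. The function name `luhn_valid` and the 13 to 19 digit length window are assumptions for illustration; the toolkit's own implementation may differ.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum.

    Spaces and dashes are ignored. The 13-19 digit window is an assumption
    about typical card lengths, not something this document specifies.
    """
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A pattern match that fails this check would simply be left untouched, which is how a run of arbitrary digits avoids being redacted.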

### 5. Line utilities

```bash
# Quick file stats
python3 scripts/lines.py haystack.txt --op count

# Drop duplicates, case-insensitive
python3 scripts/lines.py users.txt --op dedupe --case-insensitive --output unique.txt

# Numeric sort (so "100" > "23" > "7")
python3 scripts/lines.py scores.txt --op sort --numeric --reverse

# Deterministic shuffle
python3 scripts/lines.py prompts.txt --op shuffle --seed 42

# Look at the head and tail of a multi-gig log
python3 scripts/lines.py huge.log --op head -n 20
python3 scripts/lines.py huge.log --op tail -n 20
```

### 6. Word counts

```bash
# Basic stats
python3 scripts/wordcount.py essay.txt

# Top words with stopwords filter
python3 scripts/wordcount.py essay.txt --top 20 --ignore-case --stopwords stop.txt

# Machine-readable output
python3 scripts/wordcount.py essay.txt --top 10 --json > stats.json
```

### 7. Text diff

```bash
# Standard unified diff
python3 scripts/diff_text.py before.txt after.txt

# Side-by-side
python3 scripts/diff_text.py before.txt after.txt --mode side

# HTML report (colorized) for sharing
python3 scripts/diff_text.py before.txt after.txt --mode html --output diff.html

# Whitespace-insensitive compare
python3 scripts/diff_text.py before.txt after.txt --ignore-whitespace
```
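
The unified mode is described as a wrapper around stdlib `difflib`. A rough sketch of that idea, including the 0 / 1 exit-code convention, might look like the following. The function name `unified_diff_text` and the hard-coded file labels are illustrative assumptions, not the script's actual internals.

```python
import difflib


def unified_diff_text(before: str, after: str, context: int = 3):
    """Return (diff_text, exit_code) for two strings.

    exit_code follows the toolkit's stated contract: 0 when the inputs are
    identical, 1 when they differ.
    """
    diff_lines = list(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile="before.txt",  # labels are placeholders for this sketch
            tofile="after.txt",
            n=context,              # lines of context, like --context N
        )
    )
    exit_code = 1 if diff_lines else 0
    return "".join(diff_lines), exit_code
```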

## Exit codes

| Code | Meaning |
|---|---|
| 0 | success / one or more matches / files identical |
| 1 | zero matches / zero redactions / files differ / empty input |
| 2 | bad arguments / unsafe path / missing input / unknown kind / bad regex / unsupported output extension |

This 0 / 1 / 2 split is consistent across all scripts, so they slot into shell pipelines cleanly:

```bash
# Normalize, then redact, then count words in one shot
python3 scripts/normalize.py raw.txt clean.txt --to-unix --dehyphenate \
  && python3 scripts/redact.py clean.txt safe.txt \
  && python3 scripts/wordcount.py safe.txt --top 10
```

## Safety properties

- Pure Python 3 standard library. No third-party dependencies, no `pip install`.
- No `subprocess` calls. No shell invocation.
- All file paths are validated against a strict allowlist regex that rejects shell metacharacters (`;`, `|`, `&`, `>`, `<`, `$`, `` ` ``, etc.). This is the same `safe_path()` helper that powers `clean-csv-toolkit`.
- Scripts only read the input paths the caller provides and write to the output paths the caller provides.
- All inputs and outputs default to UTF-8; reads fall back through `utf-8-sig`, `cp1252`, `latin-1` if needed. Writes are always UTF-8.
- Deterministic where it matters: `shuffle --seed N` is reproducible; `extract` and `wordcount` always emit results in the same order for a given input.
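
An allowlist check of the kind described above could be sketched as follows. The exact character set the real `safe_path()` accepts is not given in this document, so `SAFE_PATH_RE` and `is_safe_path` are hypothetical stand-ins.

```python
import re

# Hypothetical allowlist: letters, digits, underscore, dot, slash, space, hyphen.
# Anything outside this set (; | & > < $ ` and so on) makes the path unsafe.
SAFE_PATH_RE = re.compile(r"[A-Za-z0-9_./ -]+")


def is_safe_path(path: str) -> bool:
    """Return True only if every character of `path` is on the allowlist."""
    return SAFE_PATH_RE.fullmatch(path) is not None
```

An allowlist is the safer design here: any character nobody thought about is rejected by default, instead of slipping past a blocklist.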

## Performance

- `lines.py --op dedupe` processes 100,000 short lines (500 distinct) in ~0.06 s.
- `lines.py --op sort` processes 100,000 lines in ~0.10 s.
- `extract.py` scans the file in a single streaming pass — memory does not grow with file size.

## Known limitations

- The PII patterns are pragmatic heuristics, not strict RFC validators. The `email` regex accepts `user@host.tld` shapes but does not validate that `host.tld` resolves. `phone` accepts three telltale formats (`+<digits>`, `(XXX) XXX-XXXX`, `XXX-XXX-XXXX` / `XXX XXX XXXX`) so it doesn't grab IPs, dates, or credit-card numbers — but it will miss exotic local formats.
- `credit-card` uses the Luhn checksum, but `hex-token` (and similar high-recall patterns) intentionally over-match; review the count before sharing redacted output publicly.
- `diff_text.py --mode html` produces the standard `difflib.HtmlDiff` markup, which embeds inline styles. The file is portable but the styling is not customizable.

## v0.4.0 changes

- Added `scripts/htmlstrip.py`: HTML → plain text / sanitized HTML / structured extract. Built on stdlib `html.parser`. Three modes (text / html / extract), keeps links optionally, drops `<script>/<style>/<noscript>` content entirely in text mode, removes `on*` event-handler attributes in sanitize mode. Extract mode pulls links, images, headings, and full table data as JSON.
- Specifically designed for agents that scrape web pages: one command turns a raw HTML dump into plain text or a structured links/images/tables JSON.
- Same safe-path policy and 0/1/2 exit-code contract as the rest of the toolkit.
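
To make the text mode concrete, here is a minimal sketch built on stdlib `html.parser`. The class name `TextExtractor`, the helper `html_to_text`, and the specific set of block-level tags are assumptions for illustration and are not taken from `htmlstrip.py`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript content."""

    SKIP = {"script", "style", "noscript"}
    # Assumed set of tags that should force a line break in the output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return readable text with one non-empty line per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```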

## v0.3.0 changes

- Added `scripts/replace.py`: sed-like find-and-replace with optional regex, capture-group back-references, multiple find/replace pairs in one pass, JSON `--rules` file, `--dry-run` preview with line:col context, `--max N` cap per rule, `--word` boundaries for literal mode.
- Fixed `extract.py`: `--kind url` was grabbing trailing sentence-punctuation (`.`, `)`, `,`, etc.) as part of the URL. Now strips a single trailing punctuation char so `Visit https://example.com.` correctly extracts `https://example.com` instead of `https://example.com.`.
- Fixed `slug.py`: `--text` mode with input that slugifies to an empty string (e.g. `"!!! @@@"`) now exits 1, matching the existing batch-mode behaviour. Previously it returned 0 silently.
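
The URL fix can be illustrated with a short sketch. `URL_RE`, `TRAILING_PUNCTUATION`, and `extract_urls` are hypothetical names, and the pattern is a simplification; the only behaviour taken from the changelog is that a single trailing punctuation character is stripped.

```python
import re

# Hypothetical, simplified URL pattern: scheme plus any run of non-space,
# non-quote, non-angle-bracket characters.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
TRAILING_PUNCTUATION = ".,;:!?)]}"


def extract_urls(text: str) -> list:
    """Return URLs found in `text`, dropping one trailing punctuation char."""
    urls = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        if url[-1] in TRAILING_PUNCTUATION:
            url = url[:-1]  # strip exactly one character, as the changelog says
        urls.append(url)
    return urls
```

Because only one character is removed, input such as `(https://example.com/a),` would still keep the closing parenthesis in this sketch.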

## v0.2.0 changes

- Added `scripts/template.py`: no-Jinja2 template renderer. Three placeholder syntaxes (mustache `{{x}}`, dollar `${x}`, percent `%(x)s`), pipe filters, fallback defaults, and an optional `--strict` mode for CI. **Hand-rolled regex tokenizer, no `eval`, no `subprocess`.**
- Added `scripts/slug.py`: URL-safe slug generator. Single-string mode (prints to stdout) or batch mode (one slug per input line). Unicode-aware with optional ASCII transliteration via NFKD; `--keep-dots` for filename use; `--dedupe` for batch outputs.
- Added `scripts/markdown.py`: three-mode Markdown processor. `text` strips all markup; `html` renders a minimal HTML approximation (headings, paragraphs, lists, blockquotes, fenced code, links, images, bold/italic/code); `extract` pulls structured items (headings, links, images, code blocks, list items) as JSON / JSONL / TSV.
- All three new scripts share the same safe-path policy and 0 / 1 / 2 exit-code contract as the rest of the toolkit.

## v0.1.0 changes

- First public release of clean-text-toolkit.
- Six scripts: `extract.py`, `normalize.py`, `redact.py`, `lines.py`, `wordcount.py`, `diff_text.py`.
- Shared `_common.py` with `safe_path`, `read_text`, `iter_lines`, and `write_text` helpers (mirrors the design of `clean-csv-toolkit/scripts/_common.py`).
- Bug fixed during development: initial `phone` regex was too greedy and matched IPs / ISO dates / credit-card-with-spaces; tightened to three explicit shapes (international, parenthesized, 3-3-4 dashed) that don't collide with those other patterns. Tested against a mixed-content fixture with 5 valid phones and 3 confusable non-phones.
- Zero third-party dependencies; works on any system that ships Python 3.

## Pairs well with

- [`clean-csv-toolkit`](https://clawhub.ai/gopendrasharma89-tech/clean-csv-toolkit) — same author, same design philosophy (pure stdlib, exit-code contract, safe-path policy), for structured tabular data.
- [`openclaw-prompt-shield`](https://clawhub.ai/gopendrasharma89-tech/openclaw-prompt-shield) — pair `extract.py --kind email,url` with prompt-shield's redaction pipeline to scrub user-supplied text before passing it to an LLM.

## License

MIT


added explicit intent, inputs, decision points, output contract, and outcome signal sections; documented all 11 script operations as numbered procedure steps with I/O; called out edge cases (encoding fall

clean-text-toolkit

Item: Clean Text Toolkit
Rating: 8.4
Author: Implexa

intent

this skill reads unstructured text, extracts structured bits (URLs, emails, phone numbers, IPs, dates, money amounts, hashtags), redacts PII patterns (email, phone, credit card, SSN, JWT, AWS keys, UUIDs), normalizes messy encoding (BOM, CRLF, smart quotes, whitespace, Unicode), and performs text ops (line dedup, sort, shuffle, word frequency, text diff, templating, markdown/HTML parsing). built on Python 3 standard library only. no pip installs, no remote API calls, no LLM dependencies. use this when you need to munge unstructured text locally without leaving your network.

inputs

Required:

Python 3.7+ (any distro) available as python3 command
input text file (any encoding; falls back through utf-8-sig, cp1252, latin-1)

Optional external connections:

none. all operations are local and stateless.

Environment variables:

none required. all config via command-line flags.

Stopwords file (optional):

for wordcount.py --stopwords /path/to/stop.txt, provide a newline-delimited list of words to exclude from frequency counts.

JSON rules file (optional):

for replace.py --rules /path/to/rules.json, provide a JSON file with find-and-replace rules (see procedure step 8 for format).

procedure

1. extract structured items (scripts/extract.py INPUT [--output FILE] [--kind TYPE] [--unique] [--sort] [--with-line] [--json])
   - input: text file (any size, streamed)
   - kinds: url, email, phone, ipv4, ipv6, hashtag, mention, hex-color, money, iso-date
   - flags: --unique dedupes results, --sort alpha-sorts, --with-line prefixes line number, --json outputs JSON array instead of one-per-line
   - output: stdout (default) or file (inferred from --output extension: .txt, .json, .jsonl)
   - exit code: 0 if matches found, 1 if zero matches, 2 if bad args or unsafe path
2. normalize text (scripts/normalize.py INPUT OUTPUT [TRANSFORMS...])
   - input: text file
   - output: file (always UTF-8)
   - transforms apply in order you specify them: --trim (strip leading/trailing), --collapse-spaces, --strip-blank (drop empty lines), --to-unix (LF only), --to-crlf, --dehyphenate (rejoin OCR line-breaks), --unsmart (smart quotes and em-dashes to ASCII), --strip-bom, --strip-zwsp (zero-width spaces), --tabs-to-spaces N, --spaces-to-tabs N, --lower, --upper, --title, --normalize-unicode NFC|NFD|NFKC|NFKD
   - exit code: 0 on success, 2 if bad args or unsafe path
3. redact PII (scripts/redact.py INPUT OUTPUT [--kinds TYPE[,TYPE...]] [--token-template TEMPLATE] [--keep-counts] [--preserve-length])
   - input: text file
   - output: file with PII replaced by tokens (default token: [REDACTED_{kind}_{i}])
   - kinds (default: all): email, phone, ipv4, ipv6, url, credit-card (Luhn-validated), ssn-us, uuid, hex-token (32+ hex chars), aws-access-key (AKIA*), jwt (eyJ* base64url)
   - flags: --keep-counts makes same PII value get same token, --preserve-length pads/truncates token to original length
   - credit-card validation via Luhn algorithm prevents false positives on random digit sequences
   - exit code: 0 if redactions made, 1 if zero redactions, 2 if bad args
4. line operations (scripts/lines.py INPUT --op OP [--output FILE] [--case-insensitive] [--numeric] [--reverse] [-n N] [--seed S])
   - input: text file (streamed for count/head/tail, buffered for dedupe/sort/shuffle)
   - operations: count (line count, streams), dedupe (unique lines, buffers), sort (alpha sort, buffers), shuffle (randomize, buffers), head (first N lines), tail (last N lines)
   - flags: --case-insensitive, --numeric (numeric sort), --reverse, -n N (for head/tail), --seed S (for reproducible shuffle), --keep first|last (for dedupe conflicts)
   - exit code: 0 on success, 1 if zero lines
5. word frequency stats (scripts/wordcount.py INPUT [--top N] [--stopwords FILE] [--ignore-case] [--min-length N] [--regex PATTERN] [--json])
   - input: text file (streamed)
   - outputs: char count, line count, word count, sentence count, and optionally --top N most-frequent words
   - flags: --stopwords FILE excludes words in the file, --ignore-case, --min-length N, --regex PATTERN (default [A-Za-z']+)
   - output: human-readable (default) or --json for machine parsing
   - exit code: 0 on success, 2 if bad args
6. text diff (scripts/diff_text.py FILE1 FILE2 [--mode unified|side|html] [--output FILE] [--ignore-case] [--ignore-whitespace] [--context N])
   - input: two text files
   - modes: unified (default, standard diff format), side (custom two-column layout), html (full HTML report with red/green coloring)
   - flags: --ignore-case, --ignore-whitespace, --context N (lines of context around changes)
   - output: stdout (default) or --output FILE.html for HTML mode
   - exit code: 0 if files identical, 1 if files differ, 2 if bad args
7. templating (scripts/template.py INPUT OUTPUT --data FILE.json [--set key=value ...] [--strict])
   - input: text file with placeholders
   - placeholder syntaxes: mustache {{name}}, dollar ${name}, percent %(name)s
   - filters (pipe syntax): upper, lower, title, strip, capitalize, reverse, len, escape-html, escape-json, urlencode
   - defaults: {{name ?fallback}} uses fallback if name undefined
   - data source: JSON file (--data) or inline --set name=value (sets override file)
   - flags: --strict exits 1 if any placeholder unresolved
   - implementation: hand-rolled regex parser, no eval, no Jinja2
   - exit code: 0 on success, 1 if strict mode and unresolved, 2 if bad args
8. find-and-replace (scripts/replace.py INPUT OUTPUT [--find PATTERN --replace REPLACEMENT ...] [--rules FILE.json] [--mode literal|word|regex] [--dry-run] [--max N])
   - input: text file
   - patterns: regex or literal, supports capture-group back-references (\1, \2, etc.)
   - modes: literal (exact string match), word (literal + word boundaries), regex (default)
   - multiple pairs: --find X --replace Y --find A --replace B in a single pass
   - rules file format: JSON array [{"find": "...", "replace": "...", "mode": "literal|word|regex"}, ...]
   - flags: --dry-run previews matches with line:col and 40-char context, --max N caps replacements per rule
   - output: file with replacements applied
   - exit code: 0 if replacements made, 1 if zero replacements, 2 if bad args or bad regex
9. Markdown processing (scripts/markdown.py INPUT [--output FILE] [--mode text|html|extract] [--link-style anchor|url|both])
   - input: Markdown file
   - modes: text (strip markup to plain text), html (render minimal HTML), extract (pull structured data)
   - extract outputs: links, images, headings, code blocks, list items as JSON/JSONL/TSV
   - link-style (for text mode): anchor (just display text), url (just URL), both (text + URL)
   - exit code: 0 on success, 2 if bad args
10. HTML stripping (scripts/htmlstrip.py INPUT [--output FILE] [--mode text|html|extract])
    - input: HTML file (from scraped pages, etc.)
    - modes: text (plain readable text, drop <script>/<style>, preserve block tag line breaks), html (sanitize by removing script,style,iframe,object,embed,form,input tags + all on* event handlers + inline style=), extract (pull links, images, headings, tables as JSON/JSONL/TSV)
    - implementation: stdlib html.parser, no BeautifulSoup
    - exit code: 0 on success, 2 if bad args
11. slug generation (scripts/slug.py [--text STRING | INPUT] [--output FILE] [--separator CHAR] [--max-length N] [--no-lower] [--ascii] [--keep-dots] [--dedupe])
    - modes: --text STRING (single string, prints to stdout), batch mode (one slug per input line)
    - flags: --separator (default -), --max-length (truncate), --no-lower (keep case), --ascii (transliterate Unicode via NFKD), --keep-dots (for filenames), --dedupe (batch mode: drop duplicate slugs)
    - output: stdout or file
    - exit code: 0 on success, 1 if slug result is empty string, 2 if bad args
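
the templating step above (mustache placeholders, pipe filters, `?fallback` defaults, strict mode) can be sketched in a few lines of regex-driven substitution. this is an illustration under assumptions, not the script itself: `PLACEHOLDER_RE`, `FILTERS`, and `render` are hypothetical names, the exact grammar for combining a filter with a default is guessed, only a handful of the listed filters are included, and strict mode raises `KeyError` here where the script is documented to exit 1.

```python
import re
from html import escape
from urllib.parse import quote

# Subset of the documented filters, mapped to stdlib callables (assumption).
FILTERS = {
    "upper": str.upper,
    "lower": str.lower,
    "title": str.title,
    "strip": str.strip,
    "escape-html": escape,
    "urlencode": quote,
}

# {{ name | filter | filter ?default }} ; this combined grammar is a guess.
PLACEHOLDER_RE = re.compile(
    r"\{\{\s*(\w+)((?:\s*\|\s*[\w-]+)*)\s*(?:\?([^}]*))?\}\}"
)


def render(template: str, data: dict, strict: bool = False) -> str:
    """Substitute mustache-style placeholders without eval or Jinja2."""
    unresolved = []

    def substitute(match):
        name, filter_chain, default = match.groups()
        if name in data:
            value = str(data[name])
        elif default is not None:
            value = default.strip()
        else:
            unresolved.append(name)
            return match.group(0)  # leave the placeholder as-is
        for part in filter_chain.split("|"):
            part = part.strip()
            if part:
                value = FILTERS[part](value)
        return value

    rendered = PLACEHOLDER_RE.sub(substitute, template)
    if strict and unresolved:
        raise KeyError("unresolved placeholders: " + ", ".join(unresolved))
    return rendered
```

because filters are looked up in a fixed dictionary, a template can only call functions that were put there deliberately, which is the point of avoiding `eval`.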

decision points

if using redact.py with credit-card patterns: the Luhn checksum validation is applied automatically. if hex-token produces too many false positives, pass an explicit list such as --kinds email,phone,ssn-us instead of relying on the all-kinds default.

if using extract.py and zero matches found: exit code is 1, not 0. this allows chaining commands in shell with && (only proceed if matches exist).

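if the input file has mixed encoding (e.g. UTF-8 with pockets of latin-1): the read_text helper tries utf-8-sig first, then cp1252, then latin-1. latin-1 maps every byte value to a character, so the final fallback never raises; the cost is that bytes written in a different encoding may decode to the wrong characters, so spot-check the output.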
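
a fallback chain like the one described for read_text can be sketched as below. `decode_with_fallback` is a hypothetical name and the order simply follows the sentence above; the real helper may differ in detail.

```python
def decode_with_fallback(raw: bytes):
    """Return (text, encoding_used) using the first encoding that decodes cleanly.

    utf-8-sig also accepts UTF-8 without a BOM, so plain UTF-8 input succeeds
    on the first attempt. latin-1 accepts any byte sequence, so the loop
    always returns.
    """
    for encoding in ("utf-8-sig", "cp1252", "latin-1"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise AssertionError("unreachable: latin-1 decodes every byte")
```

returning the encoding that was actually used makes it easy to log when a file fell through to a legacy codec.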

if using normalize.py with conflicting flags: --to-unix and --to-crlf can both be specified; whichever comes last wins. transforms apply in order, so --lower --upper results in uppercase (the last one wins).

if using lines.py with --op dedupe: if a line appears 1000 times, only one occurrence is kept. use --keep first (default) to keep the first occurrence, or --keep last to keep the last.
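
the difference between the two --keep settings is easiest to see in code. the sketch below uses a hypothetical `dedupe` function; in particular, emitting survivors in the order of the kept occurrence is an assumption about how the script orders its output.

```python
def dedupe(lines, keep="first", case_insensitive=False):
    """Drop duplicate lines, keeping either the first or last occurrence."""
    lines = list(lines)
    key_of = str.casefold if case_insensitive else (lambda s: s)

    if keep == "first":
        seen = set()
        result = []
        for line in lines:
            key = key_of(line)
            if key not in seen:
                seen.add(key)
                result.append(line)
        return result

    if keep == "last":
        # Remember the final index at which each key appears.
        last_index = {key_of(line): i for i, line in enumerate(lines)}
        return [line for i, line in enumerate(lines) if last_index[key_of(line)] == i]

    raise ValueError("keep must be 'first' or 'last'")
```

both branches hold one key per distinct line, which is why dedupe is listed among the buffered operations.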

if using wordcount.py on a very large file (1 GB+): the scan is line-by-line, but the frequency dict grows with the number of distinct words rather than with file size. on a typical laptop with 8 GB RAM, tens of millions of words should be manageable. if memory runs out, trim the input with head or split the file first.
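
a line-by-line frequency count along these lines shows where the memory goes. `top_words` is a hypothetical helper; the word pattern is the documented default `[A-Za-z']+`, and the remaining parameters mirror the documented flags only loosely.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")  # documented default for --regex


def top_words(lines, top=10, ignore_case=False, stopwords=frozenset(), min_length=1):
    """Count words from an iterable of lines and return the `top` most common.

    Only the Counter is held in memory, one entry per distinct word.
    """
    counts = Counter()
    for line in lines:
        for word in WORD_RE.findall(line):
            if ignore_case:
                word = word.lower()
            if len(word) < min_length or word in stopwords:
                continue
            counts[word] += 1
    return counts.most_common(top)
```

passing an open file object as `lines` keeps the scan streaming, since iterating a file yields one line at a time.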

if using replace.py --dry-run: the script exits without writing the output file. use this to preview matches before committing.

if using template.py in strict mode and a placeholder is unresolved: the script exits 1 and does not write the output file. use --set key=value to provide the missing values, or remove the --strict flag.

if using htmlstrip.py extract mode on a table: the output is flat JSON rows. for complex multi-level tables, manually post-process the JSON.

if using slug.py --ascii with emojis or symbols: characters with no ASCII transliteration are stripped entirely. for example, --text "🎉 Party!" --ascii produces party.
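
NFKD-based transliteration works by decomposing accented letters into a base letter plus combining marks and then discarding everything that is not ASCII. a minimal sketch, with `slugify` as a hypothetical function that assumes a single-character separator:

```python
import re
import unicodedata


def slugify(text: str, separator: str = "-", ascii_only: bool = False, lower: bool = True) -> str:
    """Turn `text` into a slug; assumes `separator` is a single character."""
    if ascii_only:
        # Decompose (e.g. "é" -> "e" + combining accent), then drop non-ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        text = decomposed.encode("ascii", "ignore").decode("ascii")
    if lower:
        text = text.lower()
    # Collapse every run of non-word characters into one separator.
    text = re.sub(r"\W+", separator, text)
    return text.strip(separator)
```

an emoji has no decomposition into ASCII, so it disappears at the encode step, while accented Latin letters survive as their base letters.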

output contract

extract.py:

stdout (default) or --output FILE
format: one match per line (newline-delimited) or --json for JSON array
file extension inferred: .txt, .json, .jsonl

normalize.py:

output file (required argument)
always UTF-8, never BOM
line endings depend on final transform (--to-unix or --to-crlf)

redact.py:

output file (required argument)
UTF-8, same line structure as input
placeholders follow template: default [REDACTED_{kind}_{i}] or custom via --token-template

lines.py:

stdout (default) or --output FILE
one line per output line, trailing newline preserved
for --op count, single integer

wordcount.py:

stdout (human-readable) or --json for JSON object
fields: chars, lines, words, sentences, optionally top_words (list of [word, count] pairs)

diff_text.py:

unified: standard diff format to stdout or --output FILE
side: custom two-column text layout
html: full HTML file with inline styles, written to --output FILE.html

template.py:

output file (required argument)
UTF-8
all resolved placeholders replaced, unresolved ones left as-is (unless --strict)

replace.py:

output file (required argument)
UTF-8
input with replacements applied
dry-run: stdout preview with line:col context, no file written

markdown.py / htmlstrip.py:

text mode: plain text to stdout or file
html mode: HTML to stdout or file
extract mode: JSON/JSONL/TSV to stdout or file (format inferred from --output extension)

slug.py:

single-string mode: result to stdout
batch mode: slugs to stdout (one per line) or file
lowercase unless --no-lower is set; words joined by the separator (default -); ASCII-only when --ascii is set

outcome signal

exit code 0: operation succeeded. for extract/redact/replace, 0 means at least one match, redaction, or replacement was made. for diff, 0 means the files are identical. for the remaining scripts, 0 means the operation completed.
exit code 1: zero matches / zero redactions / zero replacements found, or files differ. this is not an error; it signals the condition to the calling script so it can branch (e.g. extract.py && echo "found" || echo "nothing").
exit code 2: bad arguments, unsafe file path, missing input file, unknown kind/mode, bad regex, or unsupported output extension. this is a true error.
file written: for operations with --output, check that the file exists and has content. for --dry-run, no output file is written.
stdout message: most scripts are silent on success. errors print to stderr (e.g. "bad regex: ..."). wordcount prints stats to stdout by default.
line count: lines.py --op count prints a single integer to stdout.
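
a caller written in Python can branch on the three exit codes in the same way a shell pipeline does. the sketch below is a hypothetical wrapper (`classify` and `run_script` are not part of the toolkit); it only relies on the 0 / 1 / 2 contract stated above.

```python
import subprocess
import sys

# Labels are this sketch's own; the codes come from the documented contract.
MEANINGS = {0: "ok", 1: "empty", 2: "error"}


def classify(returncode: int) -> str:
    """Map an exit code to a label; unknown codes are treated as errors."""
    return MEANINGS.get(returncode, "error")


def run_script(script_args, python=sys.executable):
    """Run one toolkit script and return (label, stdout).

    Example call (paths are illustrative):
        run_script(["scripts/extract.py", "app.log", "--kind", "email", "--unique"])
    """
    completed = subprocess.run(
        [python, *script_args], capture_output=True, text=True
    )
    return classify(completed.returncode), completed.stdout
```

separating "empty" from "error" lets a pipeline continue past a file with no matches while still stopping on a bad argument or unsafe path.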

implementation notes

safety: all file paths validated against strict regex that rejects shell metacharacters (;, |, &, >, <, $, backtick). no subprocess calls, no shell invocation, no eval.
performance: streaming for extract, count, head, tail, wordcount (memory does not grow with file size). buffered for sort, dedupe, shuffle, diff (lines are typically small; 1M lines fits in RAM on a laptop).
determinism: --seed N for shuffle, extract/wordcount always emit in same order for given input.
patterns are pragmatic heuristics: email regex accepts user@host.tld but does not validate DNS. phone accepts +1234567890, (123) 456-7890, 123-456-7890, 123 456 7890 but will miss exotic local formats. credit-card validated via Luhn to suppress false positives on random digit sequences. hex-token intentionally over-matches (32+ hex chars); review counts before sharing redacted output publicly.
rate limits, API calls, auth: none. all operations are local and stateless.

credits: built by gopendrasharma89-tech. pairs well with clean-csv-toolkit (same author, same design philosophy).
