clawhub

Redacta

Item: Redacta
Rating: 8.2
Author: Implexa

Pseudonymises medical and clinical documents by replacing patient identifiers with labelled tokens (e.g. [PATIENT_NAME_1], [NHS_NUMBER_1], [DATE_OF_BIRTH_1])...

SkillRank score ↗

8.2 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-06-12

redacta pseudonymises medical documents by replacing patient identifiers with labelled tokens, combining deterministic pattern matching (nhs numbers, dates, postcodes) with contextual reasoning for names and addresses. returns redacted text plus token map for re-identification.

structure: 9.0
trigger phrases: 8.0
procedure: 9.0
edge cases: 7.0
documentation: 8.0

original SKILL.md from clawhub

---
name: redacta
version: 1.2.0
description: Pseudonymises medical and clinical documents by replacing patient identifiers with labelled tokens (e.g. [PATIENT_NAME_1], [NHS_NUMBER_1], [DATE_OF_BIRTH_1]) so the text can be safely processed by AI or shared, with clinical meaning intact. Combines a deterministic pattern layer (NHS numbers with Modulus-11 validation, UK National Insurance numbers, dates of birth, UK postcodes, phone numbers, emails, hospital/MRN numbers) with contextual reasoning for patient names, postal addresses and identifying ages, then returns the redacted document plus a redaction report. Use when the user wants to redact, de-identify, anonymise or pseudonymise a medical letter, clinical note, discharge summary, referral or patient record, or before pasting clinical text into another AI tool. Can also re-identify (reverse the redaction) by restoring original values from a token map, and offers a stricter HIPAA Safe Harbor mode for US de-identification (all dates, ages, and the remaining HIPAA identifiers).
license: MIT-0
---

# Redacta

Pseudonymise medical text before it is processed by AI or shared: replace patient
identifiers with labelled tokens (`[PATIENT_NAME_1]`, `[NHS_NUMBER_1]`,
`[DATE_OF_BIRTH_1]`, ...) while leaving the clinical meaning untouched. Return the
redacted document plus a redaction report.

Redacta works in two layers:

- **Layer 1 — patterns (deterministic).** Fixed-format identifiers, matched by a
  bundled script: NHS numbers (Modulus-11 validated), UK National Insurance
  numbers, dates of birth, UK postcodes, phone numbers, emails, hospital/MRN
  numbers. (US SSN and ZIP codes are also handled.)
- **Layer 2 — reasoning (your judgement).** Identifiers that do not follow a fixed
  pattern: patient names, postal addresses and identifying ages. This is where you
  read context and tell a patient apart from the clinician treating them.
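
The Modulus-11 validation named in Layer 1 can be sketched as below. This is a minimal illustration of the standard NHS number checksum, not the code in the bundled script; the function name `is_valid_nhs_number` is hypothetical.

```python
import re


def is_valid_nhs_number(candidate: str) -> bool:
    """Return True if candidate is a 10-digit number that passes Modulus-11."""
    digits = re.sub(r"[\s-]", "", candidate)  # tolerate "943 476 5919" style spacing
    if not re.fullmatch(r"\d{10}", digits):
        return False
    # Weight the first nine digits 10, 9, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:  # a remainder of 0 maps to check digit 0
        check = 0
    if check == 10:  # a check digit of 10 never occurs in a valid number
        return False
    return check == int(digits[9])
```

Validating the checksum before tokenising keeps arbitrary 10-digit values (lab accession numbers, for instance) from being labelled as NHS numbers, although such values may still need redacting under another token type.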

## Workflow

Copy this checklist and tick items off as you go:

```
Redaction progress:
- [ ] 1. Save the source text to a file
- [ ] 2. Run the pattern layer (scripts/redact_structured.py)
- [ ] 3. Apply the reasoning layer (names, addresses, ages)
- [ ] 4. Assemble the pseudonymised document (formatting preserved)
- [ ] 5. Self-check the output for residual identifiers
- [ ] 6. Write the redaction report
- [ ] 7. Add the limits note
```

### 1–2. Pattern layer

Write the user's text **verbatim** to a temp file, then run the script (execute
it — do not read it into context):

```bash
python3 scripts/redact_structured.py /tmp/redacta_input.txt
```

It prints JSON with `redacted_text`, `report` (count of distinct values per type)
and `token_map` (token → original value, for review and re-identification). Carry
`redacted_text` forward into Layer 2. The script uses the Python 3 standard
library only and makes no network calls.
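
Steps 1 and 2 could be driven from Python roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the script is assumed to print its JSON to standard output, and the shape of `report` in the sample is illustrative only.

```python
import json
import subprocess


def save_verbatim(text: str, path: str) -> None:
    """Write text exactly as given; newline="" disables line-ending translation."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)


def parse_pattern_output(raw: str) -> dict:
    """Parse the script's JSON and confirm the three documented keys are present."""
    result = json.loads(raw)
    missing = {"redacted_text", "report", "token_map"} - set(result)
    if missing:
        raise ValueError(f"pattern layer output is missing keys: {sorted(missing)}")
    return result


def run_pattern_layer(input_path: str) -> dict:
    """Execute the bundled script (assumed to print JSON on stdout) and parse it."""
    completed = subprocess.run(
        ["python3", "scripts/redact_structured.py", input_path],
        capture_output=True, text=True, check=True,
    )
    return parse_pattern_output(completed.stdout)


# Illustrative output only; real output comes from run_pattern_layer().
sample = json.dumps({
    "redacted_text": "NHS no: [NHS_NUMBER_1]",
    "report": {"NHS_NUMBER": 1},
    "token_map": {"[NHS_NUMBER_1]": "943 476 5919"},
})
layer1 = parse_pattern_output(sample)
```

Failing loudly on a missing key is a deliberate choice here: carrying an incomplete result into Layer 2 would silently lose the token map needed for re-identification.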

### 3. Reasoning layer

Read `redacted_text` and pseudonymise what the patterns cannot:

- **Patient names** → `[PATIENT_NAME_n]`. Redact the patient and any relatives or
  carers named. **Keep** the names of treating clinicians, GPs, and institutions
  (hospital, ward, practice) by default — they carry meaning and are not the data
  subject. If the user asks for full de-identification, also redact those as
  `[CLINICIAN_NAME_n]` and `[ORG_NAME_n]`.
- **Postal addresses** → `[ADDRESS_n]`. Any postcode inside the address is already
  a token from Layer 1.
- **Identifying ages** → `[AGE_n]`. Redact specific ages ("a 73-year-old woman").
  Leave non-identifying bands ("elderly", "in her 70s") unless the user wants them
  removed.
- **Same value → same token.** Reuse a token for every occurrence of the same
  value; give different values new numbers. Continue numbering alongside the tokens
  already in `token_map`.
- **When unsure, redact.** Prefer removing a possible identifier over leaving it.
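
The "same value, same token" rule can be pictured with a small allocator. This is a sketch, not part of the skill: the class name `TokenAllocator` is hypothetical, and numbering per token type (continuing from whatever Layer 1 already used for that type) is an assumed reading of "continue numbering alongside the tokens already in `token_map`".

```python
import re


class TokenAllocator:
    """Hand out [TYPE_n] tokens, reusing the token for a value already seen."""

    def __init__(self, token_map=None):
        self.token_map = dict(token_map or {})  # token -> original value
        self._by_value = {}                     # (type, value) -> token
        self._next = {}                         # type -> next free number
        for token, value in self.token_map.items():
            match = re.fullmatch(r"\[([A-Z_]+)_(\d+)\]", token)
            if match:
                kind, number = match.group(1), int(match.group(2))
                self._by_value[(kind, value)] = token
                self._next[kind] = max(self._next.get(kind, 1), number + 1)

    def token_for(self, kind: str, value: str) -> str:
        """Return the existing token for (kind, value) or allocate the next one."""
        key = (kind, value)
        if key not in self._by_value:
            number = self._next.get(kind, 1)
            token = f"[{kind}_{number}]"
            self._next[kind] = number + 1
            self._by_value[key] = token
            self.token_map[token] = value
        return self._by_value[key]


# Seed with the Layer 1 map, then add Layer 2 findings.
alloc = TokenAllocator({"[NHS_NUMBER_1]": "943 476 5919"})
first = alloc.token_for("PATIENT_NAME", "Jane Example")
again = alloc.token_for("PATIENT_NAME", "Jane Example")
other = alloc.token_for("PATIENT_NAME", "Joan Example")
```

Here `first` and `again` are the same token, while `other` receives the next number for that type.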

If the user asked for **HIPAA Safe Harbor** de-identification, also apply the
stricter rules in [Safe Harbor mode](#safe-harbor-mode-us-hipaa) at this step —
most importantly, redact *all* dates and ages, not just the date of birth.

See [reference.md](reference.md) for disambiguation heuristics and the full token
vocabulary.

### 4. Assemble

Reproduce the document exactly — same line breaks, headings and layout — changing
only the identifiers. Never alter clinical content (findings, medications, doses,
results, dates of appointments or procedures).
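
A mechanical check can back up the "changing only the identifiers" rule. The sketch below is a heuristic under assumptions (the function name `layout_differences` is hypothetical): it compares line counts and blank-line positions, and flags any line that changed without gaining a token.

```python
def layout_differences(original: str, redacted: str) -> list[str]:
    """List layout problems between the source and the pseudonymised document."""
    problems = []
    orig_lines = original.split("\n")
    red_lines = redacted.split("\n")
    if len(orig_lines) != len(red_lines):
        problems.append(f"line count changed: {len(orig_lines)} -> {len(red_lines)}")
        return problems
    for number, (before, after) in enumerate(zip(orig_lines, red_lines), start=1):
        if (before.strip() == "") != (after.strip() == ""):
            problems.append(f"line {number}: blank/non-blank status changed")
        elif "[" not in after and before != after:
            # A line with no token should be byte-for-byte identical to the source.
            problems.append(f"line {number}: text changed but no token present")
    return problems


source = "Dear Dr Smith,\n\nJane Example takes aspirin 75 mg.\n"
good = "Dear Dr Smith,\n\n[PATIENT_NAME_1] takes aspirin 75 mg.\n"
collapsed = "Dear Dr Smith,\n[PATIENT_NAME_1] takes aspirin 75 mg.\n"
altered = "Dear Dr Smith,\n\nJane Example takes aspirin 150 mg.\n"
```

The check cannot tell whether clinical wording inside a tokenised line was altered, so it complements rather than replaces reading the output.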

### 5. Self-check

Before finalising, re-read the assembled document as if you were an auditor and
look for anything that still identifies a person:

- Numbers that look like an NHS number, phone, MRN, account or reference but were
  not tokenised.
- A name, relative, carer or place name you passed over — especially mid-sentence
  ("…lives with her sister Joan…", "…transferred from St Elsewhere…").
- A specific age, postcode fragment, email, URL or date of birth.

If you find anything, tokenise it and update the report. A clean self-check is not
a guarantee — it is a second pass, not a proof. Treat it as the moment to catch
what Layers 1 and 2 missed.
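
Part of the self-check can be automated as a rough second net. The patterns below are illustrative assumptions, not the ones used by the bundled script, and they will not find names or addresses; they only surface obvious leftovers for review.

```python
import re

TOKEN = re.compile(r"\[[A-Z_]+_\d+\]")
SUSPECT_PATTERNS = {
    # Nine or more digits, optionally separated by single spaces or hyphens.
    "long digit run": re.compile(r"\b\d(?:[ -]?\d){8,}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "specific age": re.compile(r"\b\d{1,3}[- ]year[- ]old\b", re.IGNORECASE),
}


def residual_candidates(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs for possible untokenised identifiers."""
    stripped = TOKEN.sub(" ", text)  # ignore anything already tokenised
    found = []
    for label, pattern in SUSPECT_PATTERNS.items():
        for match in pattern.finditer(stripped):
            found.append((label, match.group(0)))
    return found


leaky = "[PATIENT_NAME_1], a 73-year-old woman, NHS no 943 476 5919, email jo@example.org."
clean = "[PATIENT_NAME_1] takes aspirin 75 mg. NHS no [NHS_NUMBER_1]."
```

An empty result from this scan carries the same caveat as the manual pass: it is a second look, not a proof.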

### 6. Report

End with a short, human-readable report, for example:

> **Redaction report:** 5 identifiers pseudonymised — 1 patient name, 1 date of
> birth, 1 age, 1 NHS number, 1 address. Clinical content preserved.

If the user may need to reverse the process, also offer the token map as a table
(`token | original value`). Treat that table as the key that undoes the
pseudonymisation — include it only where the user wants it, and never alongside the
redacted text if the point was to keep identifiers separate.
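
The counts in the report can be derived from the token map rather than tallied by hand. A minimal sketch, assuming tokens follow the `[TYPE_n]` shape shown above; the function name `build_report`, the `LABELS` table and the exact wording of the output are hypothetical.

```python
import re
from collections import Counter

LABELS = {"NHS_NUMBER": "NHS number"}  # types whose label should not be lower-cased


def build_report(token_map: dict) -> str:
    """Summarise a token map as a one-line report, counting tokens per type."""
    counts = Counter()
    for token in token_map:
        match = re.fullmatch(r"\[([A-Z_]+)_\d+\]", token)
        if match:
            counts[match.group(1)] += 1
    if not counts:
        return "Redaction report: 0 identifiers found."
    parts = ", ".join(
        f"{LABELS.get(kind, kind.replace('_', ' ').lower())}: {n}"
        for kind, n in sorted(counts.items())
    )
    total = sum(counts.values())
    return (
        f"Redaction report: {total} identifiers pseudonymised ({parts}). "
        "Clinical content preserved."
    )


example_map = {
    "[PATIENT_NAME_1]": "Jane Example",
    "[DATE_OF_BIRTH_1]": "01/02/1950",
    "[AGE_1]": "73",
    "[NHS_NUMBER_1]": "943 476 5919",
    "[ADDRESS_1]": "1 Example Street",
}
```

Because each distinct value has exactly one token, counting tokens gives the number of distinct identifiers, not the number of occurrences in the text.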

### 7. Limits note

Always include this note:

> Redacta is a strong first line of defence, not a guarantee. It will not catch
> every possible identifier and is not a substitute for formal data-protection
> processes. Review the report before sharing the text.

## Re-identification (reversing the redaction)

When the user has run the redacted text through another tool and wants the real
values put back, use the token map with the bundled script (execute it — do not
read it into context):

```bash
python3 scripts/reinstate.py redacted_or_ai_output.txt --map token_map.json
```

`token_map.json` may be either a bare map (`{"[NHS_NUMBER_1]": "943 476 5919"}`)
or the full JSON object printed by `redact_structured.py` — both work. The script
swaps every token back to its original value and prints `{text, changed}`; add
`--text-only` for just the restored text. It is standard-library only and makes no
network calls.

This completes the round trip: **redact → process/share → re-identify**, with the
real identifiers only ever present locally. The token map is the key that reverses
the pseudonymisation — handle and store it with the same care as the original data.
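
The token swap itself is simple enough to sketch. The code below is an illustration of the idea, not the contents of `scripts/reinstate.py`: the function names are hypothetical, and treating `changed` as a count of substitutions is an assumption, since the text above does not say what that field holds.

```python
import json
import re


def load_token_map(raw_json: str) -> dict:
    """Accept a bare token map or the full object printed by the pattern layer."""
    data = json.loads(raw_json)
    return data["token_map"] if "token_map" in data else data


def reinstate(text: str, token_map: dict) -> dict:
    """Swap known tokens back to their original values."""
    changed = 0

    def restore(match):
        nonlocal changed
        token = match.group(0)
        if token in token_map:
            changed += 1
            return token_map[token]
        return token  # unknown tokens are left in place for review

    restored = re.sub(r"\[[A-Z_]+_\d+\]", restore, text)
    return {"text": restored, "changed": changed}


bare = '{"[NHS_NUMBER_1]": "943 476 5919"}'
full = '{"redacted_text": "x", "report": {}, "token_map": {"[NHS_NUMBER_1]": "943 476 5919"}}'
result = reinstate(
    "NHS no [NHS_NUMBER_1], seen by [CLINICIAN_NAME_1]", load_token_map(bare)
)
```

Matching whole bracketed tokens avoids a partial-replacement problem: `[NHS_NUMBER_1]` can never be mistaken for the start of `[NHS_NUMBER_10]`.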

## Safe Harbor mode (US HIPAA)

If the user asks for **HIPAA Safe Harbor** de-identification — or "US
de-identification", "Safe Harbor", or "remove all 18 HIPAA identifiers" — apply a
stricter pass on top of the normal workflow:

- **All dates, not just the date of birth.** Remove every date that relates to the
  individual — birth, admission, discharge, appointment, procedure, sample,
  death — as `[DATE_n]` (or `[DATE_OF_BIRTH_n]` for the DOB). This **overrides**
  the usual rule that keeps appointment and clinical dates. You may keep the bare
  year if the user asks, since Safe Harbor permits the year alone.
- **All specific ages** → `[AGE_n]`. Ages of 90 or older must be removed and
  aggregated (treat "92" and "almost 90" alike); do not leave a redactable age.
- **The remaining HIPAA identifier types** beyond what the pattern layer catches:
  fax numbers `[FAX_n]`, certificate/licence numbers `[LICENSE_n]`, device
  identifiers and serial numbers `[DEVICE_ID_n]`, vehicle identifiers / VINs
  `[VIN_n]`, health-plan beneficiary numbers `[HEALTH_PLAN_NUMBER_n]`, and any
  other unique identifying number, characteristic or code.
- Biometric identifiers and full-face photographs are out of scope for a text
  tool — flag them if referenced, but they cannot be removed from text alone.

Everything else (names, relatives, addresses, NHS/NI/SSN/MRN, emails, phones,
URLs, IP addresses, postcodes/ZIP) is already handled by the standard layers. Note
in the report that **Safe Harbor mode** was applied, and keep the limits note: the
Safe Harbor method still assumes no actual knowledge that the residual information
could re-identify the individual.
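
To show how Safe Harbor mode overrides the default date rule, here is a sketch that tokenises every numeric date rather than only the date of birth. It is an assumption-laden illustration: the function name `tokenise_all_dates` is hypothetical, only `d/m/y`-style numeric dates are matched, and dates written with month names would still need the reasoning layer.

```python
import re

# Numeric dates such as 03/02/2024, 3-2-24 or 03.02.2024.
DATE = re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b")


def tokenise_all_dates(text: str, start: int = 1) -> tuple[str, dict]:
    """Replace every numeric date with [DATE_n], reusing tokens for repeats."""
    seen = {}  # original date -> token

    def replace(match):
        value = match.group(0)
        if value not in seen:
            seen[value] = f"[DATE_{start + len(seen)}]"
        return seen[value]

    redacted = DATE.sub(replace, text)
    return redacted, {token: value for value, token in seen.items()}


note = "Admitted 03/02/2024, discharged 07/02/2024. Review of 03/02/2024 bloods. BP 120/80."
redacted_note, date_map = tokenise_all_dates(note)
```

Two-part values such as the blood pressure reading are left alone because the pattern requires three numeric groups.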

## Notes

- All processing happens in this session: the script makes no network calls and
  sends your text to no third-party service. Your text is of course visible to the
  assistant running this skill — the purpose of Redacta is to produce output that
  is safe to pass on to *other* tools, services or storage.
- Redacta is UK-focused (NHS, NI, UK postcodes) and also handles emails,
  international phone numbers, and US SSN/ZIP codes.
- The Modulus-11 algorithm, the date-of-birth vs clinical-date rule, NI prefix
  rules, the full token list and known limitations are documented in
  [reference.md](reference.md).

broke monolithic workflow into 7 explicit numbered procedure steps with clear inputs and outputs, separated decision logic (re-identification, hipaa safe harbor, scope defaults, ambiguity handling, edge cases like empty documents) into a dedicated section, added inputs list with environment/tooling details, formalized output contract with three distinct deliverables, and defined outcome signal as clinician-readable output plus verified token map round-trip.

Redacta

intent

pseudonymise medical text before it hits ai tools or gets shared around. redacta replaces patient identifiers (names, nhs numbers, dates of birth, addresses, phone numbers, etc.) with labelled tokens like [PATIENT_NAME_1] and [NHS_NUMBER_1] so the document stays clinically readable but strips out what could identify a person. use this when you need to redact a medical letter, discharge summary, clinical note, referral or patient record, or before pasting clinical text into another ai system. the skill works in two layers: a deterministic pattern matcher (nhs numbers, postcodes, emails, phone numbers) and a reasoning layer (patient names, addresses, ages) where you read context and make judgment calls. optional strict hipaa safe harbor mode handles us de-identification with all dates and ages removed.

inputs

- source document: raw medical or clinical text (letter, note, summary, referral). copy and paste the full text as is, with all formatting and line breaks intact.
- redaction scope: specify what you want redacted. default is standard uk pseudonymisation (patient identifiers, relatives, carers, addresses). optionally request:
  - full de-identification (also redact clinician names and institution names)
  - hipaa safe harbor mode (strict us de-identification per 45 cfr 164.514(b))
- re-identification intent: if you have already run redacted text through another tool (ai, analysis, etc.) and want to restore original values, provide the redacted text and the original token_map.json from the first run.
- local environment: python3 (standard library only, no external packages required). bundled scripts: scripts/redact_structured.py and scripts/reinstate.py. no network calls, no external api keys or oauth scopes needed.

procedure

1. save source text to temp file. write the user's document verbatim to a temporary file (e.g. /tmp/redacta_input.txt) with all original formatting, line breaks, headings and layout preserved. do not strip whitespace or normalize line endings.

   input: raw clinical text from user. output: file on disk at /tmp/redacta_input.txt.

2. run the pattern layer script. execute the deterministic pattern matcher (do not read the script into context; execute it):
   ```
   python3 scripts/redact_structured.py /tmp/redacta_input.txt
   ```
   the script outputs a json object with three keys: redacted_text (the document after the pattern layer), report (dict of token types and counts), and token_map (dict mapping tokens to original values). the script validates nhs numbers using modulus-11 and also matches uk national insurance numbers, dates of birth (dd/mm/yyyy or similar), uk postcodes, us zip codes, international phone numbers, emails, hospital/mrn numbers, and us ssn.

   input: /tmp/redacta_input.txt. output: json object with redacted_text, report, token_map.

3. apply the reasoning layer. read the redacted_text output from step 2 and pseudonymise identifiers that do not follow fixed patterns:
   - patient names and relatives/carers: replace with [PATIENT_NAME_n]. do not redact treating clinicians, gps, or institution names (hospital, ward, practice) by default, as they carry clinical meaning and are not the data subject. if the user requested full de-identification, also redact clinician names as [CLINICIAN_NAME_n] and institution names as [ORG_NAME_n].
   - postal addresses: replace with [ADDRESS_n]. any postcode within the address is already tokenised by layer 1, so only the street, town and county parts remain to be redacted.
   - identifying ages: replace specific ages ("a 73-year-old woman", "age 51") with [AGE_n]. leave non-identifying age bands ("elderly", "in her 70s", "middle-aged") unless the user requested their removal.
   - token reuse: if the same value appears multiple times (e.g. the patient's name appears 5 times), use the same token each time. assign new numbers only to new distinct values. continue numbering from the highest token already in the token_map from layer 1.
   - ambiguity resolution: when unsure whether something is an identifier, redact it. prefer false-positive removal over leaving a possible identifier.

   input: redacted_text and token_map from step 2; the user's original document (for reference). output: redacted_text with reasoning-layer replacements applied; updated token_map with new entries.

4. assemble the final pseudonymised document. reproduce the entire document exactly as it was (same line breaks, headings, paragraph structure, formatting), changing only the identifiers to tokens. do not alter clinical content: findings, medications, doses, results, appointment dates, procedure dates, lab values or any other clinical meaning.

   input: reasoning-layer redacted_text; original document (formatting reference). output: final pseudonymised document, formatted identically to the original.

5. self-check for residual identifiers. re-read the assembled document as an auditor looking for identifiers that layers 1 and 2 missed:
   - numbers that look like an nhs number, phone, mrn, account or reference number but were not tokenised.
   - a name, relative name, carer, or place name passed over (e.g. "lives with her sister Joan", "transferred from St Elsewhere Hospital").
   - a specific age, postcode fragment, email, url, or date of birth not yet replaced.
   - any other combination that could narrow down identity (job title + location, rare condition + date).

   if you find anything, tokenise it and update the token_map. a clean self-check is a second pass, not a proof; it catches what the pattern and reasoning layers missed.

   input: final pseudonymised document. output: corrected pseudonymised document (if residual identifiers found); updated token_map.

6. write the redaction report. produce a short, human-readable summary of what was redacted, e.g.:

   Redaction report: 7 identifiers pseudonymised (2 patient names, 1 date of birth, 2 ages, 1 nhs number, 1 address). clinical content preserved.

   if the user may need to reverse the redaction (re-identify), also offer the token_map as a table with columns token | original value. treat the token map as a key that undoes pseudonymisation: include it only if the user explicitly wants it, and never display it alongside the redacted text if the intent was to keep identifiers separate.

   if hipaa safe harbor mode was applied (see decision points), note that in the report: "hipaa safe harbor mode applied."

   input: token_map; tally of redacted identifiers by type. output: human-readable report; optionally, token map as a table.

7. add the limits note. end your response with this disclaimer:

   Redacta is a strong first line of defence, not a guarantee. it will not catch every possible identifier and is not a substitute for formal data-protection processes. review the report before sharing the text.

   input: (none). output: limits note appended to response.

decision points

if user requests re-identification (reverse the redaction): do not redo layers 1 and 2. instead, execute the reinstate script (do not read it into context):
```
python3 scripts/reinstate.py <redacted_text_file> --map <token_map.json>
```
the script accepts token_map.json in either bare-map form ({"[NHS_NUMBER_1]": "943 476 5919"}) or as the full json object from redact_structured.py. it swaps every token back to its original value. use the --text-only flag to print only the restored text. then proceed to step 6 (report) and step 7 (limits note) to document the round trip.
if user requests hipaa safe harbor mode (strict us de-identification per 45 cfr 164.514(b)): apply stricter rules on top of layers 1 and 2 during step 3 (reasoning layer):
- redact all dates that relate to the individual (birth, admission, discharge, appointment, procedure, sample, death, lab date) as [DATE_n] (or [DATE_OF_BIRTH_n] for dob). this overrides the normal rule that preserves appointment and procedure dates. safe harbor permits the bare year, so if the user asks for it, keep the year and tokenise only the day and month; the exception is a year that would itself reveal an age over 89, which must also be removed.
- redact all specific ages (including ages 90 or older; treat "92", "ninety-two", "almost 90" the same way) as [AGE_n]. do not leave any redactable age.
- redact the remaining hipaa identifier types: fax numbers as [FAX_n], certificate/license numbers as [LICENSE_n], device serial numbers as [DEVICE_ID_n], vehicle identifiers/vins as [VIN_n], health-plan beneficiary numbers as [HEALTH_PLAN_NUMBER_n], and any other unique identifying number, code or characteristic not covered by layer 1.
- biometric identifiers and full-face photographs are out of scope for a text tool; flag them in the report but note they cannot be removed from text alone.
- note in the report (step 6) that hipaa safe harbor mode was applied.
if user does not specify redaction scope: assume standard uk pseudonymisation (patient, relatives, carers, addresses). keep clinician names and institution names by default.
if the same identifier appears multiple times: reuse the same token for every occurrence. do not assign a new number for each instance.
if an identifier is ambiguous or borderline (e.g. "Mrs X", who could be the patient or a clinician): redact it. prefer false-positive removal.
if the document is empty or no identifiers are detected: report "0 identifiers found" and deliver the original text unchanged. include the limits note and explain that no identifiers were detected, which is not the same as a guarantee that none are present.

output contract

the skill delivers three outputs:

final pseudonymised document: the original document reproduced exactly (formatting, line breaks, headings, layout unchanged) with patient identifiers replaced by tokens. clinical content (findings, medications, doses, results, dates of appointments/procedures, lab values) is preserved unchanged. the document is safe to share or pass to other tools.
redaction report: a short human-readable summary of the number and types of identifiers redacted (e.g. "7 identifiers pseudonymised (2 patient names, 1 date of birth, 2 ages, 1 nhs number, 1 address)"). if hipaa safe harbor mode was applied, this is noted. if residual identifiers were found during self-check, they are listed.
token map (optional, only if user requests): a json object or a table (token | original value) mapping each token back to its original value. this is the key that reverses pseudonymisation and is provided only if the user explicitly asks for it or indicates future re-identification is needed. it must be stored and handled with the same care as the original data.

all outputs are text-based and produced in this session. no files are uploaded, no network calls are made, no data is sent to external services.

outcome signal

redacta has worked when:

the pseudonymised document reads clinically intact: a clinician could understand the findings, medications, results and clinical story without any loss of meaning.
no patient names, relatives, carers, addresses, phone numbers, nhs numbers, dates of birth, or other personal identifiers appear as plain text in the final document. all are replaced by labelled tokens.
the redaction report accurately lists the identifiers removed. you can verify the count and types by spot-checking the document and the token map.
the token map (if provided) correctly reverses the pseudonymisation: running the reinstate script with the token map restores the original document exactly, including all formatting.
the self-check pass (step 5) finds no residual identifiers that could narrow down or re-identify a person.
the user can share or process the redacted document knowing that the identifiers listed in the report no longer appear in it.

the limits note is present and visible, reminding the user that redacta is a first line of defence, not a guarantee.
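
The signal that no identifiers remain as plain text can be partly verified against the token map. A minimal sketch, assuming the final document and the token map are both in hand; the function name `leaked_values` is hypothetical, and the check only covers values that were tokenised at least once.

```python
def leaked_values(final_text: str, token_map: dict) -> list[str]:
    """Return tokens whose original value still appears in the final document."""
    lowered = final_text.casefold()  # compare case-insensitively
    return [
        token
        for token, value in token_map.items()
        if value.strip() and value.casefold() in lowered
    ]


check_map = {
    "[PATIENT_NAME_1]": "Joan Example",
    "[NHS_NUMBER_1]": "943 476 5919",
}
leaky_doc = "[PATIENT_NAME_1] lives with her sister; NHS no 943 476 5919."
clean_doc = "[PATIENT_NAME_1] lives with her sister; NHS no [NHS_NUMBER_1]."
```

A value that appears in a different form (a first name alone, or the number without spaces) would pass this check, so it narrows the self-check rather than replacing it.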