---
name: doc-extract
description: Comprehensively extract text, tables, images, and metadata from PDF and Word (.docx) documents. Use when a user shares a document and wants it parsed, analyzed, summarized, or processed. Handles native PDFs, scanned PDFs (via OCR), and Word documents.
---

# doc-extract

Extracts everything useful from PDF and Word documents so you can process, summarize, or act on the content.

## When to Use

- User shares or references a PDF or DOCX file
- Need to read, summarize, or analyze a document
- Extracting tables for data processing
- Pulling images out of a document
- Checking document metadata (author, date, title)
- OCR on scanned/image-based PDFs

## Requirements

- Python venv: `~/.openclaw/venvs/doctools`
- Libraries: `pymupdf`, `pdfplumber`, `python-docx` (pre-installed)
- CLI tools: `tesseract` (OCR), `pandoc` (optional), `pdftotext` (from poppler)
- The CLI tools are installed in `/opt/homebrew/bin/`

## Script

`scripts/extract.py` — the main extractor. Always use the venv:

```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py <file> [options]
```

## Options

| Flag | Description |
|------|-------------|
| `--format markdown` | Human-readable output (default) |
| `--format json` | Structured JSON for programmatic use |
| `--output-dir <dir>` | Where to save extracted images (default: `<filename>_extracted/` next to file) |
| `--ocr` | Run Tesseract OCR on pages with no extractable text (scanned docs) |
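
The flags above can also be assembled from Python. The following is a minimal sketch, not part of the skill itself: `build_command` and `run_extract` are hypothetical helper names, the two paths are copied from the Script section, and it assumes the extractor writes its result to stdout, which this document does not state.

```python
import os
import subprocess

# Paths copied from the Script section; adjust them for your machine.
VENV_PYTHON = os.path.expanduser("~/.openclaw/venvs/doctools/bin/python3")
EXTRACT_SCRIPT = "/Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py"


def build_command(path, fmt="markdown", output_dir=None, ocr=False):
    """Assemble the argv list for extract.py from the documented flags."""
    if fmt not in ("markdown", "json"):
        raise ValueError(f"unsupported format: {fmt!r}")
    cmd = [VENV_PYTHON, EXTRACT_SCRIPT, os.path.expanduser(path), "--format", fmt]
    if output_dir is not None:
        cmd += ["--output-dir", os.path.expanduser(output_dir)]
    if ocr:
        cmd.append("--ocr")
    return cmd


def run_extract(path, **options):
    """Run the extractor and return its stdout (assumption: output goes to stdout)."""
    result = subprocess.run(
        build_command(path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.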

## Common Workflows

### Extract and read a PDF
```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/report.pdf
```

### Extract a Word doc
```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/contract.docx
```

### Scanned PDF (OCR mode)
```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/scan.pdf --ocr
```

### Get structured JSON output
```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/data.pdf --format json
```

### Save images to a specific folder
```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/report.pdf --output-dir ~/Desktop/report_images
```

## Output Structure

**Markdown mode** gives you:
- Document metadata (title, author, date)
- Summary (page count, table count, image count)
- Full extracted text
- Tables rendered as Markdown tables
- Image paths for extracted images
- Any warnings/errors

**JSON mode** gives you:
- `type`: pdf or docx
- `file`: source path
- `metadata`: document properties
- `text`: full concatenated text
- `pages_text`: per-page text (PDF only)
- `tables`: array of `{page, index, rows, cols, data[][]}`
- `images`: array of `{page, index, path, format, width, height}`
- `errors`: any non-fatal issues encountered
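
As a rough illustration of consuming the JSON fields listed above, here is a small sketch. The `summarize` helper and the sample payload are hypothetical, and the sketch assumes `pages_text` is a list of strings with one entry per page, which the field list does not spell out.

```python
def summarize(doc):
    """Return a small summary dict from a doc-extract JSON payload."""
    return {
        "type": doc.get("type"),
        "pages": len(doc.get("pages_text", [])),  # PDF only, so 0 for DOCX
        "characters": len(doc.get("text", "")),
        "tables": len(doc.get("tables", [])),
        "images": len(doc.get("images", [])),
        "errors": list(doc.get("errors", [])),
    }


# Hypothetical payload in the documented shape (not real extractor output).
sample = {
    "type": "pdf",
    "file": "/tmp/report.pdf",
    "metadata": {"title": "Report"},
    "text": "Page one.\nPage two.",
    "pages_text": ["Page one.", "Page two."],
    "tables": [
        {"page": 1, "index": 0, "rows": 2, "cols": 2,
         "data": [["a", "b"], ["1", "2"]]},
    ],
    "images": [],
    "errors": [],
}

print(summarize(sample))
```

With real output, `sample` would come from `json.loads(...)` applied to the text produced by `--format json`.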

## Extraction Strategy

1. **PDF native text** → pymupdf (fast, accurate for digital PDFs)
2. **PDF tables** → pdfplumber (best-in-class table detection)
3. **PDF images** → pymupdf image extraction
4. **PDF scanned pages** → tesseract OCR (enabled with `--ocr`; applied only to pages that have no extractable text)
5. **DOCX text** → python-docx (preserves paragraph styles/headings)
6. **DOCX tables** → python-docx table parser
7. **DOCX images** → extracted from document relationships
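
The "page has no text" condition in step 4 can be pictured with the sketch below. It is not the script's actual logic, which this document does not show; `pages_needing_ocr` and its `min_chars` threshold are hypothetical, and the input is assumed to be a list of per-page strings.

```python
def pages_needing_ocr(pages_text, min_chars=1):
    """Return 1-based page numbers whose extracted text is effectively empty."""
    return [
        number
        for number, text in enumerate(pages_text, start=1)
        # Whitespace-only and missing text both count as "no text".
        if len((text or "").strip()) < min_chars
    ]
```

Raising `min_chars` would also flag pages that contain only a stray character or two, which can happen when a scanned page carries a thin text layer.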

## Model

Always use **claude-opus-4-6 (opus)** when analyzing extracted content — summarizing, answering questions, processing tables, etc.

## Tips

- For large documents, use `--format json` and parse programmatically
- Images are saved to disk — reference the paths in follow-up tasks
- Tables come out as 2D arrays — easy to convert to CSV or analyze
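- For password-protected PDFs, unlock first with `qpdf` (not listed under Requirements): `qpdf --password=<password> --decrypt input.pdf output.pdf`
- For `.doc` (legacy Word format), convert to `.docx` first. pandoc cannot read `.doc`, so on macOS use `textutil -convert docx input.doc`, or with LibreOffice installed: `soffice --headless --convert-to docx input.doc`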
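
The tip about converting tables to CSV could look like the sketch below. `table_to_csv` is a hypothetical helper; it takes one entry from the `tables` array in the JSON output and assumes empty cells may arrive as null.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table entry's 2D `data` array to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table["data"]:
        # Treat missing cells as empty strings (assumption: cells may be null).
        writer.writerow("" if cell is None else cell for cell in row)
    return buffer.getvalue()
```

Using the `csv` module rather than joining on commas keeps cells that contain commas or quotes intact.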
