---
name: doc-extract
description: Comprehensively extract text, tables, images, and metadata from PDF and Word (.docx) documents. Use when a user shares a document and wants it parsed, analyzed, summarized, or processed. Handles native PDFs, scanned PDFs (via OCR), and Word documents.
---
# doc-extract
Extracts everything useful from PDF and Word documents so you can process, summarize, or act on the content.
## When to Use
- User shares or references a PDF or DOCX file
- Need to read, summarize, or analyze a document
- Extracting tables for data processing
- Pulling images out of a document
- Checking document metadata (author, date, title)
- OCR on scanned/image-based PDFs
## Requirements
- Python venv: `~/.openclaw/venvs/doctools`
- Libraries: `pymupdf`, `pdfplumber`, `python-docx` (pre-installed)
- CLI tools: `tesseract` (OCR), `pandoc` (optional), `pdftotext` (from poppler)
- The CLI tools are installed in `/opt/homebrew/bin/`
## Script
`scripts/extract.py` — the main extractor. Always use the venv:
```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py <file> [options]
```
## Options
| Flag | Description |
|------|-------------|
| `--format markdown` | Human-readable output (default) |
| `--format json` | Structured JSON for programmatic use |
| `--output-dir <dir>` | Where to save extracted images (default: `<filename>_extracted/` next to file) |
| `--ocr` | Run Tesseract OCR on pages with no extractable text (scanned docs) |
## Common Workflows
### Extract and read a PDF
```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/report.pdf
```
### Extract a Word doc
```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/contract.docx
```
### Scanned PDF (OCR mode)
```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/scan.pdf --ocr
```
### Get structured JSON output
```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/data.pdf --format json
```
### Save images to a specific folder
```bash
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/report.pdf --output-dir ~/Desktop/report_images
```
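When the extractor is driven from another Python program, the commands above can be assembled programmatically. The following is a minimal sketch, not part of the skill itself: `build_command` is a hypothetical helper, and the two path constants are copied from the commands above and will need adjusting on other machines. It uses only the flags listed under Options.

```python
from pathlib import Path
from typing import List, Optional

# Paths copied from the commands above; adjust them for your machine.
VENV_PYTHON = Path.home() / ".openclaw/venvs/doctools/bin/python3"
EXTRACT_SCRIPT = Path(
    "/Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py"
)


def build_command(
    file: str,
    fmt: str = "markdown",
    output_dir: Optional[str] = None,
    ocr: bool = False,
) -> List[str]:
    """Assemble the argv list for extract.py using only the documented flags."""
    if fmt not in ("markdown", "json"):
        raise ValueError(f"unsupported format: {fmt}")
    cmd = [
        str(VENV_PYTHON),
        str(EXTRACT_SCRIPT),
        str(Path(file).expanduser()),  # expand "~" since no shell is involved
        "--format",
        fmt,
    ]
    if output_dir is not None:
        cmd += ["--output-dir", str(Path(output_dir).expanduser())]
    if ocr:
        cmd.append("--ocr")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_command("~/Desktop/scan.pdf", fmt="json", ocr=True)))
```

The returned list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. Whether the script writes its result to stdout is an assumption here; check the script before relying on it.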
## Output Structure
**Markdown mode** gives you:
- Document metadata (title, author, date)
- Summary (page count, table count, image count)
- Full extracted text
- Tables rendered as Markdown tables
- Image paths for extracted images
- Any warnings/errors
**JSON mode** gives you:
- `type`: `pdf` or `docx`
- `file`: source path
- `metadata`: document properties
- `text`: full concatenated text
- `pages_text`: per-page text (PDF only)
- `tables`: array of `{page, index, rows, cols, data[][]}`
- `images`: array of `{page, index, path, format, width, height}`
- `errors`: any non-fatal issues encountered
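
To illustrate how the JSON fields listed above might be consumed, here is a small sketch that condenses a parsed result into a few counts. `summarize` is a hypothetical helper, and `sample` is a hand-written dictionary shaped like the field list, not real extractor output. In practice the dictionary would come from `json.load` on the saved result.

```python
from typing import Any, Dict


def summarize(result: Dict[str, Any]) -> Dict[str, Any]:
    """Condense a parsed JSON result into a few counts and paths."""
    return {
        "type": result.get("type"),
        # pages_text is PDF only, so this is 0 for DOCX input.
        "pages": len(result.get("pages_text") or []),
        "characters": len(result.get("text") or ""),
        "tables": len(result.get("tables") or []),
        "image_paths": [img["path"] for img in result.get("images") or []],
        "errors": list(result.get("errors") or []),
    }


# Hand-written sample following the documented field names (illustrative only).
sample = {
    "type": "pdf",
    "file": "report.pdf",
    "metadata": {"title": "Report"},
    "text": "Hello\nWorld",
    "pages_text": ["Hello", "World"],
    "tables": [
        {"page": 1, "index": 0, "rows": 1, "cols": 2, "data": [["a", "b"]]}
    ],
    "images": [
        {"page": 2, "index": 0, "path": "report_extracted/img0.png",
         "format": "png", "width": 10, "height": 10}
    ],
    "errors": [],
}

if __name__ == "__main__":
    print(summarize(sample))
```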
## Extraction Strategy
1. **PDF native text** → pymupdf (fast, accurate for digital PDFs)
2. **PDF tables** → pdfplumber (best-in-class table detection)
3. **PDF images** → pymupdf image extraction
4. **PDF scanned pages** → tesseract OCR (runs only when `--ocr` is passed, and only on pages with no extractable text)
5. **DOCX text** → python-docx (preserves paragraph styles/headings)
6. **DOCX tables** → python-docx table parser
7. **DOCX images** → extracted from document relationships
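
One way to decide whether a PDF needs a second pass with `--ocr` is to look for pages that came back empty in the JSON output. The sketch below is illustrative only. It assumes `pages_text` is a list of strings in page order, and `pages_needing_ocr` and its `min_chars` threshold are hypothetical names rather than part of `extract.py`.

```python
from typing import List


def pages_needing_ocr(pages_text: List[str], min_chars: int = 1) -> List[int]:
    """Return 1-based page numbers with fewer than min_chars visible characters."""
    needy = []
    for number, text in enumerate(pages_text, start=1):
        # Drop all whitespace so pages holding only blanks or newlines count as empty.
        visible = "".join(text.split())
        if len(visible) < min_chars:
            needy.append(number)
    return needy


if __name__ == "__main__":
    # Pages 2 and 3 hold no visible text, so they are candidates for OCR.
    print(pages_needing_ocr(["Intro", "", "  \n", "End"]))
```

If the returned list is non-empty, rerunning the extractor with `--ocr` is a reasonable next step.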
## Model
Always use **claude-opus-4-6 (opus)** when analyzing extracted content — summarizing, answering questions, processing tables, etc.
## Tips
- For large documents, use `--format json` and parse programmatically
- Images are saved to disk — reference the paths in follow-up tasks
- Tables come out as 2D arrays — easy to convert to CSV or analyze
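- For password-protected PDFs, unlock first with `qpdf --password=<password> --decrypt input.pdf output.pdf` (requires `qpdf`, which is not listed under Requirements)
- For `.doc` (old Word format), convert to `.docx` first. pandoc cannot read the legacy binary `.doc` format, so use LibreOffice (`soffice --headless --convert-to docx input.doc`) or, on macOS, `textutil -convert docx input.doc`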
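
As a sketch of the table tip above, a single entry from the JSON `tables` array can be turned into CSV text with the standard library. `table_to_csv` is a hypothetical helper. Treating `None` cells as empty strings is an assumption about how empty cells are represented in `data`.

```python
import csv
import io
from typing import Any, Dict


def table_to_csv(table: Dict[str, Any]) -> str:
    """Render one table entry ({page, index, rows, cols, data}) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table["data"]:
        # Assumption: empty cells may arrive as None; write them as empty strings.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


if __name__ == "__main__":
    example = {
        "page": 1,
        "index": 0,
        "rows": 2,
        "cols": 2,
        "data": [["Name", "Qty"], ["Widget, large", None]],
    }
    print(table_to_csv(example), end="")
```

The `csv` module handles quoting, so cells that contain commas or quotes survive the round trip.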