Parse PDF documents with MinerU MCP to extract text, tables, and formulas. Supports multiple backends including MLX-accelerated inference on Apple Silicon.
SKILL.md

---
name: mineru-pdf
description: Parse PDF documents with MinerU MCP to extract text, tables, and formulas. Supports multiple backends including MLX-accelerated inference on Apple Silicon.
homepage: https://github.com/TINKPA/mcp-mineru
metadata:
  {
    "openclaw":
      {
        "emoji": "📄",
        "requires": { "bins": ["uvx"] },
        "install":
          [
            {
              "id": "uvx",
              "kind": "uvx",
              "package": "mcp-mineru",
              "label": "Install mcp-mineru via uvx (auto-managed)",
            },
          ],
      },
  }
---

# MinerU PDF Parser

Parse PDF documents using MinerU MCP to extract structured content including text, tables, and formulas with MLX acceleration on Apple Silicon.

## Installation

### Option 1: Install MinerU MCP (for Claude Code)

```bash
claude mcp add --transport stdio --scope user mineru -- \
  uvx --from mcp-mineru python -m mcp_mineru.server
```

This installs and configures MinerU for all Claude projects. Models are downloaded on first use.

### Option 2: Use Direct Tool (preserves files)

The skill includes a direct parsing tool that saves output to a persistent directory:

```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py <pdf_path> <output_dir> [options]
```

**Advantages:**
- ✅ Files are saved permanently (not auto-deleted)
- ✅ Full control over output location
- ✅ No MCP overhead
- ✅ Works with any Python environment that has MinerU

## Quick Start

### Method 1: Using the Direct Tool (Recommended)

```bash
# Parse entire PDF
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  "/path/to/document.pdf" \
  "/path/to/output"

# Parse specific pages
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  "/path/to/document.pdf" \
  "/path/to/output" \
  --start-page 0 --end-page 2

# Use Apple Silicon optimization
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  "/path/to/document.pdf" \
  "/path/to/output" \
  --backend vlm-mlx-engine

# Text only (faster)
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  "/path/to/document.pdf" \
  "/path/to/output" \
  --no-table --no-formula
```

### Method 2: Using MinerU MCP (Temporary Files)

### Parse a PDF document

```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': True,
            'table_enable': True,
            'start_page': 0,
            'end_page': -1  # -1 for all pages
        }
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```

### Check system capabilities

```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def list_backends():
    result = await call_tool(
        name='list_backends',
        arguments={}
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(list_backends())
"
```

## Parameters

### parse_pdf

**Required:**
- `file_path` - Absolute path to the PDF file

**Optional:**
- `backend` - Processing backend (default: `pipeline`)
  - `pipeline` - Fast, general-purpose (recommended)
  - `vlm-mlx-engine` - Fastest on Apple Silicon (M1/M2/M3/M4)
  - `vlm-transformers` - Slowest but most accurate
- `formula_enable` - Enable formula recognition (default: `true`)
- `table_enable` - Enable table recognition (default: `true`)
- `start_page` - Starting page (0-indexed, default: `0`)
- `end_page` - Ending page (default: `-1` for all pages)

### list_backends

No parameters required. Returns system information and backend recommendations.

## Usage Examples

### Extract tables from a specific page range

```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'table_enable': True,
            'start_page': 5,
            'end_page': 10
        }
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```

### Parse with formula recognition only (faster)

```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'vlm-mlx-engine',
            'formula_enable': True,
            'table_enable': False  # Disable for speed
        }
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```

### Parse single page (fastest for testing)

```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': False,
            'table_enable': False,
            'start_page': 0,
            'end_page': 0
        }
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```

## Performance

On Apple Silicon M4 (16GB RAM):
- `pipeline`: ~32s/page, CPU-only, good quality
- `vlm-mlx-engine`: ~38s/page, Apple Silicon optimized, excellent quality
- `vlm-transformers`: ~148s/page, highest quality, slowest

**Note:** First run downloads models (can take 5-10 minutes). Models are cached in `~/.cache/uv/` for faster subsequent runs.

## Output Format

Returns structured Markdown with:
- Document metadata (file, backend, pages, settings)
- Extracted text with preserved structure
- Tables formatted as Markdown tables
- Formulas converted to LaTeX

## Supported Formats

- PDF documents (`.pdf`)
- JPEG images (`.jpg`, `.jpeg`)
- PNG images (`.png`)
- Other image formats (WebP, GIF, etc.)

## Troubleshooting

### Module not found error

If you get "No module named 'mcp_mineru'", make sure you installed it:

```bash
claude mcp add --transport stdio --scope user mineru -- \
  uvx --from mcp-mineru python -m mcp_mineru.server
```

### Slow processing on first run

This is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster.

### Timeout errors

Increase timeout for large documents or use smaller page ranges for testing.

## Notes

- Output is returned as Markdown text
- Tables are preserved in Markdown format
- Mathematical formulas are converted to LaTeX
- Works with scanned documents (OCR built-in)
- Optimized for Apple Silicon (M1/M2/M3/M4) with MLX backend

## File Persistence

### Why Files Get Deleted (MCP Method)

The MinerU MCP server uses Python's `tempfile.TemporaryDirectory()`, which automatically deletes files when the context exits. This is by design to prevent temporary files from accumulating.

### How to Preserve Files

**Method A: Use the Direct Tool (Recommended)**

The skill provides `parse.py` which saves files to a persistent directory:

```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/input.pdf \
  /path/to/output_dir
```

**Advantages:**
- ✅ Files are never auto-deleted
- ✅ Full control over output location
- ✅ Can be used in batch processing
- ✅ No MCP connection needed

**Generated Structure:**
```
/path/to/output_dir/
├── input.pdf_name/
│   └── auto/          # or vlm/ depending on backend
│       ├── input.pdf_name.md
│       └── images/
│           └── *.jpg
└── input.pdf_name_parsed.md  # Copy at root for easy access
```

**Method B: Redirect MCP Output**

If using the MCP method, capture the output and save it:

```bash
# Capture to file
claude -p "Parse this PDF: /path/to/file.pdf" > /tmp/output.md

# Or use within a script that saves the result
```

### Comparison

| Feature | Direct Tool | MCP Method |
|----------|-------------|-------------|
| Files persisted | ✅ Yes | ❌ No (auto-deleted) |
| Custom output dir | ✅ Yes | ❌ No (temp only) |
| Claude Code integration | ⚠️ Manual | ✅ Native |
| Speed | ✅ Fast | ⚠️ MCP overhead |
| Offline use | ✅ Yes | ⚠️ Needs Claude Code |

### Recommendation

- **Use Direct Tool** when you need to keep the files for later use
- **Use MCP Method** when working within Claude Code and only need the text content
Mineru Pdf

SKILL.md

related skills