Parse PDF documents with MinerU MCP to extract text, tables, and formulas. Supports multiple backends including MLX-accelerated inference on Apple Silicon.
---
name: mineru-pdf
description: Parse PDF documents with MinerU MCP to extract text, tables, and formulas. Supports multiple backends including MLX-accelerated inference on Apple Silicon.
homepage: https://github.com/TINKPA/mcp-mineru
metadata:
{
"openclaw":
{
"emoji": "π",
"requires": { "bins": ["uvx"] },
"install":
[
{
"id": "uvx",
"kind": "uvx",
"package": "mcp-mineru",
"label": "Install mcp-mineru via uvx (auto-managed)",
},
],
},
}
---
# MinerU PDF Parser
Parse PDF documents using MinerU MCP to extract structured content including text, tables, and formulas with MLX acceleration on Apple Silicon.
## Installation
### Option 1: Install MinerU MCP (for Claude Code)
```bash
claude mcp add --transport stdio --scope user mineru -- \
uvx --from mcp-mineru python -m mcp_mineru.server
```
This installs and configures MinerU for all Claude projects. Models are downloaded on first use.
### Option 2: Use Direct Tool (preserves files)
The skill includes a direct parsing tool that saves output to a persistent directory:
```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py <pdf_path> <output_dir> [options]
```
**Advantages:**
- β
Files are saved permanently (not auto-deleted)
- β
Full control over output location
- β
No MCP overhead
- β
Works with any Python environment that has MinerU
## Quick Start
### Method 1: Using the Direct Tool (Recommended)
```bash
# Parse entire PDF
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output"
# Parse specific pages
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output" \
--start-page 0 --end-page 2
# Use Apple Silicon optimization
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output" \
--backend vlm-mlx-engine
# Text only (faster)
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output" \
--no-table --no-formula
```
### Method 2: Using MinerU MCP (Temporary Files)
### Parse a PDF document
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool
async def parse_pdf():
result = await call_tool(
name='parse_pdf',
arguments={
'file_path': '/path/to/document.pdf',
'backend': 'pipeline',
'formula_enable': True,
'table_enable': True,
'start_page': 0,
'end_page': -1 # -1 for all pages
}
)
if hasattr(result, 'content'):
for item in result.content:
if hasattr(item, 'text'):
print(item.text)
break
asyncio.run(parse_pdf())
"
```
### Check system capabilities
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool
async def list_backends():
result = await call_tool(
name='list_backends',
arguments={}
)
if hasattr(result, 'content'):
for item in result.content:
if hasattr(item, 'text'):
print(item.text)
break
asyncio.run(list_backends())
"
```
## Parameters
### parse_pdf
**Required:**
- `file_path` - Absolute path to the PDF file
**Optional:**
- `backend` - Processing backend (default: `pipeline`)
- `pipeline` - Fast, general-purpose (recommended)
- `vlm-mlx-engine` - Fastest on Apple Silicon (M1/M2/M3/M4)
- `vlm-transformers` - Slowest but most accurate
- `formula_enable` - Enable formula recognition (default: `true`)
- `table_enable` - Enable table recognition (default: `true`)
- `start_page` - Starting page (0-indexed, default: `0`)
- `end_page` - Ending page (default: `-1` for all pages)
### list_backends
No parameters required. Returns system information and backend recommendations.
## Usage Examples
### Extract tables from a specific page range
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool
async def parse_pdf():
result = await call_tool(
name='parse_pdf',
arguments={
'file_path': '/path/to/document.pdf',
'backend': 'pipeline',
'table_enable': True,
'start_page': 5,
'end_page': 10
}
)
if hasattr(result, 'content'):
for item in result.content:
if hasattr(item, 'text'):
print(item.text)
break
asyncio.run(parse_pdf())
"
```
### Parse with formula recognition only (faster)
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool
async def parse_pdf():
result = await call_tool(
name='parse_pdf',
arguments={
'file_path': '/path/to/document.pdf',
'backend': 'vlm-mlx-engine',
'formula_enable': True,
'table_enable': False # Disable for speed
}
)
if hasattr(result, 'content'):
for item in result.content:
if hasattr(item, 'text'):
print(item.text)
break
asyncio.run(parse_pdf())
"
```
### Parse single page (fastest for testing)
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool
async def parse_pdf():
result = await call_tool(
name='parse_pdf',
arguments={
'file_path': '/path/to/document.pdf',
'backend': 'pipeline',
'formula_enable': False,
'table_enable': False,
'start_page': 0,
'end_page': 0
}
)
if hasattr(result, 'content'):
for item in result.content:
if hasattr(item, 'text'):
print(item.text)
break
asyncio.run(parse_pdf())
"
```
## Performance
On Apple Silicon M4 (16GB RAM):
- `pipeline`: ~32s/page, CPU-only, good quality
- `vlm-mlx-engine`: ~38s/page, Apple Silicon optimized, excellent quality
- `vlm-transformers`: ~148s/page, highest quality, slowest
**Note:** First run downloads models (can take 5-10 minutes). Models are cached in `~/.cache/uv/` for faster subsequent runs.
## Output Format
Returns structured Markdown with:
- Document metadata (file, backend, pages, settings)
- Extracted text with preserved structure
- Tables formatted as Markdown tables
- Formulas converted to LaTeX
## Supported Formats
- PDF documents (`.pdf`)
- JPEG images (`.jpg`, `.jpeg`)
- PNG images (`.png`)
- Other image formats (WebP, GIF, etc.)
## Troubleshooting
### Module not found error
If you get "No module named 'mcp_mineru'", make sure you installed it:
```bash
claude mcp add --transport stdio --scope user mineru -- \
uvx --from mcp-mineru python -m mcp_mineru.server
```
### Slow processing on first run
This is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster.
### Timeout errors
Increase timeout for large documents or use smaller page ranges for testing.
## Notes
- Output is returned as Markdown text
- Tables are preserved in Markdown format
- Mathematical formulas are converted to LaTeX
- Works with scanned documents (OCR built-in)
- Optimized for Apple Silicon (M1/M2/M3/M4) with MLX backend
## File Persistence
### Why Files Get Deleted (MCP Method)
The MinerU MCP server uses Python's `tempfile.TemporaryDirectory()`, which automatically deletes files when the context exits. This is by design to prevent temporary files from accumulating.
### How to Preserve Files
**Method A: Use the Direct Tool (Recommended)**
The skill provides `parse.py` which saves files to a persistent directory:
```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
/path/to/input.pdf \
/path/to/output_dir
```
**Advantages:**
- β
Files are never auto-deleted
- β
Full control over output location
- β
Can be used in batch processing
- β
No MCP connection needed
**Generated Structure:**
```
/path/to/output_dir/
βββ input.pdf_name/
β βββ auto/ # or vlm/ depending on backend
β βββ input.pdf_name.md
β βββ images/
β βββ *.jpg
βββ input.pdf_name_parsed.md # Copy at root for easy access
```
**Method B: Redirect MCP Output**
If using the MCP method, capture the output and save it:
```bash
# Capture to file
claude -p "Parse this PDF: /path/to/file.pdf" > /tmp/output.md
# Or use within a script that saves the result
```
### Comparison
| Feature | Direct Tool | MCP Method |
|----------|-------------|-------------|
| Files persisted | β
Yes | β No (auto-deleted) |
| Custom output dir | β
Yes | β No (temp only) |
| Claude Code integration | β οΈ Manual | β
Native |
| Speed | β
Fast | β οΈ MCP overhead |
| Offline use | β
Yes | β οΈ Needs Claude Code |
### Recommendation
- **Use Direct Tool** when you need to keep the files for later use
- **Use MCP Method** when working within Claude Code and only need the text content
don't have the plugin yet? install it then click "run inline in claude" again.