Extract text and LaTeX formulas from academic PDFs in English and Chinese, outputting structured Markdown with math, tables, and images preserved.
# PDF Extractor Skill
Extract text and mathematical formulas from academic PDF papers. Supports both English and Chinese content.
## When to Use This Skill
Use this skill when:
- User needs to extract text and LaTeX formulas from PDF papers
- User mentions "PDF转文本", "PDF提取公式", "论文OCR"
- User wants to convert academic papers to Markdown format
## Tool Selection
| Tool | Best For | Languages | Math Quality |
|------|----------|-----------|--------------|
| **Marker** (推荐) | 中英文论文、复杂公式 | Chinese + English | Excellent |
| **Nougat** | 纯英文论文、arXiv | English only | Excellent |
**推荐使用 Marker**:支持中英文混排,公式识别效果更好。
---
## Environment Setup
**Conda Environment**: `pdf-extractor`
**Python Path**: `D:\anaconda3\envs\pdf-extractor\python.exe`
### Key Dependencies
- PyTorch 2.10.0+cu128 (CUDA 12.8)
- marker-pdf (Surya OCR + Texify)
- nougat-ocr 0.1.17
- transformers
## Important: Keep This Skill Self-Contained (No Extra Installs)
This skill is expected to run using ONLY the existing `pdf-extractor` conda environment and the scripts in `scripts/`.
Rules:
- Do NOT run `pip install ...` / `conda install ...` / download random libraries during extraction.
- If a dependency is missing (e.g., Nougat crashes due to missing `torchvision`), do NOT try to fix by installing packages. Switch tools (prefer Marker) or report the environment issue.
- Slow runtime is normal for Marker (especially with `--ark-code-latest`). Prefer splitting the PDF rather than changing tools or adding dependencies.
Recommended approach for long PDFs:
- Use `--page-range` (0-based) to extract per page or small page batches.
- Merge the resulting markdown files afterward (simple concatenation is fine). Keep the combined file in the same folder as the per-page outputs so image links remain valid.
Example (per-page extraction with LLM mode):
```bash
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest --page-range "0" -o "out/page_01.md"
```
---
## Tool 1: Marker (推荐 - 中英文支持)
### Command Line
```bash
# 转换中文论文 (默认支持中英文)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "论文.pdf"
# 指定输出路径
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" -o "output.md"
# 强制 OCR (用于扫描版 PDF)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "scanned.pdf" --force-ocr
# 使用火山方舟 Coding Plan (OpenAI-compatible) 增强转换质量(表格/公式/跨页结构更稳)
# 注意:默认走 ark-code-latest,后台会自动路由到合适的模型
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest
# 只跑第 1 页做快速验证(0-based page index)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest --page-range "0" -o "out_first_page.md"
# 如需自定义(不推荐):也可以手动指定 --openai-base-url/--openai-api-key/--openai-model
# 指定语言
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --languages Chinese English Japanese
```
### Python API
```python
import sys
sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')
from pdf2md_marker import convert_pdf, convert_pdf_cli
# 简单用法
output_file = convert_pdf_cli('论文.pdf', 'output.md')
# 完整 API
markdown_text, metadata = convert_pdf(
'paper.pdf',
output_dir='./output',
force_ocr=False,
batch_multiplier=2,
languages=['Chinese', 'English']
)
print(markdown_text)
```
### Marker Options
| Option | Description |
|--------|-------------|
| `-o, --output` | Output file (.md) or directory |
| `--force-ocr` | Force OCR even for text PDFs |
| `--batch-multiplier` | Batch size multiplier (default: 2) |
| `--languages` | Languages in document (default: Chinese English) |
---
## Tool 2: Nougat (纯英文论文)
### Command Line
```bash
# Convert entire PDF
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf"
# Convert specific pages
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" -p 0-5
# Custom output
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" -o output.mmd
# Save each page separately
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" --per-page
```
### Python API
```python
import sys
sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')
from pdf2latex import load_model, process_pdf, save_results
# Load model (uses GPU if available)
model, device = load_model()
# Process PDF
results = process_pdf('paper.pdf', model, device)
# Save as single markdown file
save_results(results, 'output.mmd')
# Or save per page
save_results(results, 'output_pages/', format='pages')
```
### Nougat Options
| Option | Description |
|--------|-------------|
| `-o, --output` | Output file or directory |
| `-p, --pages` | Page range (e.g., "0-5" or "1,3,5") |
| `-m, --model` | Model tag (default: 0.1.0-base) |
| `--dpi` | Render DPI (default: 300) |
| `--cpu` | Force CPU mode |
| `--per-page` | Save each page separately |
---
## Output Format
Both tools output Markdown with LaTeX math:
- Text is extracted as regular markdown
- Mathematical formulas are in LaTeX format:
- Inline: `$formula$`
- Display: `$$formula$$`
- Tables, figures, and references are preserved
- Marker also extracts images to separate folder
---
## Comparison
| Feature | Marker | Nougat |
|---------|--------|--------|
| Chinese Support | ✓ Excellent | ✗ Poor |
| English Support | ✓ Excellent | ✓ Excellent |
| Math Formulas | ✓ (Texify) | ✓ (Native) |
| Table Extraction | ✓ | ✓ |
| Image Extraction | ✓ | ✗ |
| Speed (RTX 4060) | ~2 min/page | ~10-15 sec/page |
| OCR Quality | Excellent | Good |
---
## Troubleshooting
### Import Errors
Make sure you're using the correct Python:
```bash
D:\anaconda3\envs\pdf-extractor\python.exe your_script.py
```
### CUDA Out of Memory
Try CPU mode (Nougat) or reduce batch size (Marker):
```bash
# Nougat: use CPU
D:\anaconda3\envs\pdf-extractor\python.exe pdf2latex.py paper.pdf --cpu
# Marker: reduce batch multiplier
D:\anaconda3\envs\pdf-extractor\python.exe pdf2md_marker.py paper.pdf --batch-multiplier 1
```
### Chinese Characters Not Recognized
Use Marker instead of Nougat for Chinese documents.
### Slow Processing
- Marker is slower but more accurate (uses multiple ML models)
- For faster processing on English-only papers, use Nougat
- Ensure GPU is being used (check CUDA availability)
---
## Model Information
**Marker Models** (downloaded automatically):
- Surya OCR: Text detection and recognition
- Texify: Math formula recognition
- Layout analysis models
**Nougat Base Model** (1.31 GB):
- Location: `C:\Users\cr\.cache\torch\hub\nougat-0.1.0-base`
- Best for: Standard academic papers, arXiv papers
---
## Example Workflow
```python
import sys
sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')
def extract_paper(pdf_path, is_chinese=True):
"""
Extract text and formulas from academic paper.
Args:
pdf_path: Path to PDF file
is_chinese: True for Chinese papers, False for English only
Returns:
Extracted markdown text
"""
if is_chinese:
from pdf2md_marker import convert_pdf
text, _ = convert_pdf(pdf_path, languages=['Chinese', 'English'])
else:
from pdf2latex import load_model, process_pdf
model, device = load_model()
results = process_pdf(pdf_path, model, device)
text = '\n\n'.join([t for _, t in results])
return text
# Usage
text = extract_paper('中文论文.pdf', is_chinese=True)
print(text)
```
don't have the plugin yet? install it then click "run inline in claude" again.