PDF Field Extractor — AI-powered PDF structured data extraction. Extract key fields from PDF into Excel/JSON. Supports: invoice, contract, receipt, bank stat...
---
name: pdf-extractor
description: "PDF Field Extractor — AI-powered PDF structured data extraction. Extract key fields from PDF into Excel/JSON. Supports: invoice, contract, receipt, bank statement, license, ID card, express waybill, generic document. Triggers: PDF extraction, PDF field extraction, PDF to Excel, PDF to JSON, invoice extraction, contract extraction, document recognition, batch PDF processing, field extraction."
override-tools: []
---
# PDF Field Extractor
AI-powered PDF structured data extraction — convert PDF key fields into Excel/JSON.
## End-to-End Flow
User uploads PDF → Document type identification → AI field extraction → Structured output (Excel/JSON)
```python
from scripts.pdf_extractor import extract_pdf_text
from scripts.field_extractor import extract_fields
from scripts.output_generator import generate_excel, generate_json
# Step 1: Extract PDF text (PyMuPDF + pdfplumber)
text, tables, images = extract_pdf_text("invoice.pdf")
# Step 2: AI field extraction (user provides own API Key, OpenAI-compatible)
fields = extract_fields(
text=text,
doc_type="invoice",
api_key="sk-xxx",
api_base="https://api.openai.com/v1",
model="gpt-4o",
)
```
## Supported Document Types
| Type | Description |
|------|-------------|
| Invoice | VAT invoice, receipt invoice, electronic invoice |
| Contract | Contracts, agreements |
| Receipt | Receipts, tickets |
| Bank Statement | Bank reconciliation statements |
| License | Business license |
| ID Card | ID card, passport |
| Express | Waybill, shipping label |
| Generic | User-defined custom extraction |
## Detection Modes
| Mode | Description |
|------|-------------|
| Auto | AI automatically identifies document type |
| Manual | User specifies document type |
## Tiered Features
| Feature | FREE | PRO |
|---------|:----:|:---:|
| Monthly pages | 10 | Unlimited |
| Document types | Invoice only | All types |
| Output formats | Text | Excel + JSON + Text |
| OCR languages | English | English + Chinese + 9 more |
| Batch processing | 1 page | Unlimited |
| Custom fields | — | Yes |
| Price | Free | $0.01/call |
---
## Technical Implementation
- **PDF parsing**: PyMuPDF (fitz) + pdfplumber for text and table extraction
- **OCR**: EasyOCR / Tesseract for scanned documents (multi-language support)
- **AI extraction**: OpenAI-compatible API, model-agnostic (GPT-4o, DeepSeek, GLM, etc.)
- **Output**: Excel (.xlsx) with formatted sheets, JSON with structured hierarchy
## Output Format
### Excel Output
- Sheet per document type
- Header row with field names
- Data rows with extracted values
- Color-coded by confidence
### JSON Output
```json
{
"doc_type": "invoice",
"fields": {
"invoice_number": "...",
"date": "...",
"amount": "...",
"buyer": "...",
"seller": "..."
},
"confidence": 0.95
}
```
---
## Security Notes
- **AI API calls**: Uses `requests.post` to OpenAI-compatible endpoints with user-provided API key (not stored)
- **Data storage**: Uses `/tmp/pdf-extractor/` for temporary processing files (no home directory write)
- **OCR**: Local processing via EasyOCR/Tesseract (no external data transmission)
- **Billing data**: `FEISHU_USER_ID` transmitted to `skillpay.me/api/v1/billing` for per-call charging
---
## Billing
- Billing via `skillpay.me/api/v1/billing/charge`
- User data transmitted to SkillPay for billing identification
- $0.01 USD per extraction call (PRO tier)
---
## Required Environment Variables
| Variable | Description |
|----------|-------------|
| `FEISHU_USER_ID` | User open_id for billing |
| `SKILL_BILLING_API_KEY` | SkillPay Builder API Key |
| `SKILL_BILLING_SKILL_ID` | SkillPay Skill ID (default: pdf-extractor) |
---
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `NO_TEXT_EXTRACTED` | Scanned PDF without OCR | Enable OCR or use digital PDF |
| `UNSUPPORTED_DOC_TYPE` | Document type not recognized | Specify type manually |
| `API_ERROR` | AI API key invalid or quota exceeded | Check API key |don't have the plugin yet? install it then click "run inline in claude" again.