---
name: pubmed-verifier
description: PubMed citation verifier that detects AI-fabricated references with fuzzy matching
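Keywords: PMID verification, PubMed citation checking, literature audit, AI hallucination detection, academic writing, medical literature,
batch verification, citation cross-checking, paper citation check, reference validation, citation audit,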
PubMed API, systematic review QA, academic integrity, pharmacovigilance.
Triggers: "验证PMID", "检查引用", "核查文献", "PMID检查", "引用验证",
"verify PMIDs", "check citations", "validate references", "audit PMID",
"文献验证", "引用核对", "batch verify references", "检查引用", "验证文献".
---
# PubMed Citation Verifier v2.1
Five-state batch verification of PMID citations via the PubMed E-utilities API, with SQLite caching, CSV claims support, and Crossref DOI verification.
## Verdict Types
| Icon | Verdict | Meaning |
|------|---------|---------|
| ✅ | Correct | PMID exists AND matches claimed paper (title + author/journal) |
| ⚠️ | Mismatch | PMID exists but points to a **different** paper (most common AI hallucination!) |
| 🔶 | Partial | Some metadata matches (e.g., author+journal match but title differs) |
| ❌ | Invalid | PMID not found in PubMed |
| ❓ | Unknown | Insufficient claimed metadata for cross-check |
## Quick Start
```bash
# Verify all PMIDs in a project directory (auto-parses citation context)
python3 scripts/verify_pmids.py --source /path/to/project --output report.html
# Verify specific PMIDs
python3 scripts/verify_pmids.py --pmids 31018962,22213727
# Verify with explicit claimed metadata (JSON)
python3 scripts/verify_pmids.py --claims '[{"pmid":"34078778","title":"JIA pathogenesis","authors":["Zaripova"],"journal":"Pediatr Rheumatol Online J","year":"2021"}]' --output report.html
# Verify with claims file (JSON or CSV)
python3 scripts/verify_pmids.py --claims-file claims.csv --suggest --output report.html
# Verify + DOI cross-check via Crossref
python3 scripts/verify_pmids.py --source /path/to/files --verify-doi --output report.html
# Full pipeline with all features
python3 scripts/verify_pmids.py --source /path/to/files --verify-doi --suggest --output report.html
```
## What's New in v2.1
| Feature | Description |
|---------|-------------|
| **SQLite Cache** | Verified PMIDs cached locally at `~/.cache/pubmed-verifier/cache.db`. 30-day default expiry. Re-runs on large projects take seconds instead of minutes. |
| **CSV Claims** | `--claims-file` now accepts `.csv` files in addition to JSON. Auto-detects format. Semicolon or pipe-delimited authors supported. |
| **Crossref DOI** | `--verify-doi` cross-references article DOIs via Crossref API for extra confidence. |
| **Retry Logic** | Automatic 3-retry with exponential backoff (1s→2s→4s) on transient API failures. Zero external dependencies. |
| **Dual Fuzzy Matching** | Title matching uses word-level Jaccard overlap (≥ 50%) as the primary test, with a SequenceMatcher ratio (≥ 90%) as a supplementary check. |
## How It Works
### 1. Extract PMIDs + Parse Citation Context
Scans files for common PMID patterns (`PMID: 12345678`, PubMed URLs, etc.).
**Automatically parses surrounding citation text** to extract claimed metadata:
- Author surnames (e.g., `Ravelli A, Martini A` → `["Ravelli", "Martini"]`)
- Paper title (between author and journal)
- Journal name (from `<i>...</i>` tags or position)
- Publication year (`20xx` / `19xx`)
Supported file types: `.html`, `.md`, `.txt`, `.json`, `.htm`
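As a rough illustration of the extraction step, the sketch below pulls PMIDs out of free text. The regexes and the `extract_pmids` name are assumptions made for this example; the patterns used by `scripts/verify_pmids.py` may differ.

```python
import re

# Hypothetical patterns for illustration only; the real script's regexes may differ.
PMID_PATTERNS = [
    re.compile(r"\bPMID\s*[:#]?\s*(\d{1,8})\b", re.IGNORECASE),
    re.compile(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d{1,8})\b"),
    re.compile(r"ncbi\.nlm\.nih\.gov/pubmed/(\d{1,8})\b"),
]

def extract_pmids(text: str) -> list[str]:
    """Return unique PMIDs in order of first appearance in the text."""
    hits = []
    for pattern in PMID_PATTERNS:
        for m in pattern.finditer(text):
            hits.append((m.start(), m.group(1)))
    seen, ordered = set(), []
    for _, pmid in sorted(hits):
        if pmid not in seen:
            seen.add(pmid)
            ordered.append(pmid)
    return ordered
```

Sorting by match position keeps the report in document order even though each pattern is scanned separately.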
### 2. Fetch PubMed Metadata (with Cache)
Each PMID is queried via the PubMed `esummary` API. Results are cached in SQLite for 30 days (configurable via `--cache-days`); use `--no-cache` to force fresh queries.
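A minimal sketch of how a time-limited SQLite cache like this could work. The table layout and the `open_cache`, `cache_put`, and `cache_get` names are hypothetical and not taken from the script.

```python
import json
import sqlite3
import time

def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache; the schema here is an assumed layout."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pmid_cache ("
        "pmid TEXT PRIMARY KEY, payload TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    return conn

def cache_put(conn, pmid, record, now=None):
    """Store a metadata record as JSON together with its fetch time."""
    conn.execute(
        "INSERT OR REPLACE INTO pmid_cache VALUES (?, ?, ?)",
        (pmid, json.dumps(record), time.time() if now is None else now),
    )
    conn.commit()

def cache_get(conn, pmid, max_age_days=30, now=None):
    """Return the cached record, or None if it is missing or expired."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT payload, fetched_at FROM pmid_cache WHERE pmid = ?", (pmid,)
    ).fetchone()
    if row is None or now - row[1] > max_age_days * 86400:
        return None
    return json.loads(row[0])
```

Passing `now` explicitly is only a testing convenience; in normal use both functions fall back to the current time.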
Requests are batched: 50 PMIDs per call, a 0.4s delay between batches, and up to 3 retries with exponential backoff.
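The batching and retry behaviour can be sketched as follows. This assumes "3-retry" means one initial attempt plus up to three retries with 1s, 2s, and 4s waits; `chunked`, `with_retry`, `fetch_all`, and the injected `fetch_batch` callable are hypothetical names for this example, not the script's API.

```python
import time
from typing import Callable, Iterator, Sequence

def chunked(items: Sequence[str], size: int = 50) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def with_retry(call: Callable[[], dict], retries: int = 3,
               base_delay: float = 1.0, sleep=time.sleep) -> dict:
    """Try once, then retry up to `retries` times, waiting 1s, 2s, 4s."""
    for attempt in range(retries + 1):
        try:
            return call()
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)

def fetch_all(pmids, fetch_batch, batch_delay=0.4, sleep=time.sleep):
    """Fetch metadata batch by batch, pausing between batches."""
    results = {}
    for n, batch in enumerate(chunked(pmids)):
        if n:  # no pause before the first batch
            sleep(batch_delay)
        results.update(with_retry(lambda: fetch_batch(batch), sleep=sleep))
    return results
```

Injecting `sleep` and `fetch_batch` keeps the sketch testable without network access; a real caller would pass a function that queries `esummary`.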
### 3. Cross-Check: Claimed vs Actual
Dual-strategy fuzzy matching:
- **Primary**: Word-level Jaccard overlap ≥ 50% (handles word reordering, abbreviation expansion)
- **Supplementary**: SequenceMatcher ratio ≥ 90% (catches edge cases)
| Field | Match Logic |
|-------|-------------|
| **Title** | Word overlap ≥ 50% OR SequenceMatcher ≥ 90% |
| **Authors** | At least 1 surname hit for a single-author claim; at least 2 when multiple authors are claimed |
| **Journal** | Containment match (handles abbreviations) |
| **Year** | Exact match |
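The per-field rules in the table could look roughly like this in code. The tokenisation, the assumption that author strings are surname-first (for example `Zaripova LN`), and the function names are illustrative guesses rather than the script's actual implementation.

```python
import re
from difflib import SequenceMatcher

def _words(text: str) -> set[str]:
    """Lowercased alphanumeric word set (assumed tokenisation)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def title_match(claimed: str, actual: str) -> bool:
    """Word-level Jaccard >= 0.5, or SequenceMatcher ratio >= 0.9."""
    a, b = _words(claimed), _words(actual)
    if not a or not b:
        return False
    jaccard = len(a & b) / len(a | b)
    ratio = SequenceMatcher(None, claimed.lower(), actual.lower()).ratio()
    return jaccard >= 0.5 or ratio >= 0.9

def author_match(claimed: list[str], actual: list[str]) -> bool:
    """One surname hit for a single claimed author, two for several."""
    surnames = {n.split()[0].lower() for n in actual if n.strip()}
    hits = sum(1 for n in claimed if n.strip() and n.split()[0].lower() in surnames)
    needed = 1 if len(claimed) <= 1 else 2
    return hits >= needed

def journal_match(claimed: str, actual: str) -> bool:
    """Containment in either direction, to tolerate abbreviations."""
    c, a = claimed.lower().strip(), actual.lower().strip()
    return bool(c and a) and (c in a or a in c)
```

Because Jaccard works on word sets, a title with reordered words still matches, while a heavily shortened title can fall below the 50% threshold.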
**Verdict determination** (for PMIDs that exist in PubMed and have claimed metadata to compare):
- `title_match AND (author_match OR journal_match)` → ✅ Correct
- `author_match AND journal_match AND NOT title_match` → 🔶 Partial
- Otherwise → ⚠️ Mismatch
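Put together with the Invalid and Unknown states from the verdict table, the decision rules above can be sketched as one function. Treating "no claimed fields at all" as Unknown is an assumption of this example, as is the `decide_verdict` name.

```python
from typing import Optional

def decide_verdict(found: bool,
                   title_ok: Optional[bool] = None,
                   author_ok: Optional[bool] = None,
                   journal_ok: Optional[bool] = None) -> str:
    """Map field-level match results to one of the five verdicts.

    None means the claim did not supply that field (assumed convention).
    """
    if not found:
        return "Invalid"
    if title_ok is None and author_ok is None and journal_ok is None:
        return "Unknown"
    t, a, j = bool(title_ok), bool(author_ok), bool(journal_ok)
    if t and (a or j):
        return "Correct"
    if a and j and not t:
        return "Partial"
    return "Mismatch"
```

Note that under these rules a title match on its own, without an author or journal match, still falls through to Mismatch.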
### 4. Crossref DOI Verification (Optional, `--verify-doi`)
For articles with DOIs, queries Crossref API to cross-verify title/journal/year as an independent data source.
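For orientation, a Crossref lookup of this kind could be sketched as below, using the public `https://api.crossref.org/works/{doi}` route. The helper names and the choice of fields to compare are assumptions; how the script actually calls Crossref is not documented here.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"

def crossref_url(doi: str) -> str:
    """Build the works URL for a DOI, percent-encoding unsafe characters."""
    return CROSSREF_WORKS + urllib.parse.quote(doi.strip(), safe="/")

def parse_crossref(payload: dict) -> dict:
    """Extract title, journal, and year from a Crossref works response."""
    msg = payload.get("message", {})
    date_parts = msg.get("issued", {}).get("date-parts", [[None]])
    has_year = bool(date_parts and date_parts[0] and date_parts[0][0])
    return {
        "title": (msg.get("title") or [""])[0],
        "journal": (msg.get("container-title") or [""])[0],
        "year": str(date_parts[0][0]) if has_year else "",
    }

def fetch_crossref(doi: str, timeout: float = 10.0) -> dict:
    """Query Crossref over the network and return the parsed fields."""
    with urllib.request.urlopen(crossref_url(doi), timeout=timeout) as resp:
        return parse_crossref(json.load(resp))
```

The parsed fields can then be compared against the claimed metadata with the same match rules used for PubMed records.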
### 5. Auto-Suggest (Optional, `--suggest`)
For mismatches, searches PubMed using claimed metadata to suggest correct PMIDs (top 3 candidates).
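One plausible way to turn claimed metadata into a PubMed search is shown below, using the E-utilities `esearch` endpoint and the `[au]`, `[jour]`, and `[dp]` field tags. The `build_term` and `esearch_url` helpers are hypothetical; the script's own query construction may differ.

```python
import urllib.parse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_term(claim: dict) -> str:
    """Combine the claimed fields into a single PubMed query string."""
    parts = []
    authors = claim.get("authors") or []
    if authors:
        # Assumes surname-first author strings such as "Ravelli A".
        parts.append(f"{authors[0].split()[0]}[au]")
    if claim.get("title"):
        parts.append(claim["title"])
    if claim.get("journal"):
        parts.append(f"{claim['journal']}[jour]")
    if claim.get("year"):
        parts.append(f"{claim['year']}[dp]")
    return " AND ".join(parts)

def esearch_url(claim: dict, retmax: int = 3) -> str:
    """Build an esearch URL that returns at most `retmax` candidate PMIDs."""
    query = urllib.parse.urlencode(
        {"db": "pubmed", "term": build_term(claim), "retmax": retmax, "retmode": "json"}
    )
    return f"{ESEARCH}?{query}"
```

With `retmode=json`, the candidate PMIDs are returned under `esearchresult.idlist` in the response.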
### 6. Topic Relevance (Optional, `--match-keywords`)
⚠️ **Note**: This checks *topic relevance* only (via filename keywords), NOT PMID correctness. It is an auxiliary screening tool; tune its sensitivity with `--threshold` (default: 0.2).
### 7. Report Output
| Format | Flag | Use case |
|--------|------|----------|
| HTML | `--output report.html` | Visual review with claimed vs actual comparison, verdict column |
| JSON | `--output report.json` | Programmatic processing |
| Text | default (no `--output`) | Quick terminal review |
## Using --claims / --claims-file
When you have explicit claimed metadata (e.g., from AI-generated documents):
**JSON array format:**
```json
[
{
"pmid": "34078778",
"title": "Juvenile idiopathic arthritis: from pathogenesis to clinical practice",
"authors": ["Zaripova LN", "Midgley A", "Beresford MW"],
"journal": "Pediatr Rheumatol Online J",
"year": "2021"
}
]
```
**CSV format** (`claims.csv`):
```csv
pmid,title,authors,journal,year
34078778,JIA pathogenesis,Zaripova LN;Midgley A,Pediatr Rheumatol Online J,2021
31018962,FMF classification criteria,Lidar M,Lancet,2014
```
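To make the two input formats concrete, here is a minimal sketch of a loader that normalises both into the same list of claim dictionaries. The `parse_claims` name and the detection rule (file extension first, then the first non-space character) are assumptions for illustration.

```python
import csv
import io
import json
import re

def parse_claims(text: str, filename: str = "") -> list[dict]:
    """Parse claims from JSON or CSV text into a list of dicts."""
    name = filename.lower()
    is_json = name.endswith(".json") or (
        not name.endswith(".csv") and text.lstrip().startswith(("[", "{"))
    )
    if is_json:
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    claims = []
    for row in csv.DictReader(io.StringIO(text)):
        # Authors may be separated by semicolons or pipes.
        authors = [a.strip() for a in re.split(r"[;|]", row.get("authors") or "") if a.strip()]
        claims.append({
            "pmid": (row.get("pmid") or "").strip(),
            "title": (row.get("title") or "").strip(),
            "authors": authors,
            "journal": (row.get("journal") or "").strip(),
            "year": (row.get("year") or "").strip(),
        })
    return claims
```

Keeping `year` and `pmid` as strings mirrors the JSON example, where both are quoted.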
## CLI Reference
```
python3 scripts/verify_pmids.py [OPTIONS]
Options:
--source PATH File or directory to scan for PMIDs
--pmids P1,P2,... Comma-separated PMIDs to verify directly
--claims JSON JSON string with claimed metadata
--claims-file FILE JSON or CSV file with claimed metadata
--verify-doi Also verify DOIs via Crossref
--suggest Auto-suggest correct PMIDs for mismatches
--match-keywords Check topic relevance (auxiliary)
--threshold FLOAT Keyword match threshold (default: 0.2)
--no-cache Disable cache, always query API
--cache-days N Cache validity in days (default: 30)
--output FILE Output file (.json or .html)
--format FORMAT Output format: json|html|text (default: text)
```
## Fixing Mismatched/Invalid PMIDs
1. Check the report's "Suggested" column (if using `--suggest`)
2. Use PubMed search to find the correct article:
```python
from scripts.verify_pmids import search_pubmed

results = search_pubmed('Ravelli[au] AND juvenile idiopathic arthritis AND Lancet[jour]')
for r in results:
    print(r["pmid"], r.get("title", ""))
```
3. Verify the replacement and update the source file
## API Limits
- No API key required (public E-utilities)
- 3 requests/second without API key
- Script enforces 0.4s delay between batches, 3-retry with backoff
- Batch size: 50 PMIDs per request
- Cache reduces repeated API calls significantly
## Performance
| Scenario | First run | Cached run |
|----------|-----------|------------|
| 225 PMIDs (MedWiki-Rheum) | ~7 min | ~5 sec |
| Single PMID | ~2s | ~1.4s |
## Files
| File | Purpose |
|------|---------|
| `scripts/verify_pmids.py` | Main verification script (v2.1, 1058 lines, zero external dependencies) |
| `references/api_examples.md` | PubMed E-utilities API examples |