Bohrium PDF Parser

Parse PDF documents via open.bohrium.com. Use when: user asks about extracting text, tables, charts, formulas, or molecules from PDF files on Bohrium, submit...

SkillRank score

7.8 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

bohrium-pdf-parser extracts text, tables, charts, formulas, and molecules from pdf documents via open.bohrium.com. supports url submission and file upload with async polling or synchronous blocking modes.

structure: 9.0
trigger phrases: 6.0
procedure: 9.0
edge cases: 7.0
documentation: 8.0

strengths

original SKILL.md from clawhub

---
name: bohrium-pdf-parser
description: "Parse PDF documents via open.bohrium.com. Use when: user asks about extracting text, tables, charts, formulas, or molecules from PDF files on Bohrium, submitting PDFs by URL or file upload. NOT for: file management, dataset management, or knowledge base operations."
---

# SKILL: Bohrium PDF Parser

## Overview

Parse PDF documents using the `open.bohrium.com` PDF parsing service. Extract text, tables, charts, formulas, and molecular structures from PDFs. Two submission methods:

- **URL submission** — provide a PDF download link (e.g. arXiv link)
- **File upload** — upload a local PDF file

**No CLI support** — all operations use the HTTP API.

## Authentication

ACCESS_KEY is read from the OpenClaw config `~/.openclaw/openclaw.json`:

```json
"bohrium-pdf-parser": {
  "enabled": true,
  "apiKey": "YOUR_ACCESS_KEY",
  "env": {
    "ACCESS_KEY": "YOUR_ACCESS_KEY"
  }
}
```

OpenClaw automatically injects `env.ACCESS_KEY` into the runtime.
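
Because every request depends on this variable, it can help to fail early when it is absent. The following is a minimal sketch, assuming the key reaches the process through the environment as described above; the helper names `require_access_key` and `build_headers` are hypothetical and not part of OpenClaw or the Bohrium API.

```python
import os


def require_access_key(env=None):
    """Return the ACCESS_KEY value, or raise a clear error if it is absent."""
    env = os.environ if env is None else env
    key = env.get("ACCESS_KEY", "").strip()
    if not key:
        # Hypothetical local message; the service itself answers
        # "AccessKey is required" when the header is missing.
        raise RuntimeError(
            "ACCESS_KEY is not set; add it under env in ~/.openclaw/openclaw.json"
        )
    return key


def build_headers(access_key):
    """Build the header dict used by the requests in this document."""
    return {"accessKey": access_key}
```

Calling `build_headers(require_access_key())` at startup surfaces a configuration problem before any PDF is submitted.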

## Common Code Template

```python
import os, time, requests

AK = os.environ.get("ACCESS_KEY", "")
BASE = "https://open.bohrium.com/openapi/v1/parse"
HEADERS = {"accessKey": AK}
HEADERS_JSON = {**HEADERS, "Content-Type": "application/json"}
```

---

## Parsing Workflow

```
1. Submit PDF (URL or file upload) → get token
2. Poll result with token → complete when status == "success"
```

Synchronous mode (`sync=true`) blocks until parsing completes but does not include content in the response — you still need `get-result` to retrieve it. Asynchronous mode (`sync=false`, default) requires polling `get-result` until status is `success`.

---

## URL Submission

```python
r = requests.post(f"{BASE}/trigger-url-async", headers=HEADERS_JSON, json={
    "url": "https://arxiv.org/pdf/2107.06922",
    "sync": False,
    "textual": True,
    "table": True,
    "molecule": True,
    "chart": True,
    "figure": False,
    "expression": True,
    "equation": True,
    "pages": [0],           # 0-indexed, omit to parse all pages
    "timeout": 1800
})
data = r.json()
token = data["token"]
print(f"Token: {token}, Status: {data['status']}")
# Token: 57d12c5a-..., Status: undefined
```

**Response Fields:**

| Field | Description |
|-------|-------------|
| `token` | Task identifier for querying results |
| `status` | Initial status is `undefined` |
| `created_time` | Creation time |
| `time_dict` | Per-stage timing (only `download_pdf` at this point) |
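
Reading `token` directly raises a bare `KeyError` when the submission is rejected. The sketch below shows one way to guard that step. It assumes that error payloads carry a truthy `code` and a `message`, which is what the polling example in this document checks for; the helper name `extract_token` is hypothetical.

```python
def extract_token(submit):
    """Return the task token from a submit response, or raise on an error payload."""
    # Assumption: a rejected submission carries a truthy "code" and a "message".
    if submit.get("code"):
        raise RuntimeError(f"Submit failed: {submit.get('message', 'no message')}")
    token = submit.get("token")
    if not token:
        raise RuntimeError("Submit response has no token")
    return token
```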

---

## File Upload

```python
from pathlib import Path

pdf_path = Path("./paper.pdf")
with open(pdf_path, "rb") as f:
    r = requests.post(f"{BASE}/trigger-file-async",
        headers=HEADERS,       # No Content-Type; requests handles multipart automatically
        files={"file": (pdf_path.name, f, "application/pdf")},
        data={
            "sync": "false",
            "textual": "true",
            "table": "true",
            "molecule": "true",
            "chart": "true",
            "figure": "false",
            "expression": "true",
            "equation": "true",
            "pages": 0,         # multipart only accepts a single integer
            "timeout": 1800
        })
token = r.json()["token"]
```

> **Important**: `pages` in multipart/form-data only accepts a **single integer** (e.g. `0`), not a JSON array `[0]`, or you'll get an `int_parsing` error. In JSON request bodies, arrays like `[0, 1, 2]` are supported.
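
The difference between the two request styles can be captured in a small converter. This is a sketch under the assumption that the multipart endpoint expects lowercase `"true"`/`"false"` strings, as in the upload example above; `to_form_fields` is a hypothetical helper name.

```python
def to_form_fields(options):
    """Convert Python parse options into multipart form values."""
    fields = {}
    for name, value in options.items():
        if value is None:
            continue  # leave the field out entirely
        if name == "pages":
            # bool is a subclass of int, so it is excluded explicitly
            if isinstance(value, bool) or not isinstance(value, int):
                raise ValueError("multipart 'pages' must be a single integer")
            fields[name] = str(value)
        elif isinstance(value, bool):
            fields[name] = "true" if value else "false"
        else:
            fields[name] = str(value)
    return fields
```

The returned dict can be passed as `data=` to `requests.post`, so a list such as `[0]` is rejected locally instead of producing an `int_parsing` error from the server.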

---

## Query Parse Result

```python
r = requests.post(f"{BASE}/get-result", headers=HEADERS_JSON, json={
    "token": token,
    "content": True,        # Return extracted text
    "objects": False,        # Return extracted objects (tables, figures, etc.)
    "pages_dict": True       # Return per-page results
})
data = r.json()
print(f"Status: {data['status']}, Content length: {len(data.get('content', ''))}")
```

**Response Fields:**

| Field | Description |
|-------|-------------|
| `status` | `success` / `undefined` (processing) / `failed` |
| `token` | Task identifier |
| `content` | Extracted text (LaTeX markup format) |
| `pages_dict` | Per-page result dictionary |
| `lang` | Detected language (`en` / `zh` etc.) |
| `proc_page` / `total_page` | Processed / total pages |
| `proc_textual` / `total_textual` | Processed / total text blocks |
| `proc_table` / `total_table` | Processed / total tables |
| `proc_mol` / `total_mol` | Processed / total molecules |
| `proc_equa` / `total_equa` | Processed / total equations |
| `time_dict` | Per-stage timing details |
| `cost` | Cost |
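
The paired `proc_*` / `total_*` counters lend themselves to a compact progress view while a task is still running. A minimal sketch, assuming the counters are plain integers; `progress_summary` is a hypothetical helper name.

```python
def progress_summary(result):
    """Map each counter pair in a get-result response to (processed, total)."""
    kinds = ("page", "textual", "table", "mol", "equa")
    summary = {}
    for kind in kinds:
        proc = result.get(f"proc_{kind}")
        total = result.get(f"total_{kind}")
        # Skip pairs the response does not report
        if proc is not None and total is not None:
            summary[kind] = (proc, total)
    return summary
```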

---

## Full Async Polling Example

```python
import os, time, requests

AK = os.environ.get("ACCESS_KEY", "")
BASE = "https://open.bohrium.com/openapi/v1/parse"
HEADERS = {"accessKey": AK}
HEADERS_JSON = {**HEADERS, "Content-Type": "application/json"}

# 1. Submit
r = requests.post(f"{BASE}/trigger-url-async", headers=HEADERS_JSON, json={
    "url": "https://arxiv.org/pdf/2107.06922",
    "sync": False,
    "textual": True, "table": True, "molecule": False,
    "chart": False, "figure": False,
    "expression": True, "equation": True,
    "pages": [0],
    "timeout": 1800
})
submit = r.json()
if submit.get("code"):
    print(f"Submit failed: {submit.get('message')}")
    raise SystemExit(1)

token = submit["token"]
print(f"Submitted, token={token}")

# 2. Poll for result
for attempt in range(30):
    time.sleep(2)
    r = requests.post(f"{BASE}/get-result", headers=HEADERS_JSON, json={
        "token": token,
        "content": True,
        "objects": False,
        "pages_dict": False
    })
    result = r.json()
    status = result.get("status", "")
    print(f"  [{attempt+1}] status={status}")

    if status == "success":
        print(f"Done! Content length: {len(result.get('content', ''))}")
        print(f"Language: {result.get('lang')}, Cost: {result.get('cost')}")
        print(f"Preview: {result.get('content', '')[:200]}")
        break
    elif status == "failed":
        print(f"Failed: {result.get('description', 'unknown error')}")
        break
else:
    print("Timeout: task did not complete within 30 polls (about 60 seconds)")
```
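
The polling loop above can also be factored into a reusable function that is easy to test without network access. The following is a sketch, not the service's own client: `poll_result` and its parameters are hypothetical names, and the `description` field on failure is taken from the example above.

```python
import time


def poll_result(fetch, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch() until the task succeeds or fails, or attempts run out.

    fetch is any zero-argument callable returning the get-result JSON as a dict.
    """
    for _ in range(max_attempts):
        result = fetch()
        status = result.get("status", "")
        if status == "success":
            return result
        if status == "failed":
            raise RuntimeError(result.get("description", "unknown error"))
        sleep(interval)  # status is still "undefined": wait and ask again
    raise TimeoutError(f"task still pending after {max_attempts} attempts")
```

With the setup from this document, `fetch` could be `lambda: requests.post(f"{BASE}/get-result", headers=HEADERS_JSON, json={"token": token, "content": True}).json()`. On `TimeoutError` the token remains valid for a later query.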

---

## Synchronous Mode Example

Synchronous mode (`sync=true`) blocks until parsing completes, so no polling is needed. However, the **response does not include the `content` field** — you still need to call `get-result` to retrieve the parsed content:

```python
# 1. Synchronous submit — blocks until parsing completes
r = requests.post(f"{BASE}/trigger-url-async", headers=HEADERS_JSON, json={
    "url": "https://arxiv.org/pdf/2107.06922",
    "sync": True,           # Wait for completion
    "textual": True, "table": True,
    "molecule": False, "chart": False, "figure": False,
    "expression": True, "equation": True,
    "pages": [0],
    "timeout": 1800
})
submit = r.json()
token = submit["token"]
# submit["status"] == "success", but no content field

# 2. Retrieve content
r = requests.post(f"{BASE}/get-result", headers=HEADERS_JSON, json={
    "token": token,
    "content": True, "objects": False, "pages_dict": False
})
result = r.json()
print(f"Content: {result['content'][:200]}")
```

---

## Parse Options Reference

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `sync` | bool | `false` | `true` blocks until complete (still need `get-result` for content), `false` requires polling |
| `textual` | bool | - | Extract text content |
| `table` | bool | - | Extract tables |
| `molecule` | bool | - | Extract molecular structures |
| `chart` | bool | - | Extract charts |
| `figure` | bool | - | Extract figures/images |
| `expression` | bool | - | Extract math expressions |
| `equation` | bool | - | Extract equations |
| `pages` | list[int] | all | Pages to parse (0-indexed) |
| `timeout` | int | - | Timeout in seconds |
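
Since the table lists seven independent extractor switches, a builder that takes the set of wanted extractors keeps request bodies consistent. This is a sketch under two assumptions: that sending an explicit `false` for unselected extractors is accepted (the examples in this document do so), and that `1800` is a reasonable timeout because the examples use it. `build_url_payload` and `EXTRACTORS` are hypothetical names.

```python
EXTRACTORS = ("textual", "table", "molecule", "chart", "figure", "expression", "equation")


def build_url_payload(url, enable=("textual",), pages=None, sync=False, timeout=1800):
    """Build the JSON body for trigger-url-async from a set of extractor names."""
    unknown = set(enable) - set(EXTRACTORS)
    if unknown:
        raise ValueError(f"unknown extractors: {sorted(unknown)}")
    payload = {"url": url, "sync": sync, "timeout": timeout}
    for name in EXTRACTORS:
        payload[name] = name in enable
    if pages is not None:
        payload["pages"] = list(pages)  # 0-indexed; left out to parse all pages
    return payload
```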

---

## curl Examples

```bash
AK="YOUR_ACCESS_KEY"
BASE="https://open.bohrium.com/openapi/v1/parse"

# URL submission
curl -s -X POST "$BASE/trigger-url-async" \
  -H "Content-Type: application/json" \
  -H "accessKey: $AK" \
  -d '{"url":"https://arxiv.org/pdf/2107.06922","sync":false,"textual":true,"table":true,"molecule":false,"chart":false,"figure":false,"expression":true,"equation":true,"pages":[0],"timeout":1800}'

# File upload
curl -s -X POST "$BASE/trigger-file-async" \
  -H "accessKey: $AK" \
  -F "file=@paper.pdf" \
  -F "sync=false" -F "textual=true" -F "table=true" \
  -F "pages=0"

# Query result
curl -s -X POST "$BASE/get-result" \
  -H "Content-Type: application/json" \
  -H "accessKey: $AK" \
  -d '{"token":"YOUR_TOKEN","content":true,"objects":false,"pages_dict":true}'
```

---

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `AccessKey is required` | Missing or incorrect accessKey | Header name is `accessKey` (case-sensitive), not `Authorization: Bearer` |
| `int_parsing` error | `pages` sent as JSON array in file upload | Use a single integer for `pages` in multipart form |
| `status: undefined` | Async task not yet complete | Poll `get-result` again; recommended interval: 2 seconds |
| Connection timeout | Domain/network issue | Use `open.bohrium.com`; test connectivity via `curl -I https://open.bohrium.com/openapi` |
| Content has LaTeX markup | Normal behavior | Results use `\begin{title}` etc. to mark structure; post-process to extract plain text |
| Large file parses slowly | Many pages or complex content | Use `pages` parameter to limit scope |


added explicit intent, inputs with auth details and edge cases, procedure as 5 numbered steps with clear inputs/outputs, decision points for URL vs file, sync vs async, error states and missing keys, output contract with data format and field reference, outcome signal tied to response fields and extraction counts.

SKILL: Bohrium PDF Parser

intent

extract text, tables, charts, formulas, equations, and molecular structures from PDF documents using the open.bohrium.com API. use this skill when a user asks to parse or analyze PDF content from a URL (e.g. arXiv) or upload a local file. not for file management, dataset operations, or knowledge base indexing.

inputs

authentication:

ACCESS_KEY environment variable, injected by OpenClaw from the env.ACCESS_KEY entry of the bohrium-pdf-parser block in ~/.openclaw/openclaw.json. required; a missing key causes the request to fail with an "AccessKey is required" error.

external connection:

open.bohrium.com HTTP API endpoint. requires internet access and connectivity to bohrium's servers. test with curl -I https://open.bohrium.com/openapi.

user input (one of):

URL to a publicly accessible PDF file (e.g. https://arxiv.org/pdf/2107.06922)
local PDF file path (e.g. ./paper.pdf)

parse configuration (optional):

textual (bool): extract text blocks
table (bool): extract tables
molecule (bool): extract molecular structures
chart (bool): extract charts
figure (bool): extract figures and images
expression (bool): extract math expressions
equation (bool): extract equations
pages (list of ints for URL, single int for file upload): 0-indexed page numbers to parse (omit to parse all)
timeout (int, seconds): max time to wait (the examples use 1800; no default is documented)
sync (bool, default false): true blocks until parsing completes; false returns immediately and the result must be polled

procedure

step 1: prepare headers and base URL

input: ACCESS_KEY from environment
create headers dict with key accessKey set to ACCESS_KEY (case-sensitive)
base URL: https://open.bohrium.com/openapi/v1/parse
output: headers dict, base URL ready for POST requests

step 2a: submit PDF via URL

input: public PDF URL, parse options

POST to {BASE}/trigger-url-async with headers and JSON body:

{
  "url": "<PDF URL>",
  "sync": false,
  "textual": true,
  "table": true,
  "molecule": true,
  "chart": true,
  "figure": false,
  "expression": true,
  "equation": true,
  "pages": [0],
  "timeout": 1800
}

output: JSON response with fields token, status (initially "undefined"), created_time, time_dict
extract token for polling

step 2b: submit PDF via file upload

input: local file path, parse options
POST to {BASE}/trigger-file-async as multipart/form-data (do not set Content-Type header; requests library handles it)
file field: (filename, file_object, "application/pdf")
data fields: sync, textual, table, molecule, chart, figure, expression, equation, pages (single integer, not array), timeout
output: JSON response with token
critical: pages parameter in multipart must be a single integer (e.g. 0) not a JSON array, or int_parsing error occurs

step 3: poll result (if async)

input: token from step 2

POST to {BASE}/get-result with headers and JSON body:

{
  "token": "<token>",
  "content": true,
  "objects": false,
  "pages_dict": true
}

output: JSON response with status, content, pages_dict, lang, progress fields (proc_page, total_page, proc_table, total_table, etc.), time_dict, cost
repeat every 2 seconds until status == "success" or "failed"

step 4: handle sync mode (optional)

input: token from sync=true submission
sync mode blocks in step 2 until parsing completes, but does NOT return content in response
must call get-result (step 3) to retrieve parsed content even after sync completion

step 5: extract and process result

input: JSON response from get-result with status == "success"
output: content field contains extracted text in LaTeX markup format (e.g. \begin{title}, \end{title} mark sections); pages_dict contains per-page breakdown; lang is detected language code (e.g. "en", "zh")
post-process content if plain text is needed (strip or parse LaTeX markup)
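
One simple form of that post-processing is to drop the environment markers and keep the text between them. The sketch below assumes the markup consists of `\begin{name}` / `\end{name}` pairs like the `\begin{title}` example; it does not convert equation bodies or other LaTeX commands, and `strip_env_markers` is a hypothetical helper name.

```python
import re

# Matches \begin{...} and \end{...} markers with a brace-free name
_ENV_TAG = re.compile(r"\\(?:begin|end)\{[^{}]*\}")


def strip_env_markers(content):
    r"""Remove \begin{...} and \end{...} markers, keeping the text between them."""
    text = _ENV_TAG.sub("\n", content)
    lines = [line.strip() for line in text.splitlines()]
    # Drop the empty lines left behind by removed markers
    return "\n".join(line for line in lines if line)
```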

decision points

if user provides a URL: use step 2a (trigger-url-async). URLs must be publicly accessible (no auth headers).

if user provides a file: use step 2b (trigger-file-async). file must exist locally and be readable.

if sync=false (default): use step 3 polling loop. must implement retry logic with 2-second interval and timeout (~60 seconds or 30 attempts recommended). if polling times out, task may still complete on server; store token for manual re-query.

if sync=true: step 2 blocks but does not return content. always call step 3 (get-result) afterward to retrieve parsed data.

if status == "failed": check the description field in the response for error details. plausible causes (not confirmed by the API reference) include an unreachable PDF URL, a corrupt file, or a timeout during parsing.

if status == "undefined" after poll timeout: task is still processing. retry polling or advise user to check back later with the same token.

if result is empty (content length == 0): PDF may be image-only, scanned without OCR, or parse options all set to false. verify parse options match content type.

if missing ACCESS_KEY: request will fail immediately with "AccessKey is required". user must configure OpenClaw ~/.openclaw/openclaw.json and restart runtime.

if network timeout or 5xx error: bohrium service may be down or unreachable. retry with exponential backoff; if persistent, advise user to test curl -I https://open.bohrium.com/openapi.

if pages parameter causes int_parsing error in file upload: user likely sent array [0, 1] instead of single int 0. convert to single page index or omit pages to parse all.
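
The exponential backoff mentioned for network and 5xx errors could look like the following. It is a generic sketch rather than anything prescribed by the service: the function names, retry count, and delay values are all assumptions.

```python
import time


def backoff_delays(retries=4, base=1.0, cap=30.0):
    """Return the wait before each retry: base * 2**n, capped at cap seconds."""
    return [min(cap, base * (2 ** n)) for n in range(retries)]


def call_with_backoff(action, retries=4, base=1.0, cap=30.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run action(); on a retryable error wait, then try again up to retries times."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return action()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
```

When the call is made with `requests`, pass `retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout)`, because those classes do not derive from the built-in `ConnectionError` and `TimeoutError`. To retry on 5xx responses as well, have `action` call `raise_for_status()` and add `requests.exceptions.HTTPError` to the tuple.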

output contract

success state:

HTTP 200 response from get-result endpoint
JSON body with status: "success"
content field populated with extracted text (LaTeX format)
lang field with detected language
pages_dict field with per-page results (if requested)
progress fields (proc_page, total_page, proc_table, total_table, proc_mol, total_mol, proc_equa, total_equa) showing extraction counts
time_dict showing per-stage timing (download_pdf, plus later stages whose names depend on the service)
cost field with resource cost estimate

data format:

extracted text uses LaTeX markup (e.g. \begin{title}Title\end{title}, \begin{equation}x=y\end{equation})
tables, molecules, equations stored as separate entries in pages_dict[page_num]['objects'] if objects=true
language code (ISO 639-1 format, e.g. "en", "zh", "fr")

file location:

no files written to disk by default; all data returned in JSON response
user must parse and save response if persistence is needed
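
Persisting a response takes only a few lines. A minimal sketch using the standard library; `save_result` and the output path are hypothetical.

```python
import json
from pathlib import Path


def save_result(result, path):
    """Write a get-result response to disk as UTF-8 JSON and return the path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable in the file
    target.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
    return target
```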

outcome signal

user knows the skill worked when:

polling loop completes with status == "success" (not "undefined" or "failed")
content field contains non-empty string with extracted text
proc_page >= expected page count and proc_textual > 0 (at least one text block extracted)
detected language in lang field matches PDF content
time_dict shows timing for the parse stages, not only download_pdf
if tables/equations requested, corresponding proc_table and proc_equa counts > 0
user can copy/view extracted text from content field
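
The checks above can be collected into one function that reports what is missing. This is a sketch assuming the counter fields are integers and may be absent; `outcome_problems` is a hypothetical helper name, and the language and `time_dict` checks are left out because they need knowledge of the document.

```python
def outcome_problems(result, expected_pages=None, want_tables=False, want_equations=False):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if result.get("status") != "success":
        problems.append(f"status is {result.get('status')!r}, not 'success'")
    if not result.get("content"):
        problems.append("content is empty")
    if expected_pages is not None and result.get("proc_page", 0) < expected_pages:
        problems.append("fewer pages processed than expected")
    if result.get("proc_textual", 0) <= 0:
        problems.append("no text blocks extracted")
    if want_tables and result.get("proc_table", 0) <= 0:
        problems.append("no tables extracted")
    if want_equations and result.get("proc_equa", 0) <= 0:
        problems.append("no equations extracted")
    return problems
```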
