---
name: bohrium-dataset
description: "Manage Bohrium datasets via bohr CLI or open.bohrium.com API. Use when: user asks about creating/listing/deleting datasets on Bohrium, uploading data, or managing dataset versions. NOT for: file management, job submission, or node management."
---
# SKILL: Bohrium Dataset Management
## Overview
Manage datasets on the Bohrium platform. **Prefer `bohr` CLI**; fall back to the API for version management, quota checks, etc.
`bohr dataset create` advantages over web upload: **no size limit** and **resumable upload**.
Datasets solve common pain points:
- Repeated file upload on every job submission -> mount datasets to avoid re-upload
- Large input files with slow upload -> datasets support resumable upload
- Need to share data with collaborators -> datasets support project-level sharing
## Authentication
```json
"bohrium-dataset": {
  "enabled": true,
  "apiKey": "YOUR_ACCESS_KEY",
  "env": { "ACCESS_KEY": "YOUR_ACCESS_KEY" }
}
```
## Prerequisites: Install bohr CLI
```bash
# macOS
/bin/bash -c "$(curl -fsSL https://dp-public.oss-cn-beijing.aliyuncs.com/bohrctl/1.0.0/install_bohr_mac_curl.sh)"
# Linux
/bin/bash -c "$(curl -fsSL https://dp-public.oss-cn-beijing.aliyuncs.com/bohrctl/1.0.0/install_bohr_linux_curl.sh)"
source ~/.bashrc && export PATH="$HOME/.bohrium:$PATH"
export OPENAPI_HOST=https://open.bohrium.com
```
---
## List Datasets
```bash
bohr dataset list # Default: recent 50
bohr dataset list -n 10 --json # JSON, top 10
bohr dataset list -p 154 # Filter by project ID
bohr dataset list -t "my-dataset" # Search by title
```
**JSON fields:** `id`, `title`, `path` (mount path like `/bohr/my-dataset/v1`), `projectName`, `creatorName`, `updateTime`, `desc`
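
As a sketch of how the `path` field might be consumed in a script, the snippet below picks the mount path for a given title out of parsed `--json` output. It assumes the command prints a JSON array of objects carrying the fields listed above; the top-level shape of the output is not documented here, and `find_mount_path` is a hypothetical helper name.

```python
import json


def find_mount_path(list_output: str, title: str):
    """Return the `path` of the first dataset whose title matches, or None.

    Assumes `list_output` is a JSON array of objects with `title` and `path`
    keys (an assumption about the shape of `bohr dataset list --json`).
    """
    for entry in json.loads(list_output):
        if entry.get("title") == title:
            return entry.get("path")
    return None


# Sample data shaped like the documented fields (values are illustrative).
sample = json.dumps([
    {"id": 138201, "title": "my-dataset", "path": "/bohr/my-dataset/v1"},
    {"id": 108601, "title": "other", "path": "/bohr/other/v3"},
])
print(find_mount_path(sample, "my-dataset"))
```

In practice the input string could come from `subprocess.run(["bohr", "dataset", "list", "--json"], capture_output=True, text=True).stdout`.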
---
## Create Dataset (Upload Data)
```bash
bohr dataset create \
  -n "my-dataset" \
  -p "my-dataset" \
  -i 154 \
  -l "/path/to/local/data"
```
| Parameter | Short | Required | Description |
|-----------|-------|----------|-------------|
| `--name` | `-n` | Yes | Dataset name |
| `--path` | `-p` | Yes | Dataset path identifier (alphanumeric) |
| `--pid` | `-i` | Yes | Project ID |
| `--lp` | `-l` | Yes | Local data directory path |
| `--comment` | `-m` | No | Description |
> **Resumable upload**: If the upload is interrupted (network issues, etc.), re-run the same command and enter `y` to resume from the point where it stopped.
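
When the create command is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the short flags from the table above; `build_create_command` is a hypothetical helper, not part of the bohr CLI.

```python
import shlex


def build_create_command(name, path, project_id, local_dir, comment=None):
    """Assemble the argv list for `bohr dataset create` from the documented flags."""
    argv = ["bohr", "dataset", "create",
            "-n", name, "-p", path, "-i", str(project_id), "-l", local_dir]
    if comment:
        # -m / --comment is optional, so it is only added when provided
        argv += ["-m", comment]
    return argv


argv = build_create_command("my-dataset", "my-dataset", 154, "/path/to/local/data")
print(shlex.join(argv))  # shell-quoted form, useful for logging
```

The list can be passed to `subprocess.run(argv)`. Because resuming an interrupted upload asks for a `y` confirmation, the command should run attached to a terminal where that prompt can be answered.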
---
## Using Datasets
### Mount in Compute Jobs
Add `dataset_path` to `job.json`:
```json
{
  "job_name": "DeePMD-kit test",
  "command": "cd se_e2_a && dp train input.json",
  "project_id": 154,
  "machine_type": "c4_m15_1 * NVIDIA T4",
  "job_type": "container",
  "image_address": "registry.dp.tech/dptech/deepmd-kit:2.1.5-cuda11.6",
  "dataset_path": ["/bohr/my-dataset/v1", "/bohr/another-dataset/v2"]
}
```
> `dataset_path` and `-p` (input directory) can be used simultaneously.
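
If `job.json` is generated programmatically, the `dataset_path` array can be added with a small check on the path format. This is a sketch only: the pattern `/bohr/<identifier>/v<N>` is inferred from the examples in this document rather than from a published specification, and `with_datasets` is a hypothetical helper.

```python
import json
import re

# Assumed mount-path shape, inferred from the examples: /bohr/<identifier>/v<N>
MOUNT_PATH = re.compile(r"^/bohr/[^/]+/v\d+$")


def with_datasets(job: dict, paths: list) -> dict:
    """Return a copy of `job` with `dataset_path` set, rejecting malformed paths."""
    bad = [p for p in paths if not MOUNT_PATH.match(p)]
    if bad:
        raise ValueError(f"unexpected mount path(s): {bad}")
    return {**job, "dataset_path": list(paths)}


job = {"job_name": "DeePMD-kit test", "project_id": 154}
updated = with_datasets(job, ["/bohr/my-dataset/v1", "/bohr/another-dataset/v2"])
print(json.dumps(updated, indent=2))
```

Returning a copy leaves the original job dictionary untouched, so the same base job can be reused with different dataset versions.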
### Mount on Dev Nodes
Select datasets when creating a container node; access via path (e.g. `/bohr/my-dataset/v1`).
- Mounting adds a 2-4 s boot delay, regardless of how many datasets are selected
- Use `df -a | grep bohr` to view mount points
### Use in Notebooks
1. Expand side panel in Notebook editor -> Select existing datasets
2. Hover dataset name -> click copy to get path
3. Use in code: `cd /bohr/testdataset-6xwt/v1/`
> Datasets must be added **before** connecting to the node. Adding afterward requires a node restart.
---
## Version Management
Datasets support multi-version management. Files within a version are immutable once created.
### Create New Version
Via Web UI: Dataset details -> "New Version" -> system imports latest version files -> add/remove files -> Create.
Via API:
```python
# BASE and HEADERS_JSON are defined in the "API Supplement" section below
requests.post(f"{BASE}/{dataset_id}/version", headers=HEADERS_JSON,
              json={"versionDesc": "v2 update"})
```
Preparation time depends on file size and count.
---
## Delete Datasets
```bash
bohr dataset delete 138201 # Single
bohr dataset delete 138201 108601 # Batch
```
> Deleted versions cannot be recovered.
---
## Permission Model
| Permission | Description | Default holders |
|-----------|-------------|-----------------|
| Manageable | Edit, delete, create versions | Dataset creator, project creator/admin |
| Usable | View and use | All project members |
> "Usable" permission can be granted to other projects or users via editing.
---
## API Supplement (CLI Unsupported)
```python
import os, requests

AK = os.environ.get("ACCESS_KEY", "")
BASE = "https://open.bohrium.com/openapi/v1/ds"
HEADERS = {"accessKey": AK}
HEADERS_JSON = {**HEADERS, "Content-Type": "application/json"}
# Reference snippets, meant to be run one at a time.
# Set dataset_id and version_id to real IDs before running a call that uses them.

# Dataset details
r = requests.get(f"{BASE}/{dataset_id}", headers=HEADERS)

# Version list
r = requests.get(f"{BASE}/{dataset_id}/version", headers=HEADERS)
# Returns: [{version, totalCount, totalSize, downloadUri, datasetPath, ...}]

# Specific version
r = requests.get(f"{BASE}/{dataset_id}/version/{version_id}", headers=HEADERS)

# Create via API
r = requests.post(f"{BASE}/", headers=HEADERS_JSON, json={
    "title": "my-dataset", "projectId": 154,
    "identifier": "my-dataset",  # Required, unique ID
})
# Returns: {datasetId, tiefbluePath, requestId}
# Then upload files via tiefblue, then call commit

# Commit
requests.put(f"{BASE}/commit", headers=HEADERS_JSON,
             json={"datasetId": dataset_id})

# New version
requests.post(f"{BASE}/{dataset_id}/version", headers=HEADERS_JSON,
              json={"versionDesc": "v2 update"})

# Update info
requests.put(f"{BASE}/{dataset_id}", headers=HEADERS_JSON,
             json={"title": "new-title"})

# Delete version
requests.delete(f"{BASE}/{dataset_id}/version/{version_id}", headers=HEADERS)

# Check quota
r = requests.get(f"{BASE}/quota/check", headers=HEADERS,
                 params={"projectId": 154})
# Returns: {result: true, limit: 30, used: 5}

# Upload token (for tiefblue)
r = requests.get(f"{BASE}/input/token", headers=HEADERS,
                 params={"projectId": 154, "path": "/bohr/my-dataset"})

# Permissions
r = requests.get(f"{BASE}/{dataset_id}/permission", headers=HEADERS)

# Associated projects
r = requests.get(f"{BASE}/project", headers=HEADERS)
```
**Important**: The dataset list API path is `GET /v1/ds/` (**with trailing slash**), not `/v1/ds/list` (`/list` gets caught by the `/:id` route).
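
One way to avoid the `/list` pitfall is to derive the list URL from the base path in a single helper. The sketch below only normalizes the trailing slash; `list_url` is a hypothetical name, and any query parameters the list endpoint accepts are not documented here.

```python
def list_url(base: str) -> str:
    """Return the dataset-list URL: the base path with exactly one trailing slash."""
    return base.rstrip("/") + "/"


BASE = "https://open.bohrium.com/openapi/v1/ds"
print(list_url(BASE))
```

The result can then be used as `requests.get(list_url(BASE), headers=HEADERS)`.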
---
## Status Codes
| status | Meaning |
|--------|---------|
| 1 | Creating / uncommitted |
| 2 | Committed / available |
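
A script that needs a dataset or version to be ready could poll until the status reaches 2. The following is a minimal sketch under stated assumptions: `fetch_status` is a caller-supplied function (hypothetical) that returns the integer status, for example by reading it from the dataset-details response, and the timeout and interval defaults are illustrative.

```python
import time


def wait_until_available(fetch_status, timeout_s=1800.0, interval_s=15.0,
                         sleep=time.sleep):
    """Poll `fetch_status()` until it returns 2 (committed / available).

    Returns True once status 2 is seen, or False if `timeout_s` seconds of
    waiting elapse first. `sleep` is injectable so the loop can be tested.
    """
    waited = 0.0
    while True:
        if fetch_status() == 2:
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Demo with a fake status source: two "creating" polls, then "available".
statuses = iter([1, 1, 2])
print(wait_until_available(lambda: next(statuses), sleep=lambda s: None))
```

Passing the status source as a function keeps the polling logic independent of how the status field is named in the API response.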
## Troubleshooting
| Problem | Cause | Solution |
|---------|-------|----------|
| Upload interrupted | Network instability | Re-run same command, enter `y` to resume |
| Dataset path not found | Wrong mount path | Check `path` with `bohr dataset list --json` |
| Job can't access dataset | Not in job.json | Add `"dataset_path": ["/bohr/xxx/v1"]` |
| `/ds/list` returns error | Route caught by `/:id` | Use `GET /ds/` (root path) |
| Missing `identifier` error | Required field | Add `identifier` (alphanumeric) |
| Version preparing (~5 min) | Files being copied | Large files take time; contact support on failure |
| Dataset unavailable in Notebook | Added after node connection | Restart node to take effect |
manage datasets on the Bohrium platform by creating, listing, deleting, and versioning data collections that can be mounted into compute jobs or dev nodes. use this when a user asks about storing reusable input data, avoiding repeated uploads, sharing datasets across projects, or checking dataset quotas. the skill covers both the bohr CLI (preferred for most operations) and the open.bohrium.com REST API (needed for version management, quota checks, and permission queries).
External connection: Bohrium open API
- https://open.bohrium.com/openapi/v1/ds
- set ACCESS_KEY to your Bohrium access key, or pass it as "accessKey" header in HTTP requests

Prerequisites: bohr CLI
- macOS: `/bin/bash -c "$(curl -fsSL https://dp-public.oss-cn-beijing.aliyuncs.com/bohrctl/1.0.0/install_bohr_mac_curl.sh)"`
- Linux: `/bin/bash -c "$(curl -fsSL https://dp-public.oss-cn-beijing.aliyuncs.com/bohrctl/1.0.0/install_bohr_linux_curl.sh)"`
- then run `source ~/.bashrc && export PATH="$HOME/.bohrium:$PATH"` and `export OPENAPI_HOST=https://open.bohrium.com`

Required parameters
- PROJECT_ID: numeric ID of the Bohrium project where the dataset will live
- DATASET_NAME: human-readable name (e.g. "my-dataset")
- DATASET_PATH: alphanumeric identifier used in mount paths (e.g. "my-dataset")
- LOCAL_DATA_DIR: filesystem path to the local data directory to upload

Optional parameters
- DATASET_DESCRIPTION: text description (flag -m or --comment)
- VERSION_ID: for version-specific queries
- PAGE_SIZE: for list queries (default 50)

input: project ID (optional), search term (optional), page size (optional)
run one of:
bohr dataset list # top 50 datasets
bohr dataset list -n 10 --json # top 10, JSON output
bohr dataset list -p 154 # filter by project ID 154
bohr dataset list -t "my-dataset" # search by title
output: table or JSON with fields: id, title, path (e.g. /bohr/my-dataset/v1), projectName, creatorName, updateTime, desc
input: dataset name (-n), path identifier (-p), project ID (-i), local directory path (-l), optional description (-m)
run:
bohr dataset create \
-n "my-dataset" \
-p "my-dataset" \
-i 154 \
-l "/path/to/local/data" \
-m "optional description"
output: dataset ID, mount path (e.g. /bohr/my-dataset/v1), status 1 (creating/uncommitted)
edge case: resumable upload on network interrupt
if the upload is interrupted, re-run the same command and enter y to continue from the last checkpoint
input: access key, project ID, dataset title, dataset identifier, file content (staged in tiefblue)
step 3a: create dataset metadata
import os, requests
AK = os.environ.get("ACCESS_KEY")
BASE = "https://open.bohrium.com/openapi/v1/ds"
HEADERS_JSON = {"accessKey": AK, "Content-Type": "application/json"}
r = requests.post(f"{BASE}/", headers=HEADERS_JSON, json={
"title": "my-dataset",
"projectId": 154,
"identifier": "my-dataset"
})
# output: {datasetId, tiefbluePath, requestId}
dataset_id = r.json()["datasetId"]
tiefblue_path = r.json()["tiefbluePath"]
output: dataset ID (use in step 3b), tiefblue upload path
step 3b: upload files to tiefblue staging area (not covered in detail here; use tiefblue client or S3-compatible upload)
step 3c: commit dataset
r = requests.put(f"{BASE}/commit", headers=HEADERS_JSON,
json={"datasetId": dataset_id})
output: status transitions from 1 (creating) to 2 (committed/available)
input: dataset ID, existing version with files, new files to add/remove, version description
in web UI (open.bohrium.com):
output: new version ID, new mount path (e.g. /bohr/my-dataset/v2), status 1 (preparing)
edge case: version preparation time
input: dataset ID, version description text
r = requests.post(f"{BASE}/{dataset_id}/version", headers=HEADERS_JSON,
json={"versionDesc": "v2 update"})
output: new version ID, status 1 (preparing)
input: dataset ID
r = requests.get(f"{BASE}/{dataset_id}/version", headers=HEADERS)
# output: [{version, totalCount, totalSize, downloadUri, datasetPath, ...}]
output: list of version objects with mount paths, sizes, timestamps
input: dataset mount path(s) (e.g. /bohr/my-dataset/v1), job.json file
edit or create job.json:
{
"job_name": "DeePMD-kit test",
"command": "cd se_e2_a && dp train input.json",
"project_id": 154,
"machine_type": "c4_m15_1 * NVIDIA T4",
"job_type": "container",
"image_address": "registry.dp.tech/dptech/deepmd-kit:2.1.5-cuda11.6",
"dataset_path": ["/bohr/my-dataset/v1", "/bohr/another-dataset/v2"]
}
submit job via Bohrium UI or bohr job submit (not covered here)
output: dataset mounted at specified path inside container; files readable at runtime
note: dataset_path and -p (input directory) can both be specified; they do not conflict
input: dataset path (e.g. /bohr/my-dataset/v1)
in Bohrium UI, when creating a container node:
at node startup:
df -a | grep bohr
output: dataset accessible at path (e.g. /bohr/my-dataset/v1) on mounted filesystem
edge case: mounting after node connection
input: dataset path
in Bohrium Notebook editor:
import os; os.chdir("/bohr/testdataset-6xwt/v1/")
output: dataset files accessible in notebook kernel at specified path
edge case: same as step 8
input: project ID, access key
r = requests.get(f"{BASE}/quota/check", headers=HEADERS,
params={"projectId": 154})
# output: {result: true, limit: 30, used: 5}
output: quota object with limit (max datasets), used (current count), result (boolean success)
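
as a small illustration of reading this response, the sketch below computes how many datasets can still be created. it assumes the response body has already been parsed into a dict with the limit and used fields shown above; remaining_quota is a hypothetical helper name.

```python
def remaining_quota(quota: dict) -> int:
    """Number of datasets that can still be created, never below zero."""
    return max(quota["limit"] - quota["used"], 0)


quota = {"result": True, "limit": 30, "used": 5}  # shape taken from the example above
print(remaining_quota(quota))  # prints 25
```

a result of 0 corresponds to the used >= limit case, where unused datasets have to be deleted before new ones can be created.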
input: dataset ID, new title (or other fields)
r = requests.put(f"{BASE}/{dataset_id}", headers=HEADERS_JSON,
json={"title": "new-title"})
output: status 200, dataset title updated
input: one or more dataset IDs
delete single:
bohr dataset delete 138201
delete multiple:
bohr dataset delete 138201 108601
output: dataset removed (cannot be recovered)
edge case: cascading deletes
input: dataset ID, version ID
r = requests.delete(f"{BASE}/{dataset_id}/version/{version_id}", headers=HEADERS)
output: version removed; other versions unaffected
input: dataset ID
r = requests.get(f"{BASE}/{dataset_id}", headers=HEADERS)
output: dataset object with id, title, path, status, createdTime, updatedTime, projectName, creatorName, desc
input: project ID, dataset path
r = requests.get(f"{BASE}/input/token", headers=HEADERS,
params={"projectId": 154, "path": "/bohr/my-dataset"})
output: temporary auth token for tiefblue upload endpoint
input: dataset ID
r = requests.get(f"{BASE}/{dataset_id}/permission", headers=HEADERS)
output: permission list showing which users/projects can view (usable) or edit (manageable)
input: access key
r = requests.get(f"{BASE}/project", headers=HEADERS)
output: list of project objects accessible to authenticated user
if user wants to upload data for the first time:
use step 2 (CLI bohr dataset create). it handles resumable upload, requires no tiefblue knowledge, and has no size limits.
else if user already has a dataset and wants to add a new version: use step 4 (web UI) or step 5 (API). if the version is large (100k+ files), expect 5-30 min prep time and check status via step 6.
if user wants to mount data in a job:
use step 7. add dataset_path array to job.json with full mount paths (e.g. /bohr/my-dataset/v1).
if user wants to mount data on a dev node: use step 8. select datasets in UI before connecting to the node. if already connected, restart the node (step 8 edge case).
if user is working in a Notebook: use step 9. same constraint as step 8: add datasets before connection.
if user needs to check how many datasets they can still create:
use step 10 (quota check via API). if used >= limit, delete unused datasets (step 12).
if user is uploading via API (advanced workflow): use steps 3a, 3b, 3c. this is slower and more complex than step 2, but allows custom file staging logic.
if the list API returns an error about /ds/list route:
the API route /v1/ds/list is caught by the catch-all /:id pattern. always use GET /v1/ds/ (root path with trailing slash) instead.
if create fails with "missing identifier" error:
the identifier field is required in step 3a. it must be unique within the project and alphanumeric (hyphens ok).
if a version is stuck in status 1 (preparing) for more than 30 min: contact Bohrium support. do not retry manually.
if dataset is not visible in a Notebook after creation: check that the node was not yet connected when you added the dataset. if already connected, restart the node via Bohrium UI.
for list operations (step 1):
- format: table by default, or JSON (with the --json flag)
- fields: id (integer), title (string), path (string, mount path), projectName (string), creatorName (string), updateTime (ISO 8601 timestamp), desc (string or null)
- example: {"id": 138201, "title": "my-dataset", "path": "/bohr/my-dataset/v1", "projectName": "proj-1", "creatorName": "user@example.com", "updateTime": "2024-01-15T10:30:00Z", "desc": "training data"}

for create operations (steps 2, 3):
- dataset ID, mount path (e.g. /bohr/my-dataset/v1), status code (integer: 1 = creating, 2 = committed)

for version operations (steps 4, 5, 6):
- example: {"version": 2, "datasetPath": "/bohr/my-dataset/v2", "totalCount": 150, "totalSize": 5368709120, "downloadUri": "https://...", "status": 1}

for delete operations (steps 12, 13):
- [...] message field)

for quota check (step 10):
- fields: result (boolean), limit (integer, max dataset count), used (integer, current count)
- example: {"result": true, "limit": 30, "used": 5}

for mount operations (steps 7, 8, 9):
- files accessible at the mount path (e.g. /bohr/my-dataset/v1/); use ls /bohr/my-dataset/v1/ to verify

for permission queries (step 16):
- userId or projectId (string), permission (string: "manageable" or "usable")

for API calls with errors:
- [...] message field explaining the error

successful list: table or JSON appears in stdout with at least one dataset row; if filtering by project or title, only matching rows appear.
successful upload (step 2): CLI prints "dataset created" or similar message with dataset ID and mount path; bohr dataset list --json shows the new dataset with status 2 (committed).
successful version creation (steps 4-5): web UI or API returns new version ID; bohr dataset list --json shows new mount path (e.g. /bohr/my-dataset/v2); version status is initially 1 (preparing), then 2 (available) after 5-30 min.
successful job mount (step 7): job starts and runs without "dataset not found" errors; cat /bohr/my-dataset/v1/filename inside the container shows expected file content.
successful node/notebook mount (steps 8-9): df -a | grep bohr lists the dataset path; ls /bohr/my-dataset/v1/ returns file listing; files are readable by the notebook kernel or node user.
successful delete: bohr dataset list no longer shows the deleted dataset ID; attempting to access the dataset path in a new job returns mount error.
successful quota check: JSON response includes result: true and shows used < limit; if used >= limit, user must delete datasets before creating new ones.
general API success: HTTP 200 (GET, HEAD), 201 (POST), 204 (DELETE), or 200 (PUT); response body is valid JSON matching documented schema.
original source: clawhub created by: (not declared)