Bohrium Dataset Management

Manage Bohrium datasets via bohr CLI or open.bohrium.com API. Use when: user asks about creating/listing/deleting datasets on Bohrium, uploading data, or man...

SkillRank score: 7.3 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

bohrium-dataset covers dataset lifecycle management on the Bohrium platform via the bohr CLI and REST API, including creation with resumable upload, versioning, mounting in jobs and notebooks, and permission control.

structure: 9.0
trigger phrases: 8.0
procedure: 8.0
edge cases: 6.0
documentation: 7.0

original SKILL.md from clawhub

---
name: bohrium-dataset
description: "Manage Bohrium datasets via bohr CLI or open.bohrium.com API. Use when: user asks about creating/listing/deleting datasets on Bohrium, uploading data, or managing dataset versions. NOT for: file management, job submission, or node management."
---

# SKILL: Bohrium Dataset Management

## Overview

Manage datasets on the Bohrium platform. **Prefer `bohr` CLI**; fall back to the API for version management, quota checks, etc.

`bohr dataset create` advantages over web upload: **no size limit** and **resumable upload**.

Datasets solve common pain points:
- Repeated file upload on every job submission -> mount datasets to avoid re-upload
- Large input files with slow upload -> datasets support resumable upload
- Need to share data with collaborators -> datasets support project-level sharing

## Authentication

```json
"bohrium-dataset": {
  "enabled": true,
  "apiKey": "YOUR_ACCESS_KEY",
  "env": { "ACCESS_KEY": "YOUR_ACCESS_KEY" }
}
```

## Prerequisites: Install bohr CLI

```bash
# macOS
/bin/bash -c "$(curl -fsSL https://dp-public.oss-cn-beijing.aliyuncs.com/bohrctl/1.0.0/install_bohr_mac_curl.sh)"
# Linux
/bin/bash -c "$(curl -fsSL https://dp-public.oss-cn-beijing.aliyuncs.com/bohrctl/1.0.0/install_bohr_linux_curl.sh)"
source ~/.bashrc && export PATH="$HOME/.bohrium:$PATH"
export OPENAPI_HOST=https://open.bohrium.com
```

---

## List Datasets

```bash
bohr dataset list                       # Default: recent 50
bohr dataset list -n 10 --json          # JSON, top 10
bohr dataset list -p 154                # Filter by project ID
bohr dataset list -t "my-dataset"       # Search by title
```

**JSON fields:** `id`, `title`, `path` (mount path like `/bohr/my-dataset/v1`), `projectName`, `creatorName`, `updateTime`, `desc`
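
A minimal sketch of looking up a mount path from the `--json` output. It assumes the command prints a single JSON array of objects carrying the fields listed above; the exact envelope is not shown in this document, so treat that shape as an assumption. `find_mount_path` and `list_datasets` are hypothetical helper names.

```python
import json
import subprocess
from typing import Optional


def find_mount_path(rows: list, title: str) -> Optional[str]:
    """Return the `path` of the first row whose `title` matches exactly."""
    for row in rows:
        if row.get("title") == title:
            return row.get("path")
    return None


def list_datasets(limit: int = 50) -> list:
    """Run `bohr dataset list -n <limit> --json` and parse stdout.

    Assumption: stdout is one JSON array and nothing else.
    """
    result = subprocess.run(
        ["bohr", "dataset", "list", "-n", str(limit), "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

Typical use would be `find_mount_path(list_datasets(), "my-dataset")`, which returns `None` when no title matches.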

---

## Create Dataset (Upload Data)

```bash
bohr dataset create \
  -n "my-dataset" \
  -p "my-dataset" \
  -i 154 \
  -l "/path/to/local/data"
```

| Parameter | Short | Required | Description |
|-----------|-------|----------|-------------|
| `--name` | `-n` | Yes | Dataset name |
| `--path` | `-p` | Yes | Dataset path identifier (alphanumeric; the examples also use hyphens) |
| `--pid` | `-i` | Yes | Project ID |
| `--lp` | `-l` | Yes | Local data directory path |
| `--comment` | `-m` | No | Description |

> **Resumable upload**: If interrupted (network issues, etc.), re-run the same command and enter `y` to resume from breakpoint.
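
When uploads are scripted, the argument list can be assembled from the table above. This is a sketch only: `build_create_command` and `create_dataset` are hypothetical names, and the code uses no flags beyond those documented in the table.

```python
import subprocess
from typing import List, Optional


def build_create_command(name: str, path: str, project_id: int,
                         local_dir: str,
                         comment: Optional[str] = None) -> List[str]:
    """Assemble the `bohr dataset create` argument list."""
    cmd = ["bohr", "dataset", "create",
           "-n", name, "-p", path, "-i", str(project_id), "-l", local_dir]
    if comment:
        cmd += ["-m", comment]  # optional description
    return cmd


def create_dataset(**kwargs) -> int:
    """Run the command with inherited stdio so the resume prompt stays visible."""
    return subprocess.run(build_create_command(**kwargs)).returncode
```

Because stdin and stdout are inherited, re-running `create_dataset` with the same arguments after an interruption still lets you answer `y` at the resume prompt.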

---

## Using Datasets

### Mount in Compute Jobs

Add `dataset_path` to `job.json`:

```json
{
  "job_name": "DeePMD-kit test",
  "command": "cd se_e2_a && dp train input.json",
  "project_id": 154,
  "machine_type": "c4_m15_1 * NVIDIA T4",
  "job_type": "container",
  "image_address": "registry.dp.tech/dptech/deepmd-kit:2.1.5-cuda11.6",
  "dataset_path": ["/bohr/my-dataset/v1", "/bohr/another-dataset/v2"]
}
```

> `dataset_path` and `-p` (input directory) can be used simultaneously.
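
If `job.json` is generated by a script, the `dataset_path` entry can be added programmatically. The sketch below assumes mount paths always start with `/bohr/`, as in the examples here; `with_datasets` and `write_job` are hypothetical helpers.

```python
import json
from typing import Iterable


def with_datasets(job: dict, dataset_paths: Iterable[str]) -> dict:
    """Return a copy of `job` with `dataset_path` set to the given mount paths."""
    paths = list(dataset_paths)
    for p in paths:
        # Assumption: mount paths look like /bohr/<identifier>/v<N>.
        if not p.startswith("/bohr/"):
            raise ValueError(f"not a dataset mount path: {p!r}")
    updated = dict(job)  # shallow copy; the caller's dict is left untouched
    updated["dataset_path"] = paths
    return updated


def write_job(job: dict, filename: str = "job.json") -> None:
    """Write the job description as indented JSON."""
    with open(filename, "w", encoding="utf-8") as fh:
        json.dump(job, fh, indent=2)
```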

### Mount on Dev Nodes

Select datasets when creating a container node; access via path (e.g. `/bohr/my-dataset/v1`).

- Adds 2-4s boot delay (regardless of count)
- Use `df -a | grep bohr` to view mount points

### Use in Notebooks

1. Expand side panel in Notebook editor -> Select existing datasets
2. Hover dataset name -> click copy to get path
3. Use in code: `cd /bohr/testdataset-6xwt/v1/`

> Datasets must be added **before** connecting to the node. Adding afterward requires a node restart.

---

## Version Management

Datasets support multi-version management. Files within a version are immutable once created.

### Create New Version

Via Web UI: Dataset details -> "New Version" -> system imports latest version files -> add/remove files -> Create.

Via API:
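```python
# `requests`, BASE, and HEADERS_JSON are defined in "API Supplement" below;
# dataset_id is the numeric `id` reported by `bohr dataset list`.
requests.post(f"{BASE}/{dataset_id}/version", headers=HEADERS_JSON,
    json={"versionDesc": "v2 update"})
```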

Preparation time depends on file size and count.

---

## Delete Datasets

```bash
bohr dataset delete 138201              # Single
bohr dataset delete 138201 108601       # Batch
```

> Deleted datasets and their versions cannot be recovered.

---

## Permission Model

| Permission | Description | Default holders |
|-----------|-------------|-----------------|
| Manageable | Edit, delete, create versions | Dataset creator, project creator/admin |
| Usable | View and use | All project members |

> "Usable" permission can be granted to other projects or users via editing.

---

## API Supplement (CLI Unsupported)

```python
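import os, requests

AK = os.environ.get("ACCESS_KEY", "")  # empty string if ACCESS_KEY is unset
BASE = "https://open.bohrium.com/openapi/v1/ds"
HEADERS = {"accessKey": AK}
HEADERS_JSON = {**HEADERS, "Content-Type": "application/json"}

# Placeholder IDs so the calls below run as written; substitute real values
dataset_id = 138201
version_id = 1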

# Dataset details
r = requests.get(f"{BASE}/{dataset_id}", headers=HEADERS)

# Version list
r = requests.get(f"{BASE}/{dataset_id}/version", headers=HEADERS)
# Returns: [{version, totalCount, totalSize, downloadUri, datasetPath, ...}]

# Specific version
r = requests.get(f"{BASE}/{dataset_id}/version/{version_id}", headers=HEADERS)

# Create via API
r = requests.post(f"{BASE}/", headers=HEADERS_JSON, json={
    "title": "my-dataset", "projectId": 154,
    "identifier": "my-dataset",  # Required, unique ID
})
# Returns: {datasetId, tiefbluePath, requestId}
# Then upload files via tiefblue, then call commit

# Commit
requests.put(f"{BASE}/commit", headers=HEADERS_JSON,
    json={"datasetId": dataset_id})

# New version
requests.post(f"{BASE}/{dataset_id}/version", headers=HEADERS_JSON,
    json={"versionDesc": "v2 update"})

# Update info
requests.put(f"{BASE}/{dataset_id}", headers=HEADERS_JSON,
    json={"title": "new-title"})

# Delete version
requests.delete(f"{BASE}/{dataset_id}/version/{version_id}", headers=HEADERS)

# Check quota
r = requests.get(f"{BASE}/quota/check", headers=HEADERS,
    params={"projectId": 154})
# Returns: {result: true, limit: 30, used: 5}

# Upload token (for tiefblue)
r = requests.get(f"{BASE}/input/token", headers=HEADERS,
    params={"projectId": 154, "path": "/bohr/my-dataset"})

# Permissions
r = requests.get(f"{BASE}/{dataset_id}/permission", headers=HEADERS)

# Associated projects
r = requests.get(f"{BASE}/project", headers=HEADERS)
```

**Important**: The dataset list API path is `GET /v1/ds/` (**with trailing slash**), not `/v1/ds/list` (`/list` gets caught by the `/:id` route).
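
One way to avoid the `/list` mistake is to build every URL through a single helper. The following is a sketch under the assumption that all dataset endpoints live under the `BASE` shown above; `ds_url` and `auth_headers` are hypothetical names, not part of any Bohrium SDK.

```python
BASE = "https://open.bohrium.com/openapi/v1/ds"


def ds_url(*parts) -> str:
    """Join path segments under BASE; with no segments, return the list URL."""
    if not parts:
        # The list endpoint is the root path and keeps its trailing slash.
        return BASE + "/"
    return BASE + "/" + "/".join(str(p) for p in parts)


def auth_headers(access_key: str, json_body: bool = False) -> dict:
    """Build the `accessKey` header, adding Content-Type for JSON bodies."""
    headers = {"accessKey": access_key}
    if json_body:
        headers["Content-Type"] = "application/json"
    return headers
```

For example, `requests.get(ds_url(), headers=auth_headers(AK))` targets the list endpoint, and `ds_url(dataset_id, "version")` targets the version list.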

---

## Status Codes

| status | Meaning |
|--------|---------|
| 1 | Creating / uncommitted |
| 2 | Committed / available |
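
The two status codes lend themselves to a simple polling loop. This sketch assumes the caller can supply a function that returns the current status, for instance by reading a `status` field from the dataset details response; that field's presence is an assumption, not something this document confirms. `wait_until_committed` is a hypothetical helper.

```python
import time
from typing import Callable

STATUS_CREATING = 1   # creating / uncommitted
STATUS_COMMITTED = 2  # committed / available


def wait_until_committed(fetch_status: Callable[[], int],
                         timeout_s: float = 600.0,
                         interval_s: float = 10.0,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `fetch_status` until it returns 2 or the timeout elapses."""
    waited = 0.0
    while True:
        if fetch_status() == STATUS_COMMITTED:
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

The `sleep` parameter exists only so the loop can be exercised without real delays.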

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Upload interrupted | Network instability | Re-run same command, enter `y` to resume |
| Dataset path not found | Wrong mount path | Check `path` with `bohr dataset list --json` |
| Job can't access dataset | Not in job.json | Add `"dataset_path": ["/bohr/xxx/v1"]` |
| `/ds/list` returns error | Route caught by `/:id` | Use `GET /ds/` (root path) |
| Missing `identifier` error | Required field | Add `identifier` (alphanumeric) |
| Version preparing (~5 min) | Files being copied | Large files take time; contact support on failure |
| Dataset unavailable in Notebook | Added after node connection | Restart node to take effect |


SKILL: Bohrium Dataset Management

intent

manage datasets on the Bohrium platform by creating, listing, deleting, and versioning data collections that can be mounted into compute jobs or dev nodes. use this when a user asks about storing reusable input data, avoiding repeated uploads, sharing datasets across projects, or checking dataset quotas. the skill covers both the bohr CLI (preferred for most operations) and the open.bohrium.com REST API (needed for version management, quota checks, and permission queries).

inputs

External connection: Bohrium open API

endpoint: https://open.bohrium.com/openapi/v1/ds
auth method: API key (access token)
setup: set env var ACCESS_KEY to your Bohrium access key, or pass it as "accessKey" header in HTTP requests
scope: full dataset lifecycle (CRUD, versioning, quota, permissions)
rate limits: not documented; assume standard REST API throttling; respect exponential backoff on 429 responses

Prerequisites: bohr CLI

macOS: /bin/bash -c "$(curl -fsSL https://dp-public.oss-cn-beijing.aliyuncs.com/bohrctl/1.0.0/install_bohr_mac_curl.sh)"
Linux: /bin/bash -c "$(curl -fsSL https://dp-public.oss-cn-beijing.aliyuncs.com/bohrctl/1.0.0/install_bohr_linux_curl.sh)"
post-install: run source ~/.bashrc && export PATH="$HOME/.bohrium:$PATH" and export OPENAPI_HOST=https://open.bohrium.com

Required parameters

PROJECT_ID: numeric ID of the Bohrium project where dataset will live
DATASET_NAME: human-readable name (e.g. "my-dataset")
DATASET_PATH: alphanumeric identifier used in mount paths (e.g. "my-dataset")
LOCAL_DATA_DIR: filesystem path to local data directory to upload

Optional parameters

DATASET_DESCRIPTION: text description (flag -m or --comment)
VERSION_ID: for version-specific queries
PAGE_SIZE: for list queries (default 50)

procedure

step 1: list existing datasets

input: project ID (optional), search term (optional), page size (optional)

run one of:

bohr dataset list                    # top 50 datasets
bohr dataset list -n 10 --json       # top 10, JSON output
bohr dataset list -p 154             # filter by project ID 154
bohr dataset list -t "my-dataset"    # search by title

output: table or JSON with fields: id, title, path (e.g. /bohr/my-dataset/v1), projectName, creatorName, updateTime, desc

step 2: create and upload dataset (CLI method, preferred)

input: dataset name (-n), path identifier (-p), project ID (-i), local directory path (-l), optional description (-m)

run:

bohr dataset create \
  -n "my-dataset" \
  -p "my-dataset" \
  -i 154 \
  -l "/path/to/local/data" \
  -m "optional description"

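output: dataset ID and mount path (e.g. /bohr/my-dataset/v1); the dataset is in status 1 (creating/uncommitted) while the upload is in progress and status 2 (committed/available) once it completes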

edge case: resumable upload on network interrupt

if upload fails midway, re-run the same command
when the CLI asks whether to resume, enter y to continue from the breakpoint
do not resume if you intend to upload different files

step 3: create and upload dataset (API method, for advanced workflows)

input: access key, project ID, dataset title, dataset identifier, file content (staged in tiefblue)

step 3a: create dataset metadata

import os, requests
AK = os.environ["ACCESS_KEY"]  # raises KeyError if the key is not set
BASE = "https://open.bohrium.com/openapi/v1/ds"
HEADERS = {"accessKey": AK}  # used by the GET and DELETE calls in later steps
HEADERS_JSON = {**HEADERS, "Content-Type": "application/json"}

r = requests.post(f"{BASE}/", headers=HEADERS_JSON, json={
    "title": "my-dataset",
    "projectId": 154,
    "identifier": "my-dataset"
})
r.raise_for_status()  # stop here on 4xx/5xx instead of parsing an error body
# output: {datasetId, tiefbluePath, requestId}
data = r.json()
dataset_id = data["datasetId"]
tiefblue_path = data["tiefbluePath"]

output: dataset ID (use in step 3b), tiefblue upload path

step 3b: upload files to tiefblue staging area (not covered in detail here; use tiefblue client or S3-compatible upload)

step 3c: commit dataset

r = requests.put(f"{BASE}/commit", headers=HEADERS_JSON,
    json={"datasetId": dataset_id})

output: status transitions from 1 (creating) to 2 (committed/available)
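
Steps 3a to 3c can be sketched as one function that enforces the ordering. Since the tiefblue upload is not covered here, the HTTP calls and the upload are passed in as callables. All names below are hypothetical, and the response keys are taken from the `{datasetId, tiefbluePath, requestId}` shape shown in step 3a.

```python
from typing import Callable


def create_via_api(create: Callable[[dict], dict],
                   upload: Callable[[str], None],
                   commit: Callable[[int], None],
                   title: str, project_id: int, identifier: str) -> int:
    """Run create -> upload -> commit in order and return the dataset ID.

    `create` should POST the payload to BASE + "/" and return the parsed JSON;
    `upload` receives the tiefbluePath; `commit` should PUT to BASE + "/commit".
    """
    info = create({"title": title, "projectId": project_id,
                   "identifier": identifier})
    dataset_id = info["datasetId"]
    upload(info["tiefbluePath"])   # step 3b: caller-supplied, not covered here
    commit(dataset_id)             # step 3c: runs only after the upload returns
    return dataset_id
```

Keeping the network calls outside the function makes the ordering easy to test with stand-ins before pointing it at the real API.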

step 4: create a new version from existing dataset (web UI method)

input: dataset ID, existing version with files, new files to add/remove, version description

in web UI (open.bohrium.com):

go to dataset details
click "New Version"
system auto-imports latest version files
add or remove files as needed
click "Create"

output: new version ID, new mount path (e.g. /bohr/my-dataset/v2), status 1 (preparing)

edge case: version preparation time

file count and size determine prep time (5-30 min typical)
if preparation stalls, contact support
once ready, status becomes 2 (available)

step 5: create a new version via API

input: dataset ID, version description text

r = requests.post(f"{BASE}/{dataset_id}/version", headers=HEADERS_JSON,
    json={"versionDesc": "v2 update"})

output: new version ID, status 1 (preparing)

step 6: list all versions of a dataset

input: dataset ID

r = requests.get(f"{BASE}/{dataset_id}/version", headers=HEADERS)
# output: [{version, totalCount, totalSize, downloadUri, datasetPath, ...}]

output: list of version objects with mount paths, sizes, timestamps
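
To pick the newest version from that list, a small helper is enough. This sketch assumes each entry's `version` field is an integer (or an integer-like string), as in the example objects in this document; `latest_version` is a hypothetical name.

```python
from typing import Optional


def latest_version(versions: list) -> Optional[dict]:
    """Return the entry with the highest `version` number, or None if empty."""
    if not versions:
        return None
    return max(versions, key=lambda v: int(v["version"]))
```

The result's `datasetPath` can then be used as the mount path in job.json.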

step 7: mount dataset in a compute job

input: dataset mount path(s) (e.g. /bohr/my-dataset/v1), job.json file

edit or create job.json:

{
  "job_name": "DeePMD-kit test",
  "command": "cd se_e2_a && dp train input.json",
  "project_id": 154,
  "machine_type": "c4_m15_1 * NVIDIA T4",
  "job_type": "container",
  "image_address": "registry.dp.tech/dptech/deepmd-kit:2.1.5-cuda11.6",
  "dataset_path": ["/bohr/my-dataset/v1", "/bohr/another-dataset/v2"]
}

submit job via Bohrium UI or bohr job submit (not covered here)

output: dataset mounted at specified path inside container; files readable at runtime

note: dataset_path and -p (input directory) can both be specified; they do not conflict

step 8: mount dataset on a dev node

input: dataset path (e.g. /bohr/my-dataset/v1)

in Bohrium UI, when creating a container node:

select the datasets to mount while creating the node
note each dataset's mount path (e.g. /bohr/my-dataset/v1)

at node startup:

dataset mounts automatically
boot delay adds 2-4 seconds regardless of dataset count
verify with df -a | grep bohr

output: dataset accessible at path (e.g. /bohr/my-dataset/v1) on mounted filesystem

edge case: mounting after node connection

if you add a dataset after the node is already running, the node must restart to see it
adding datasets before connection is the correct workflow

step 9: use dataset in a Jupyter notebook

input: dataset path

in Bohrium Notebook editor:

expand left side panel
select "datasets"
hover over dataset name and click "copy" to get full path
use in code: import os; os.chdir("/bohr/testdataset-6xwt/v1/")

output: dataset files accessible in notebook kernel at specified path

edge case: same as step 8

datasets must be added before connecting to the node
adding after connection requires node restart

step 10: check dataset quota

input: project ID, access key

r = requests.get(f"{BASE}/quota/check", headers=HEADERS,
    params={"projectId": 154})
# output: {result: true, limit: 30, used: 5}

output: quota object with limit (max datasets), used (current count), result (boolean success)
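
A sketch of interpreting the quota response before attempting a create. It assumes the response body is the bare object shown above and that `limit` and `used` count datasets, as this step describes; `remaining_quota` and `can_create` are hypothetical helpers.

```python
def remaining_quota(quota: dict) -> int:
    """Return how many more datasets fit, never below zero."""
    return max(0, int(quota["limit"]) - int(quota["used"]))


def can_create(quota: dict) -> bool:
    """True when the check succeeded and at least one slot is left."""
    return bool(quota.get("result")) and remaining_quota(quota) > 0
```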

step 11: update dataset metadata

input: dataset ID, new title (or other fields)

r = requests.put(f"{BASE}/{dataset_id}", headers=HEADERS_JSON,
    json={"title": "new-title"})

output: status 200, dataset title updated

step 12: delete dataset

input: one or more dataset IDs

delete single:

bohr dataset delete 138201

delete multiple:

bohr dataset delete 138201 108601

output: dataset removed (cannot be recovered)

edge case: cascading deletes

deleting a dataset removes all its versions
jobs using that dataset will fail at next run if they reference the deleted path
no warning is issued; coordinate with team before deleting

step 13: delete a specific version

input: dataset ID, version ID

r = requests.delete(f"{BASE}/{dataset_id}/version/{version_id}", headers=HEADERS)

output: version removed; other versions unaffected

step 14: query dataset details (single dataset)

input: dataset ID

r = requests.get(f"{BASE}/{dataset_id}", headers=HEADERS)

output: dataset object; expect fields similar to the list output (id, title, path, projectName, creatorName, updateTime, desc) plus status. the full response schema is not documented here.

step 15: get upload token for tiefblue (advanced)

input: project ID, dataset path

r = requests.get(f"{BASE}/input/token", headers=HEADERS,
    params={"projectId": 154, "path": "/bohr/my-dataset"})

output: temporary auth token for tiefblue upload endpoint

step 16: query permissions on a dataset

input: dataset ID

r = requests.get(f"{BASE}/{dataset_id}/permission", headers=HEADERS)

output: permission list showing which users/projects can view (usable) or edit (manageable)

step 17: list projects associated with your user

input: access key

r = requests.get(f"{BASE}/project", headers=HEADERS)

output: list of project objects accessible to authenticated user

decision points

if user wants to upload data for the first time: use step 2 (CLI bohr dataset create). it handles resumable upload, requires no tiefblue knowledge, and has no size limits.

else if user already has a dataset and wants to add a new version: use step 4 (web UI) or step 5 (API). if the version is large (100k+ files), expect 5-30 min prep time and check status via step 6.

if user wants to mount data in a job: use step 7. add dataset_path array to job.json with full mount paths (e.g. /bohr/my-dataset/v1).

if user wants to mount data on a dev node: use step 8. select datasets in UI before connecting to the node. if already connected, restart the node (step 8 edge case).

if user is working in a Notebook: use step 9. same constraint as step 8: add datasets before connection.

if user needs to check how many datasets they can still create: use step 10 (quota check via API). if used >= limit, delete unused datasets (step 12).

if user is uploading via API (advanced workflow): use steps 3a, 3b, 3c. this is slower and more complex than step 2, but allows custom file staging logic.

if the list API returns an error about /ds/list route: the API route /v1/ds/list is caught by the catch-all /:id pattern. always use GET /v1/ds/ (root path with trailing slash) instead.

if create fails with "missing identifier" error: the identifier field is required in step 3a. it must be unique within the project and alphanumeric (hyphens ok).

if a version is stuck in status 1 (preparing) for more than 30 min: contact Bohrium support. do not retry manually.

if dataset is not visible in a Notebook after creation: check that the node was not yet connected when you added the dataset. if already connected, restart the node via Bohrium UI.

output contract

for list operations (step 1):

output format: ASCII table (default) or JSON (with --json flag)
JSON fields: id (integer), title (string), path (string, mount path), projectName (string), creatorName (string), updateTime (ISO 8601 timestamp), desc (string or null)
example JSON row: {"id": 138201, "title": "my-dataset", "path": "/bohr/my-dataset/v1", "projectName": "proj-1", "creatorName": "user@example.com", "updateTime": "2024-01-15T10:30:00Z", "desc": "training data"}

for create operations (steps 2, 3):

output: dataset ID (integer), mount path (string, e.g. /bohr/my-dataset/v1), status code (integer: 1 = creating, 2 = committed)
file location: files persisted in Bohrium distributed storage, accessible via mount path

for version operations (steps 4, 5, 6):

output: version ID (integer), new mount path (string), status code (integer), totalCount (file count), totalSize (bytes), downloadUri (string or null)
example: {"version": 2, "datasetPath": "/bohr/my-dataset/v2", "totalCount": 150, "totalSize": 5368709120, "downloadUri": "https://...", "status": 1}

for delete operations (steps 12, 13):

output: HTTP 200 OK (no body) or error response (JSON with message field)

for quota check (step 10):

output format: JSON object with result (boolean), limit (integer, max dataset count), used (integer, current count)
example: {"result": true, "limit": 30, "used": 5}

for mount operations (steps 7, 8, 9):

output: files available in container/node/notebook kernel at specified path (e.g. /bohr/my-dataset/v1/); use ls /bohr/my-dataset/v1/ to verify

for permission queries (step 16):

output: JSON array of permission objects, each with userId or projectId (string), permission (string: "manageable" or "usable")

for API calls with errors:

output: HTTP status code + JSON body with message field explaining the error
common codes: 400 (bad request, missing field), 401 (unauthorized, bad access key), 404 (dataset not found), 409 (conflict, identifier already exists), 429 (rate limit)

outcome signal

successful list: table or JSON appears in stdout with at least one dataset row; if filtering by project or title, only matching rows appear.

successful upload (step 2): CLI prints "dataset created" or similar message with dataset ID and mount path; bohr dataset list --json shows the new dataset with status 2 (committed).

successful version creation (steps 4-5): web UI or API returns new version ID; bohr dataset list --json shows new mount path (e.g. /bohr/my-dataset/v2); version status is initially 1 (preparing), then 2 (available) after 5-30 min.

successful job mount (step 7): job starts and runs without "dataset not found" errors; cat /bohr/my-dataset/v1/filename inside the container shows expected file content.

successful node/notebook mount (steps 8-9): df -a | grep bohr lists the dataset path; ls /bohr/my-dataset/v1/ returns file listing; files are readable by the notebook kernel or node user.

successful delete: bohr dataset list no longer shows the deleted dataset ID; attempting to access the dataset path in a new job returns mount error.

successful quota check: JSON response includes result: true and shows used < limit; if used >= limit, user must delete datasets before creating new ones.

general API success: a 2xx HTTP status, with a JSON body where one is documented. the exact code per method is not specified in the source, so check for any 2xx rather than a specific value.

original source: clawhub
created by: (not declared)