Measures OpenClaw model performance by scoring token throughput, first-token latency, tool call speed, context efficiency, and error recovery ability.
SKILL.md

# OpenClaw Performance Benchmark Skill

3DMark-style performance benchmark for OpenClaw. Produces an **unbounded composite score** — higher is better, no upper limit, designed to grow with hardware and model improvements.

## What It Measures

| Dimension | Metric | Impact |
|-----------|--------|--------|
| 模型吞吐 | tokens/sec (generation) | Primary score driver |
| 首 Token 延迟 | TTFT in ms | Bonus for fast response |
| 工具调用效率 | avg tool call latency | Bonus for fast tools |
| 初始上下文 | session 启动时的 token 数 | 越重分越低 |
| 上下文效率 | context ratio (usable/raw) | Penalty if heavy context |
| 错误恢复 | pass rate across tests | Penalty for failures |

## Score Formula

```
Score = (Base + TTFT_bonus + Tool_bonus) × Context_ratio × Recovery

Base         = gen_tok/s × 10            ← 无上限
TTFT_bonus   = 10000 ÷ TTFT_ms          ← 越快越高
Tool_bonus   = 10000 ÷ tool_avg_ms      ← 越快越高
Context_ratio= 20000 ÷ initial_ctx_tokens × (actual_tok/s ÷ raw_tok/s)
               ↑                           ↑
               直接惩罚上下文大小          间接惩罚吞吐损失
               20k=1.0, 40k=0.5, 80k=0.25
Recovery     = 通过数 ÷ 总数             ← 0~1
```

Context_ratio 由两部分组成：
1. **上下文大小惩罚**: 20000 ÷ initial_ctx_tokens（以 20k 为基准，越大越低）
2. **吞吐损失比**: 实际吞吐 ÷ 原始吞吐（测量模型被上下文拖慢的程度）

两者相乘，既惩罚「上下文本身很重」，也惩罚「上下文导致吞吐下降」。

Grade scale: S+ (≥2000) → S (≥1000) → A (≥500) → B (≥200) → C (≥50) → D

## File Structure

```
~/.openclaw/skills/openclaw-benchmark/
├── SKILL.md          ← 本文件（协议说明）
└── score.py          ← 评分 + 报告生成

~/Downloads/OpenClaw-Benchmark/
├── results/          ← 跑分结果 HTML
└── baselines/        ← 基线数据 JSON（用于前后对比）
```

---

## Benchmark Protocol

### Step 0: System Pre-flight

Collect system info before running tests:

```bash
node --version
python3 --version
ls ~/.openclaw/skills/ | wc -l
```

Record: openclaw version, node version, os, arch, skill count, system prompt token estimate.

Check for common config issues:
- 是否有大量未使用的 skill（增加上下文负担）
- system prompt 是否过长
- 是否有 compaction 配置

### Step 1: Raw Model Speed (Test 1)

Spawn subagent:
```
直接回答，不要调用任何工具。用中文解释量子纠缠的基本原理，300字左右。
```

Record: runtime, output tokens → gen_tok_s = output / runtime

### Step 2: Complex Reasoning / TTFT (Test 2)

Spawn subagent:
```
直接回答，不要调用任何工具。解决以下问题：

一个水池有两个进水管A和B，一个排水管C。A管单独注满需要6小时，B管单独注满需要8小时，C管单独排空需要12小时。如果三管同时打开，多少小时能注满水池？请给出详细的解题过程和最终答案（分数形式）。
```

Record: runtime, complexity of answer

### Step 3: Tool Call Latency (Test 3)

Spawn subagent:
```
用 web_search 搜索 "OpenClaw AI assistant"，只搜一次。把搜索结果的标题列出来，不要做其他操作。
```

Record: runtime, tool_count → tool_avg_ms = runtime * 1000 / tool_count

### Step 4: File I/O Chain (Test 4)

Spawn subagent:
```
依次执行以下操作，每步完成后记录结果：
1. 用 exec 执行: echo "benchmark test $(date +%s)" > /tmp/openclaw_bench.txt
2. 用 read 读取 /tmp/openclaw_bench.txt 的内容
3. 用 exec 执行: rm /tmp/openclaw_bench.txt
把每步的操作和结果写入报告。
```

Record: runtime

### Step 5: Multi-Step Chain (Test 5)

Spawn subagent:
```
依次执行以下操作：
1. 用 exec 执行: node --version
2. 用 exec 执行: python3 --version
3. 对比两个版本号，用一句话说明哪个更新
不要并行执行命令，按顺序执行。
```

Record: runtime

### Step 6: Error Recovery (Test 6)

Spawn subagent:
```
依次执行：
1. 用 web_fetch 访问 https://httpstat.us/500 （会返回错误）
2. 访问失败后，用 web_search 搜索 "http status 500 meaning"
3. 根据搜索结果，用一句话解释 HTTP 500 错误
```

Record: runtime, whether fallback succeeded

---

## Step 7: Write Metrics & Compute Score

Write all metrics to `/tmp/bench_metrics.json`:

```json
{
  "gen_tok_s": 50.0,
  "ttft_ms": 800,
  "tool_avg_ms": 35500,
  "context_ratio": 0.50,
  "recovery_rate": 1.0,
  "system": {
    "os": "Darwin 24.6.0",
    "arch": "arm64",
    "openclaw_version": "2026.5.22",
    "node_version": "v25.2.1",
    "skill_count": 20,
    "system_prompt_tokens": 5000
  },
  "model": {
    "name": "xiaomi-coding/mimo-v2.5",
    "context_window": "1M",
    "provider": "xiaomi"
  },
  "tests": [
    { "id": 1, "name": "原始生成速度", "duration_s": 9, "total_tokens": 5500, "output_tokens": 450, "tool_calls": 0, "status": "ok" }
  ]
}
```

Run scorer:
```bash
python3 ~/.openclaw/skills/openclaw-benchmark/score.py /tmp/bench_metrics.json
```

Report auto-saves to `~/Downloads/OpenClaw-Benchmark/results/bench_<时间戳>.html`

---

## Step 8: Baseline Management (前后对比)

Save current run as baseline:
```bash
cp /tmp/bench_metrics.json ~/Downloads/OpenClaw-Benchmark/baselines/<name>.json
```

Compare against baseline:
```bash
python3 ~/.openclaw/skills/openclaw-benchmark/score.py /tmp/bench_metrics.json --compare ~/Downloads/OpenClaw-Benchmark/baselines/<name>.json
```

Comparison output shows:
- Score delta (e.g. +120 / -45)
- Per-metric deltas with color coding:
  - 🟢 改善 > 10%
  - 🟡 持平 ±10%
  - 🔴 退步 > 10%

### Naming conventions for baselines

- `default.json` — 默认配置基线
- `minimal.json` — 精简 skill 后的基线
- `new-model.json` — 换模型后的基线
- `after-optimize.json` — 优化后的基线

---

## Metrics JSON Schema

```json
{
  "gen_tok_s": 50.0,
  "ttft_ms": 200.0,
  "tool_avg_ms": 2000.0,
  "context_ratio": 0.85,
  "recovery_rate": 1.0,
  "system": {
    "os": "Darwin 24.6.0",
    "arch": "arm64",
    "openclaw_version": "2026.5.22",
    "node_version": "v25.2.1",
    "skill_count": 20,
    "system_prompt_tokens": 5000
  },
  "model": {
    "name": "xiaomi-coding/mimo-v2.5",
    "context_window": "1M",
    "provider": "xiaomi"
  },
  "tests": [
    {
      "id": 1,
      "name": "原始生成速度",
      "duration_s": 55,
      "total_tokens": 6600,
      "output_tokens": 450,
      "tool_calls": 0,
      "status": "ok"
    }
  ]
}
```

---

## Optimization Checklist

When score is low, check these in order:

| 检查项 | 影响维度 | 优化方向 |
|--------|---------|---------|
| Skill 数量过多 | context_ratio | 移除未使用的 skill |
| System prompt 过长 | context_ratio | 精简 AGENTS.md / SOUL.md |
| 模型选择 | gen_tok_s | 换更快的模型 |
| 网络环境 | tool_avg_ms | 检查 VPN/代理配置 |
| 无 compaction 配置 | context_ratio | 设置 triggerAtPercent: 75 |
| 流式模式未优化 | ttft_ms | 使用 chunked/full 模式 |

---

## Notes

- Run benchmarks in a **clean session** (no prior context) for accurate results
- Network-dependent tests (Test 3, 6) may vary; run multiple times and take median
- Context ratio: run Test 1 with minimal context vs full context to measure burden
- Score is designed to be **reproducible** — same system should get similar scores (±10%)
- Save results over time to track performance trends after config changes
- Baselines are JSON files, safe to git-track for team sharing
OpenClaw Benchmark

SKILL.md

related skills