Benchmark sglang serving performance on AMD Instinct GPUs (MI355X, MI300X, MI308X) with various parallel configurations (TP, DP, EP). Covers throughput/laten...
---
name: sglang-amd-bench
description: >
Benchmark sglang serving performance on AMD Instinct GPUs (MI355X, MI300X, MI308X)
with various parallel configurations (TP, DP, EP). Covers throughput/latency sweeps
(ISL, OSL, concurrency), TTFT/TPOT measurement, and config comparison. Mix mode only.
---
# SGLang AMD Benchmark
Benchmark sglang LLM serving on AMD Instinct GPUs across parallel configurations (TP/DP/EP) and workload shapes (ISL/OSL/Concurrency). This skill runs in **mix mode** (non-disaggregated) — prefill and decode happen on the same GPUs. It produces a performance baseline and suggests config-level optimizations.
## Run Rules (non-negotiable)
These rules apply to every benchmark run in this skill. (A profiling-stage-separation rule exists in the broader sglang-run guidance but is intentionally omitted here, since this skill does not profile.)
### Rule 1 — Do NOT modify the sglang/aiter/mori environment
**Never** run `pip install`, `pip uninstall`, `pip install --upgrade`, or any equivalent reinstall command for `sglang`, `aiter`, `mori`, `flydsl`, or any related kernel/runtime package — even if a workload fails or imports look broken. The user's environments are hand-tuned dev installs (typically `pip install -e .`); a naive reinstall will silently overwrite local patches and destroy hours of work.
If the environment looks broken (missing module, version mismatch, ABI error, import crash), **STOP** and report the symptom to the user. Let the user decide whether to reinstall.
What you CAN do without asking:
- Inspect versions: `pip show sglang`, `python -c "import sglang; print(sglang.__file__)"`
- Read source files in the editable install
- Set environment variables for the run
What you MUST ask before doing:
- `pip install` / `pip uninstall` / `pip install -U` for any package above
- `git checkout` / `git pull` inside the editable source directories
- Modifying files inside `sglang/`, `aiter/`, `mori/` source trees
### Rule 2 — Always preserve server logs when launching an sglang server
Whenever you start an sglang server, redirect stdout+stderr to a real file. Never let server output go only to the terminal or to `/dev/null`. The Bash tool's `run_in_background: true` buffer is **not** a substitute — still redirect to a file.
In this skill, `serve.sh` writes to `$LOG_DIR/server_<LABEL>.log` automatically — that's what satisfies this rule, and what `wait_for_server.py` (Rule 3) reads.
### Rule 3 — Wait for the server with the bundled monitor, don't blind-sleep
After launching an sglang server, startup typically takes a few minutes (model load, weight shard, kernel warmup, graph capture; AITER may JIT-compile CK kernels for several minutes on first launch). Do **not** `sleep 300` and hope. Use the bundled monitor — it polls the log and returns the moment the outcome is known:
```bash
# After 3-0 deploys it, the script lives at /sgl-workspace/wait_for_server.py inside the container.
python3 /sgl-workspace/wait_for_server.py "$SERVER_LOG"
# exit codes:
# 0 READY — saw "The server is fired up and ready to roll"
# 1 CRASHED — saw "Traceback"
# 2 HUNG — log's last line + line count unchanged for >5 min
# 3 TIMEOUT — overall timeout (default 30 min) exceeded
# 4 ERROR — log file unreadable / never appeared
```
Source lives at `scripts/wait_for_server.py` in this skill's directory; 3-0 copies it to `/sgl-workspace/` alongside `serve.sh` / `bench.sh`. Detection logic:
- **Success**: substring `The server is fired up and ready to roll` appears.
- **Crash**: substring `Traceback` appears.
- **Hang**: each poll records `(line_count, last_non_empty_line)` of the log; unchanged for ≥5 minutes (`--stall-seconds`) → treated as failed.
Tunable flags: `--success`, `--failure`, `--stall-seconds`, `--overall-timeout`, `--poll-seconds`. Bump `--stall-seconds` consciously if a specific config genuinely has long quiet periods (e.g. very large weight downloads, prolonged AITER JIT).
On `CRASHED` / `HUNG` / `TIMEOUT` / `ERROR`: stop and report the log tail to the user; do NOT silently restart.
## Important Notes
- This skill covers **mix mode only** (no PD-disaggregation). Prefill and decode run on the same GPUs.
- `serve.sh` sets `SGLANG_USE_AITER=1` automatically. `bench.sh` sets `PYTHONPATH` for sglang's benchmark module automatically. No need to set these manually.
- **Use dummy weights by default** (`LOAD_DUMMY=1`). Dummy weights are sufficient for benchmarking throughput, latency, and parallel config comparison — real weights produce the same performance characteristics. Only use `LOAD_DUMMY=0` if the user explicitly asks for real weights. Real weights take much longer to load (10+ minutes for large models) and are rarely needed for config benchmarking.
- `--random-range-ratio 1.0` ensures exact ISL/OSL lengths (no variation) for reproducible benchmarks.
- `bench.sh` uses `num_prompts = concurrency * 2` — this is handled by the script automatically.
- Between configs, fully kill the sglang server and wait for GPU memory to be freed before relaunching.
- If a benchmark run fails or hangs, check GPU memory usage with `rocm-smi` and server health with the `/health` endpoint.
## Key Metrics
Every benchmark collects these metrics per (ISL, OSL, Concurrency) combination:
| Metric | Unit | Description |
| ------------------ | ----- | --------------------------------------------------------- |
| TTFT | ms | Time To First Token — latency from request to first token |
| TPOT | ms | Time Per Output Token — average inter-token latency |
| Input throughput | tok/s | Input tokens processed per second across all requests |
| Output throughput | tok/s | Output tokens generated per second across all requests |
| Total throughput | tok/s | Input + Output token throughput combined |
| Per-GPU throughput | tok/s | Total throughput / number of GPUs |
Per-GPU throughput is the most important efficiency metric — it shows how well each GPU is utilized. Two configs might have similar total throughput, but the one using fewer GPUs has better per-GPU throughput and is more cost-efficient.
## Common Workspace Layout
The standard development environment uses `/sgl-workspace` as the root workspace inside Docker containers:
```
/sgl-workspace/
├── sglang/ # sglang source (installed via pip -e, dev mode)
├── aiter/ # AITER source (AMD AI Tensor Engine)
├── mori/ # Mori (communication library)
└── <model_short>_<YYYYMMDD>/ # benchmark output directories (created by this skill)
```
All benchmark artifacts (logs, reports) are saved under `/sgl-workspace/` by default. If the user specifies a different workspace, use that instead.
## Core Principle: Ask First, Execute Later
**Do NOT guess or assume any configuration.** Every detail must be explicitly confirmed by the user before execution begins. The workflow has two distinct phases:
1. **Planning phase** (Steps 0–1): Gather ALL information through conversation. Ask questions, wait for answers. Do not proceed to the next question until the current one is answered.
2. **Confirmation gate** (Step 2): Present the complete plan as a summary. Get explicit "go ahead" from the user.
3. **Execution phase** (Steps 3–4): Only after full confirmation, run the benchmarks.
If at any point you're unsure about a parameter, **ask**. Never fill in a value the user hasn't confirmed.
## Workflow
### Step 0: Model & Environment Discovery
**Ask the user these questions one by one. Wait for each answer before asking the next.**
#### 0a. Model selection — ask this FIRST
**"Which model do you want to benchmark?"**
The user may respond with:
- A full HuggingFace model ID (e.g., `deepseek-ai/DeepSeek-R1-0528`)
- A short name (e.g., "DeepSeek R1", "Llama 70B", "Qwen 235B")
- A local path to the model weights
If the user gives a short name, confirm the exact model ID (e.g., "Do you mean `deepseek-ai/DeepSeek-R1-0528`?").
#### 0b. Single-node or multi-node?
**"Is this single-node or multi-node?"**
- Single-node: 1 node, typically 8 GPUs
- Multi-node: ask how many nodes and GPUs per node
If multi-node, also ask for:
- Network interface (`GLOO_SOCKET_IFNAME`)
- InfiniBand HCAs (`NCCL_IB_HCA`)
- Head node IP (`SGLANG_HOST_IP`)
#### 0c. Access the GPU node
**"How do I access the GPU node?"**
- SSH command? (e.g., `ssh user@gpu-node`)
- Docker container? (e.g., `docker exec -it <container> bash`)
- Already on the machine?
- For multi-node: ask about access to each node
#### 0d. Probe the environment
Once connected, probe automatically (no need to ask — just run and report back):
- Run `rocm-smi --showid` → report GPU count, model (MI355X, MI300X, MI308X), architecture
- Run `pip show sgl-kernel 2>/dev/null && python3 -c "import sglang; print('sglang version:', sglang.__version__)"` → report sglang version
- Run `pip list | grep -i aiter` → report AITER status
- Check common paths: `/sgl-workspace/sglang`, `/sgl-workspace/aiter`, `/sgl-workspace/mori`
**PYTHONPATH probe (important for Docker environments):** When running inside Docker containers via `docker exec -d` (non-interactive), `.bashrc` is often not sourced due to `[ -z "$PS1" ] && return` guards. This can cause `PYTHONPATH` to be missing paths for editable installs (aiter, mori, sglang), leading to import errors like `ImportError: aiter is required when SGLANG_USE_AITER is set to True`. The `serve.sh` script auto-detects and adds common workspace paths (`/sgl-workspace/aiter`, `/sgl-workspace/mori`, `/sgl-workspace/sglang/python`) to `PYTHONPATH` if they exist but are missing. However, if you encounter import errors, compare the environments:
```bash
# Non-interactive PYTHONPATH (what docker exec -d sees)
docker exec <container> bash -c 'echo $PYTHONPATH'
# Interactive PYTHONPATH (what the user sees)
docker exec <container> bash -ic 'echo $PYTHONPATH' 2>/dev/null
```
If they differ, ensure the missing paths are exported before running `serve.sh`.
**If any probe reveals a broken package or missing dependency, follow Rule 1 above: report and stop. Do NOT `pip install/uninstall` sglang/aiter/mori or otherwise modify the environment yourself.**
#### 0e. Locate model weights
The user may or may not have specified where the model weights are stored. If they haven't provided a path, do a quick search — but don't waste time on this:
Quick places to check:
- `$HUGGINGFACE_HUB_CACHE` env var
- `~/.cache/huggingface/hub/`
- Common mount points: `/mnt`, `/raid`, `/data`
Note: HuggingFace cache stores models as `models--<Org>--<Name>/snapshots/<hash>/`. For example, `Qwen/Qwen3.5-397B-A17B-FP8` would be at `models--Qwen--Qwen3.5-397B-A17B-FP8/snapshots/<hash>/`. Look for this pattern.
If you find a match, confirm with the user:
> "I found what looks like the model weights at `/data/models/DeepSeek-R1-0528/`. Is this the right location?"
If nothing turns up quickly, ask:
> "I couldn't find the model weights on this machine. Where are they stored?"
The `--model-path` can be either:
- A **local path** directly to the weights (e.g., `/data/models/DeepSeek-R1/`)
- A **HuggingFace model ID** (e.g., `Qwen/Qwen3.5-397B-A17B-FP8`) — but only if the weights already exist in `$HUGGINGFACE_HUB_CACHE`. If the weights are at `$HUGGINGFACE_HUB_CACHE/models--<Org>--<Name>`, using the HF model ID is preferred. You can also `export HUGGINGFACE_HUB_CACHE=<path>` to point to the right cache dir.
Do NOT let sglang trigger a model download — the weights must already be on disk.
#### 0f. Report findings and confirm
Present everything you found to the user:
> "Here's what I have so far:
>
> - **Model**: deepseek-ai/DeepSeek-R1-0528
> - **Weights**: /data/models/DeepSeek-R1-0528/
> - **GPUs**: 8x MI355X (gfx950)
> - **sglang**: v0.5.x at /sgl-workspace/sglang
> - **AITER**: installed
> - **Setup**: single-node
>
> Does this look right? Anything I should know about this environment?"
### Step 1: Configuration Planning
**Ask each of these questions explicitly. Do not move forward until you have clear answers for ALL of them.**
#### 1a. MTP decision (if applicable)
If the model is MTP-capable (detected via `mtp_num_hidden_layers` in config.json, or known models like DeepSeek-R1/V3, Qwen3.5), ask:
**"This model supports Multi-Token Prediction (MTP), which can improve decode throughput. MTP is configured by a step count `N` (`MTP=0` disables it; `MTP=N` for `N>0` enables N speculative steps). By default we run with `MTP=0` for a clean baseline. What would you like to do?"**
1. Run with `MTP=0` only (baseline)
2. Run with `MTP=N` for a chosen `N` (ask the user for `N`)
3. Run both `MTP=0` and `MTP=N`, and compare
If the user wants MTP enabled, determine:
- **MTP steps** (`MTP=N`, where `N` is an integer ≥ 1, NOT a 0/1 toggle). If unsure, ask the user.
- **MTP algorithm** (`MTP_ALGO`): model-dependent — see `references/server_config.md` for the per-model table
`serve.sh` handles all speculative decoding flags (`--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens`) automatically from `MTP` and `MTP_ALGO`.
#### 1b. Server setup
Check if a sglang server is already running — don't ask the user, just probe:
```bash
curl -s http://localhost:30000/health && echo "Server is running" || echo "No server running"
pgrep -fa "sglang.launch_server" || true
```
- If a server is running: inform the user and ask whether to shut it down or use it as-is. By default, shut it down so the skill controls the server lifecycle for each config.
- If no server is running: good — the skill will launch one for each config.
Ask: **"Any additional sglang launch flags you want to use?"** (e.g., `--quantization fp8`, `--chunked-prefill-size`, `--schedule-policy`, etc.)
Note: `--disable-radix-cache` is enabled by default in `serve.sh` for benchmarking. User can opt out with `DISABLE_RADIX_CACHE=0`.
#### 1c. Parallel configurations
This is the most important decision in the benchmark. Read `references/server_config.md` for the full reference on parallelism types, naming conventions, EP modes, and how to reason about config choices.
**Before asking the user**, do the following:
1. **Read the model's `config.json`** from the weights directory directly (it's short). Look for KV heads, Q heads, expert count, and detect attention type (MLA/GQA/MHA). See `references/server_config.md` for the key fields to look for — but note that field names vary across models, so read carefully.
2. **Analyze** the 4 factors described in `references/server_config.md` → "How to Reason About Parallel Config":
- Weight size vs GPU HBM → which TP values fit?
- Attention type + KV heads → TP or DP-attention?
- MoE vs Dense → EP applicable?
- EP mode → all-to-all or all-reduce?
3. **Present your analysis to the user** — show your reasoning (weight size calc, KV head implications, why certain configs are better). Then present a suggested config table and **ask the user to pick**.
4. **If EP is involved**, ask which EP mode (all-to-all or all-reduce), or suggest benchmarking both.
Wait for the user to respond. If they say "try all of them" or "you decide", confirm your suggested set before proceeding.
#### 1d. Benchmark sweep parameters
**"What ISL (input sequence length), OSL (output sequence length), and concurrency levels do you want to sweep?"**
If the user isn't sure, offer options but still ask them to pick:
> "Some common approaches:
>
> 1. **Specific pairs** — e.g., (ISL=512, OSL=256), (ISL=1024, OSL=512) — good for simulating real workloads
> 2. **Full sweep** — provide separate ISL, OSL, and CON lists, benchmark all combinations
>
> Which approach? And what values?"
If the user says "you pick" or "whatever makes sense", then suggest values and **ask for confirmation before proceeding**:
> "Here's what I'd suggest:
>
> - ISL: 128, 512, 1024, 2048, 4096
> - OSL: 128, 512, 1024, 2048
> - Concurrency: 1, 16, 64, 128, 256
>
> That's 5 × 4 × 5 = 100 runs per config, times 2 configs = 200 total runs.
> Estimated ~3+ hours. Want to proceed with these, or adjust?"
### Step 2: Confirmation Gate
**Do NOT start any benchmark until this step is complete.**
#### Naming convention
Use this pattern for directories:
```
BENCH_DIR=/sgl-workspace/<model_short>_<YYYYMMDD>
```
Per-config dirs: `<CONFIG>_mtp<N>` where `N` is the MTP step count (`0` = off, e.g. `DP8EP8_mtp0`, `TP8_mtp0`, `DP8EP8_mtp3`)
#### Present the plan summary
> **Benchmark Plan Summary**
>
>
> | Item | Value |
> | --------- | -------------------------------------- |
> | Model | deepseek-ai/DeepSeek-R1-0528 |
> | GPU | 8x MI355X |
> | Mode | Mix (non-disaggregated) |
> | Bench dir | `/sgl-workspace/DeepSeek-R1_20260322/` |
>
>
> **Sweep:** ISL=[128, 512, 1024, 2048], OSL=[128, 512, 1024], CON=[1, 16, 64, 128, 256]
#### Confirm configs with dry-run
For each parallel config, **actually run `scripts/serve.sh` with `DRY_RUN=1`** on the GPU node — do NOT construct the launch command manually. The dry-run output shows the exact command that will be executed, ensuring consistency between what the user confirms and what actually runs.
For a small number of configs (2-3), present all dry-run outputs at once. For many configs, present them one by one. Get confirmation before proceeding to execution.
```bash
BENCH_DIR=/sgl-workspace/<model_short>_$(date +%Y%m%d)
# Config 1 — dry run
MODEL_PATH=<MODEL_PATH> CONFIG=DP8EP8_A2A MTP=0 \
LOG_DIR=$BENCH_DIR/DP8EP8_A2A_mtp0 DRY_RUN=1 bash serve.sh
# Config 2 — dry run
MODEL_PATH=<MODEL_PATH> CONFIG=TP8 MTP=0 \
LOG_DIR=$BENCH_DIR/TP8_mtp0 DRY_RUN=1 bash serve.sh
```
Show the **full dry-run output** (including the complete formatted sglang launch command with all flags) to the user and ask: **"Do these configs look right?"**
If the user wants changes, adjust and re-run the dry run. Once confirmed, proceed to Step 3.
### Step 3: Benchmark Execution
Only proceed here after the user has confirmed ALL configs in Step 2.
**Always use `serve.sh` and `bench.sh` to launch the server and run benchmarks.** Do NOT construct sglang commands manually — the scripts handle critical flags (`--enable-dp-attention`, `--enable-dp-lm-head`, `SGLANG_USE_AITER`, `PYTHONPATH`, etc.) that are easy to miss.
#### 3-0. Deploy benchmark scripts to the remote node
The `scripts/serve.sh`, `scripts/bench.sh`, `scripts/stop.sh`, `scripts/verify_stop.sh`, and `scripts/wait_for_server.py` files live in the skill directory on the local machine. `serve.sh`/`bench.sh`/`stop.sh`/`wait_for_server.py` run inside the container; `verify_stop.sh` MUST run on the host (so it can see PIDs from sibling containers).
```bash
# From local: scripts → remote node → into container (verify_stop.sh stays on the host)
scp scripts/serve.sh scripts/bench.sh scripts/stop.sh scripts/verify_stop.sh scripts/wait_for_server.py <SSH_HOST>:/tmp/
ssh <SSH_HOST> "docker cp /tmp/serve.sh <CONTAINER>:/sgl-workspace/ && docker cp /tmp/bench.sh <CONTAINER>:/sgl-workspace/ && docker cp /tmp/stop.sh <CONTAINER>:/sgl-workspace/ && docker cp /tmp/wait_for_server.py <CONTAINER>:/sgl-workspace/"
```
Alternatively, if you're already inside the container, write the script content directly using `cat > /sgl-workspace/serve.sh << 'SCRIPT' ... SCRIPT`.
**Important:** Avoid running scripts through nested `ssh → docker exec → bash -c` with inline heredocs — the quoting becomes unmanageable. Always copy scripts to the remote first, then run them simply with `bash serve.sh`.
#### For each parallel config:
**3a. Launch sglang server**
Launch in background so you can proceed to benchmarking:
```bash
MODEL_PATH=<MODEL_PATH> CONFIG=<CONFIG> MTP=<N> \
LOG_DIR=$BENCH_DIR/<CONFIG>_mtp<N> \
BACKGROUND=1 bash serve.sh
```
`serve.sh` writes the server's stdout+stderr to `$LOG_DIR/server_<LABEL>.log`, which is what satisfies Rule 2 (persistent server log) and what `wait_for_server.py` in 3b reads.
If the user already has a running server, skip the launch and use their URL.
**3b. Wait for server ready**
Per Rule 3 above, use the bundled `scripts/wait_for_server.py` — do NOT `sleep` blindly and do NOT roll your own `tail -f | grep` loop. The script already handles stall detection (≥ 5 min unchanged) and avoids matching benign substrings like `Ignore import error` / `UserWarning`.
```bash
# Script was copied to /sgl-workspace/ in 3-0 alongside serve.sh / bench.sh.
SERVER_LOG=$(ls -t $BENCH_DIR/<CONFIG>_mtp<N>/server_*.log | head -1)
python3 /sgl-workspace/wait_for_server.py "$SERVER_LOG"
# exit codes:
# 0 READY — saw "The server is fired up and ready to roll"
# 1 CRASHED — saw "Traceback"; stop and report tail of $SERVER_LOG to user
# 2 HUNG — log stalled ≥ --stall-seconds (default 300s); stop and report
# 3 TIMEOUT — overall --overall-timeout (default 1800s) exceeded
# 4 ERROR — log file unreadable / never appeared
```
If AITER JIT compilation legitimately produces long quiet periods on a particular config, bump `--stall-seconds` (and/or `--overall-timeout`) explicitly rather than swallowing a HUNG. On any non-zero exit, **stop** and report the log tail to the user — do NOT silently relaunch.
**3c. Run benchmark**
`bench.sh` no longer writes per-run logs itself. Set `OUTPUT_DIR`; per-run JSONL is written to `${OUTPUT_DIR}/jsonl_dir/` and **you MUST capture stdout+stderr with `2>&1 | tee $OUTPUT_DIR/<name>.log`**.
```bash
OUTPUT_DIR=$BENCH_DIR/<CONFIG>_mtp<N> \
MODEL_PATH=<MODEL_PATH> ISL=<ISL> OSL=<OSL> \
CONCURRENCY="<CON1> <CON2> <CON3>" \
bash bench.sh 2>&1 | tee $OUTPUT_DIR/bench_ISL<X>_OSL<Y>.log
```
For multiple ISL/OSL combinations, loop (remember `2>&1 | tee` per invocation):
```bash
export OUTPUT_DIR=$BENCH_DIR/<CONFIG>_mtp<N>
for ISL in 128 512 1024 2048; do
for OSL in 128 512 1024; do
MODEL_PATH=<MODEL_PATH> ISL=$ISL OSL=$OSL \
CONCURRENCY="1 16 64 128 256" \
bash bench.sh 2>&1 | tee $OUTPUT_DIR/bench_ISL${ISL}_OSL${OSL}.log
done
done
```
**3d. Stop server and repeat**
Kill sglang inside the container, then verify on the host (sibling-container PIDs are invisible from within the container):
```bash
ssh <SSH_HOST> "docker exec <CONTAINER> bash /sgl-workspace/stop.sh"
ssh <SSH_HOST> bash /tmp/verify_stop.sh # exit 0 = GPUs free; non-zero prints offending PIDs
```
**If a config crashes:** Report the error, run `stop.sh` then `verify_stop.sh`, and move on to the next config. Do NOT debug kernel issues or retry. Document the crash and error message in the final report.
Repeat 3a–3d for each parallel config.
### Step 4: Report
After all configs are benchmarked, generate structured CSV data, a performance plot, and a Markdown report.
#### 4a. Generate CSV from JSONL
For each config directory, run `jsonl_to_csv.py` to extract metrics into an InferenceX-compatible CSV:
```bash
python3 /sgl-workspace/jsonl_to_csv.py \
--jsonl-dir $BENCH_DIR/<CONFIG>_mtp<N>/jsonl_dir \
--hardware <HARDWARE> \
--precision <PRECISION> \
--model <MODEL_NAME> \
--date <YYYY-MM-DD> \
--output $BENCH_DIR/<CONFIG>_mtp<N>/<MODEL>_<HARDWARE>_<PRECISION>.csv
```
Required args:
- `--hardware`: GPU hardware name (e.g. `mi355x`, `b200`, `b300`)
- `--precision`: weight precision (e.g. `fp4`, `fp8`, `bf16`)
Optional args:
- `--model`: model display name (default: auto-detected from model path)
- `--date`: benchmark date (default: today)
- `--output`: output CSV path (default: auto-named in jsonl-dir parent)
The CSV follows InferenceX format with all standard columns (throughput/GPU, TTFT, TPOT, interactivity, ITL, E2E latency, etc.). Time values are stored in **seconds** (matching InferenceX convention, despite column headers saying "ms"). Interactivity = 1000 / TPOT(ms).
#### 4b. Generate performance plot
Run `plot_interactivity.py` to produce a **Token Throughput per GPU vs. Interactivity** chart from one or more CSVs:
```bash
python3 /sgl-workspace/plot_interactivity.py \
$BENCH_DIR/<CONFIG1>/<CSV1>.csv \
$BENCH_DIR/<CONFIG2>/<CSV2>.csv \
-o $BENCH_DIR/interactivity_plot.png
```
You can also include reference CSVs (e.g. from InferenceX) alongside your benchmark CSVs to produce comparison plots. Optional args: `--title`, `--subtitle`, `--dpi` (default: 150).
#### 4c. Write Markdown report
Write a Markdown report to `$BENCH_DIR/benchmark_report.md` that includes:
- Configuration summary (model, GPUs, mode, MTP status)
- Per-config results tables with all metrics + per-GPU throughput
- Cross-config comparison highlighting the best performer for each metric
- Reference to the generated CSV and plot files
Present the report to the user and walk them through the key findings.
## File Organization
```
/sgl-workspace/<model_short>_<YYYYMMDD>/
├── benchmark_report.md # final report
├── DP4EP4_mtp0/ # per-config directory
│ ├── server_DP4EP4_mtp0.log # sglang server log (from serve.sh)
│ ├── bench_ISL4096_OSL1024.log # bench.sh stdout/stderr (you capture via `2>&1 | tee`)
│ └── jsonl_dir/ # raw JSONL written by bench.sh --output-file
│ ├── bench_ISL4096_OSL1024_CON64.jsonl
│ ├── bench_ISL4096_OSL1024_CON128.jsonl
│ └── ...
├── TP8_mtp0/
│ ├── server_TP8_mtp0.log
│ └── ...
└── DP8EP8_A2A_mtp1/
└── ...
```
Each config gets its own directory. `serve.sh` writes `server_<LABEL>.log` into `LOG_DIR`. `bench.sh` writes JSONL into `OUTPUT_DIR`; capture its stdout/stderr to the same `OUTPUT_DIR` via `2>&1 | tee $OUTPUT_DIR/<bench>.log`.
don't have the plugin yet? install it then click "run inline in claude" again.