LLM evaluation & inference performance testing via the evalscope CLI. Translates natural language requests into evalscope commands for: (1) Model accuracy ev...
---
name: evalscope
description: >-
LLM evaluation & inference performance testing via the evalscope CLI.
Translates natural language requests into evalscope commands for:
(1) Model accuracy evaluation — runs 160+ benchmarks against local
checkpoints or API endpoints (OpenAI-compatible, Anthropic, LiteLLM);
(2) Performance stress testing — TTFT, TPOT, throughput, latency under
configurable concurrency; (3) RAG evaluation — RAGAS quality metrics,
MTEB embedding benchmarks, CLIP retrieval; (4) Benchmark discovery —
list/filter/inspect benchmarks by tag. Trigger on: evaluate / benchmark /
score a model, throughput / latency / QPS / stress test, find benchmarks,
view results, 评测模型, 压测, 跑 benchmark, 性能测试, 查看评测结果,
有哪些评测集, RAG 评测, embedding 评测. Do NOT trigger for: model
training / finetuning / deployment / serving requests.
---
# EvalScope
Read only the relevant reference file for the matched workflow — don't preload all of them.
| Workflow | When | Reference |
|----------|------|-----------|
| Eval (accuracy) | evaluate / benchmark / score | [eval-reference.md](eval-reference.md) |
| Perf (stress test) | throughput / latency / QPS / perf | [perf-reference.md](perf-reference.md) |
| RAG Evaluation | RAG / embedding / retrieval quality | [rag-reference.md](rag-reference.md) |
| Visualization | view results / compare / dashboard | (below) |
| Benchmark Discovery | list / find / what benchmarks | (below) |
| Troubleshooting | errors / failures / debug | [troubleshooting.md](troubleshooting.md) |
## Prerequisites
```bash
evalscope --version # verify installation
pip install evalscope # basic
pip install 'evalscope[all]' # all backends (perf, rag, service, aigc)
pip install 'evalscope[perf]' # perf only
pip install 'evalscope[rag]' # RAG only (RAGAS, MTEB, CLIP)
pip install 'evalscope[service]' # Web dashboard
```
## Decision Tree
- User wants **accuracy evaluation** (evaluate / benchmark / score / 评测)
- Local checkpoint path or HuggingFace/ModelScope ID → `--model PATH` (auto `llm_ckpt`)
- API endpoint → `--model NAME --api-url URL` (auto `openai_api`)
- Anthropic → `--eval-type anthropic_api`
- LiteLLM multi-provider → `--eval-type litellm --model provider/name`
- OpenAI Responses API → `--eval-type openai_responses_api`
- Pipeline test → `--model mock --eval-type mock_llm`
- Image generation → `--eval-type text2image`
- TTS → `--eval-type text2speech`
- Image editing → `--eval-type image_editing`
- User wants **performance test** (throughput / latency / QPS / 压测)
- → `evalscope perf` workflow
- API types: `openai` (default), `local`, `local_vllm`, `dashscope`, `embedding`, `rerank`, `custom`
- User wants **RAG evaluation** (RAG / embedding quality / retrieval)
- → `evalscope eval --eval-backend RAGEval` with tool config
- User wants **visualization** (view / compare / dashboard)
- → `evalscope service`
- User wants **benchmark info** (list / find / what benchmarks / 有哪些评测集)
- → `evalscope benchmark-info`
## Workflow 1: Eval (Accuracy)
Core command pattern:
```bash
# Local checkpoint
evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k --limit 10
# API endpoint (auto-detects openai_api when --api-url is set)
evalscope eval --model qwen-plus --datasets gsm8k arc \
--api-url http://localhost:8000/v1/chat/completions --api-key sk-xxx --limit 10
# Anthropic
evalscope eval --model claude-3-5-sonnet --eval-type anthropic_api --datasets mmlu --api-key sk-ant-xxx
```
Key parameters: `--datasets`, `--limit`, `--generation-config`, `--dataset-args`, `--eval-backend`, `--judge-strategy`. For full parameter list → [eval-reference.md](eval-reference.md).
Output: `outputs/<timestamp>/reports/*.json` (scores), `report.html` (summary).
## Workflow 2: Perf (Stress Test)
Core command pattern:
```bash
# Basic throughput test
evalscope perf --model qwen-plus \
--url http://localhost:8000/v1/chat/completions --api openai \
--dataset openqa --parallel 5 --number 200 --stream
# Concurrency gradient (--parallel and --number must pair)
evalscope perf --model qwen-plus --url http://localhost:8000/v1/chat/completions \
--api openai --parallel 1 5 10 20 --number 50 250 500 1000 --stream
# Embedding model
evalscope perf --model text-embedding-v3 --url http://localhost:8000/v1/embeddings \
--api embedding --parallel 10 --number 500
# Rerank model
evalscope perf --model bge-reranker --url http://localhost:8000/v1/rerank \
--api rerank --parallel 5 --number 200
```
Key parameters: `--parallel`, `--number`, `--dataset`, `--max-tokens`, `--sla-auto-tune`. For full parameter list → [perf-reference.md](perf-reference.md).
Output: console table (TTFT/TPOT/throughput p50-p99) + HTML report.
## Workflow 3: RAG Evaluation
Uses `--eval-backend RAGEval` with a Python dict/YAML config. Three tools: RAGAS, MTEB, clip_benchmark.
```python
from evalscope import run_task
run_task({
'eval_backend': 'RAGEval',
'eval_config': {
'tool': 'MTEB', # or 'RAGAS' or 'clip_benchmark'
...
}
})
```
For config schemas and examples → [rag-reference.md](rag-reference.md).
## Workflow 4: Visualization
```bash
evalscope service --host 0.0.0.0 --port 9000 --outputs ./outputs
```
Options: `--host` (default 0.0.0.0), `--port` (default 9000), `--outputs PATH` (scan dir), `--debug`.
Requires: `pip install 'evalscope[service]'`.
## Workflow 5: Benchmark Discovery
```bash
evalscope benchmark-info --list # all benchmarks
evalscope benchmark-info --list --tag Math Coding # filter by tags (OR, case-insensitive)
evalscope benchmark-info gsm8k # text summary
evalscope benchmark-info gsm8k --format json # structured JSON
evalscope benchmark-info gsm8k --format markdown # full docs
```
## Workflow 6: Sandbox Evaluation
For code-execution benchmarks (HumanEval, MBPP, etc.) with Docker isolation:
```bash
evalscope eval --model qwen-plus --datasets humaneval \
--api-url http://localhost:8000/v1/chat/completions \
--sandbox '{"enabled": true, "type": "docker"}'
```
Requires Docker daemon running. See `evalscope eval --help` for `--sandbox` schema.
## Quick Lookup Table
For up-to-date results: `evalscope benchmark-info --list --tag <TAG>`
| User Need | Tags | Typical Benchmarks |
|-----------|------|--------------------|
| Math / reasoning | Math, Reasoning | gsm8k, math_500, aime24, competition_math |
| Coding | Coding | humaneval, mbpp, live_code_bench |
| General knowledge | Knowledge, MCQ | mmlu, ceval, cmmlu, mmlu_pro |
| Chinese | Chinese | ceval, cmmlu, chinese_simpleqa |
| Multimodal / vision | MultiModal | mmmu, mm_bench, math_vista |
| Instruction following | InstructionFollowing | ifeval, multi_if |
| Function calling | FunctionCalling | bfcl_v3, bfcl_v4 |
| Long context | LongContext | needle_haystack, longbench_v2 |
| Agent | Agent | tau_bench |
Common suites:
- **General LLM**: `mmlu gsm8k bbh humaneval ifeval`
- **Chinese**: `ceval cmmlu chinese_simpleqa`
- **Multimodal**: `mmmu mm_bench math_vista mm_star`
## General Notes
- Always `--limit 5` for first-run validation
- Default output: `./outputs/<timestamp>/`
- Long runs: background + `tail -f outputs/<timestamp>/logs/eval_log.log`
- Resume interrupted runs: `--use-cache outputs/<previous_timestamp>`
- Full parameter help: `evalscope eval --help` / `evalscope perf --help`
- On errors → [troubleshooting.md](troubleshooting.md)
don't have the plugin yet? install it then click "run inline in claude" again.