eval-driven-dev

Item: eval-driven-dev
Rating: 7.2
Author: Implexa

Improve an AI application with evaluation-driven development. Define eval criteria, instrument the application, build golden datasets, observe and evaluate…

SkillRank score: 7.2/10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

eval-driven-dev establishes a pattern for testing python llm applications end-to-end by running real code paths with instrumented external data and real llm calls, scoring outputs via evaluators rather than assertions.

structure: 8.0
trigger phrases: 6.0
procedure: 7.0
edge cases: 6.0
documentation: 8.0

original SKILL.md from skills.sh

Eval-Driven Development for Python LLM Applications

You're building an automated evaluation pipeline that tests a Python-based AI application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via pixie test.

What you're testing is the app itself — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of assertEqual — but the thing under test is the app's code, not the LLM.
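A minimal sketch of the difference between an equality assertion and a similarity evaluator, using only the standard library. The function names, the normalization step, and the 0.8 threshold are illustrative assumptions, not part of pixie or of this skill.

```python
from difflib import SequenceMatcher


def similarity_score(output: str, reference: str) -> float:
    """Return a 0-1 string-similarity score, ignoring case and extra whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return SequenceMatcher(None, normalize(output), normalize(reference)).ratio()


def passes(output: str, reference: str, threshold: float = 0.8) -> bool:
    """Pass when the score clears the threshold, instead of requiring equality."""
    return similarity_score(output, reference) > threshold
```

Two phrasings of the same answer would fail `assertEqual` but can still clear the threshold here. String distance is only a crude stand-in; an embedding-based score or an LLM-as-judge would slot into the same place.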

During evaluation, the app's own code runs for real — routing, prompt assembly, LLM calls, response formatting — nothing is mocked or stubbed. But the data the app reads from external sources (databases, caches, third-party APIs, voice streams) is replaced with test-specified values via instrumentations. This means each test case controls exactly what data the app sees, while still exercising the full application code path.
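One way to picture this, as a sketch rather than pixie's actual instrumentation mechanism: the standard library's `unittest.mock.patch.object` swaps out an external read for the duration of one test case while the app's own prompt assembly runs unchanged. `ProfileStore`, `store`, `build_prompt`, and the case layout are hypothetical names invented for the example, and the LLM call is left out of the sketch because it would stay real.

```python
from unittest.mock import patch


class ProfileStore:
    """Hypothetical external data source; production code would query a database."""

    def fetch(self, user_id: str) -> dict:
        raise RuntimeError("no database available in the eval environment")


store = ProfileStore()


def build_prompt(user_id: str, question: str) -> str:
    """App code under test: gathers context and assembles the prompt for real."""
    profile = store.fetch(user_id)
    return f"Customer {profile['name']} ({profile['plan']} plan) asks: {question}"


def build_prompt_for_case(case: dict) -> str:
    """Run the real prompt assembly while the external read returns test data."""
    with patch.object(store, "fetch", return_value=case["profile"]) as fake_fetch:
        prompt = build_prompt(case["user_id"], case["question"])
    # Confirm the app really went through the instrumented read.
    fake_fetch.assert_called_once_with(case["user_id"])
    return prompt
```

Only the data source is replaced. The prompt string is produced by the same code path a production request would take, which is the property the paragraph above describes.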

Rule: The app's LLM calls must go to a real LLM. Do not replace, mock, stub, or intercept the LLM with a fake implementation. The LLM is the core value-generating component — replacing it makes the eval tautological (you control both inputs and outputs, so scores are meaningless). If the project's test suite contains LLM mocking patterns, those are for the project's own unit tests — do NOT adopt them for the eval Runnable.

The deliverable is a working pixie test run with real scores — not a plan, not just instrumentation, not just a dataset.

This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.

Before you start

related skills

semantically similar in the cross-vendor index

clawhub

76% match

Eval Driven Development

Add instrumentation, build golden datasets, write eval-based tests, run them, root-cause failures, and iterate — Ensure your Python LLM application works cor...

added explicit inputs for instrumentation and evaluators, structured procedure with step-by-step inputs/outputs, expanded decision points with api costs and conflict handling, defined output contract with pixie report format and exit codes, clarified outcome signal as regression test capability.

---
name: eval-driven-dev
slug: eval-driven-dev
description: build an end-to-end evaluation pipeline for python llm applications using real llm calls and instrumented test data
source: skills.sh
original_author: github
---

intent

build an automated evaluation pipeline that tests your python-based ai application end-to-end by running it the same way a real user would, then scoring outputs with evaluators and producing pass/fail results via pixie test. the app's own code path runs for real (routing, prompt assembly, llm calls, response formatting), but external data sources (databases, caches, third-party apis, voice streams) are replaced with test-controlled values via instrumentation. your llm calls must hit a real llm, not a mock. the deliverable is a working pixie test run with real scores, not a plan.

inputs

- python application: a working llm-based app with request handling, context assembly, routing, and response formatting. the app must be importable and runnable in your test environment.
- instrumentation layer: code that intercepts and replaces external data sources (database reads, cache lookups, api calls, voice input) with test-specified values. this isolates the app's logic from external dependencies without mocking the llm.
- test dataset: a set of test cases, each with expected inputs and one or more acceptable output patterns or quality criteria. format as json, yaml, or python dict.
- evaluators: scoring functions that assess app outputs. options include:
  - llm-as-judge: a separate llm call that scores outputs against criteria (requires api key and budget for eval calls).
  - similarity scores: embedding-based or string-based comparison.
  - custom functions: deterministic logic for output validation.
- pixie test framework: installed and configured (via pip install pixie or your project's dependencies).
- llm api credentials: env vars (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY) for the app's llm calls and, if using llm-as-judge, for the evaluator's llm calls. each may require different keys if using multiple providers.
- optional: external service test doubles: mock or stub endpoints for services you cannot or should not call during test (e.g., payment processors, rate-limited third-party apis). use environment variable flags to route the app to test doubles during eval.
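since a missing key only surfaces once the first llm call fails, a small preflight check can save a wasted run. this is a sketch under the assumption that credentials arrive as environment variables; `missing_credentials` is a hypothetical helper, not something pixie provides.

```python
import os
from typing import List, Mapping, Optional, Sequence


def missing_credentials(
    required: Sequence[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names of required env vars that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

a run could call `missing_credentials(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"])` before starting and stop with a clear message if the list is non-empty. which names to require depends on which providers the app and the evaluators actually use.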

procedure

1. audit the app's request path: read the app's entry point (e.g., main handler function, api endpoint). trace how it gathers context (databases, caches, apis), assembles prompts, calls the llm, and formats responses. document each external dependency (database queries, api calls, config reads, file reads, etc.).
   - input: app source code.
   - output: a dependency map listing each external call by function/line number and the data it reads.
2. design instrumentation: for each external dependency, write a wrapper or monkey-patch that redirects it to a test double. the test double returns test-controlled values keyed by (function, input_params). do not instrument the llm calls.
   - input: dependency map from step 1.
   - output: instrumentation module with a setup function that activates all patches for the current test.
3. create test dataset: write test cases as json, yaml, or python dicts. each case specifies (a) input to the app (user query, conversation history, request payload), (b) instrumentation values (what the database query returns, what the api returns), and (c) acceptance criteria (e.g., "response length between 50 and 200 tokens", "mentions product X", "similarity to golden response > 0.8").
   - input: app's expected inputs and outputs, domain knowledge.
   - output: test dataset file (e.g., tests/eval_cases.json).
4. choose evaluators: decide what scores each output gets. for each test case, pick:
   - llm-as-judge with a scoring prompt (e.g., "rate this response 1-5 for helpfulness"), or
   - similarity scorer (embedding or string distance), or
   - custom function (regex match, token count check, deterministic logic).
   - input: acceptance criteria from test dataset.
   - output: evaluator functions (python callables that return float 0-1 or dict with scores).
5. write the pixie test: create a python file (e.g., tests/test_eval.py) that:
   - imports the app, instrumentation, test dataset, and evaluators.
   - for each test case, activates instrumentation with test-controlled data, calls the app with the test input, runs evaluators on the output, and asserts scores meet thresholds.
   - uses pixie's test decorators and assertions (e.g., @pixie.eval, assert score > 0.7).
   - input: app, instrumentation, test dataset, evaluators, pixie framework.
   - output: test file with test functions.
6. run the pipeline: execute pixie test and collect results.
   - input: test file, llm api credentials (env vars), instrumentation (active).
   - output: pixie test report with pass/fail, scores for each case, latencies, llm token usage.
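the dataset, evaluator, and scoring steps above can be sketched in plain python. this deliberately does not use pixie's api, which is not shown in this document; the case fields, evaluator names, word-based length check, and the `app` adapter are all hypothetical. `app` stands for a callable you would write that activates instrumentation with the case's values and then invokes the real application, real llm call included.

```python
import json
from typing import Callable, Dict

# Hypothetical dataset; in practice it would live in a file such as tests/eval_cases.json.
CASES = json.loads("""
[
  {
    "id": "refund-policy",
    "input": "Can I return my order?",
    "instrumentation": {"get_order": {"status": "delivered", "days_since_delivery": 5}},
    "criteria": {"must_mention": "30 days", "max_words": 60}
  }
]
""")


def mentions_required_phrase(output: str, criteria: dict) -> float:
    """Deterministic evaluator: 1.0 if the required phrase appears, else 0.0."""
    return 1.0 if criteria["must_mention"].lower() in output.lower() else 0.0


def within_word_budget(output: str, criteria: dict) -> float:
    """Deterministic evaluator: 1.0 if the response fits the word budget, else 0.0."""
    return 1.0 if len(output.split()) <= criteria["max_words"] else 0.0


EVALUATORS: Dict[str, Callable[[str, dict], float]] = {
    "mentions_required_phrase": mentions_required_phrase,
    "within_word_budget": within_word_budget,
}


def run_case(case: dict, app: Callable[[str, dict], str], threshold: float = 0.7) -> dict:
    """Call the app with the case's input and test data, then score the output."""
    output = app(case["input"], case["instrumentation"])
    scores = {name: fn(output, case["criteria"]) for name, fn in EVALUATORS.items()}
    return {
        "id": case["id"],
        "output": output,
        "scores": scores,
        "passed": all(score > threshold for score in scores.values()),
    }
```

each evaluator returns a float between 0 and 1, so an llm-as-judge or similarity scorer could be added to `EVALUATORS` without changing `run_case`. in a pixie test file, the body of `run_case` is roughly what each test function would do before asserting on the scores.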

decision points

- if the app calls multiple llms (openai, anthropic, etc.): set up env vars for each. document which llm is used in which code path so you know which credentials are needed and can budget separately.
- if external services are rate-limited or expensive: use mock endpoints or test doubles for those services during eval. set an env var (e.g., USE_TEST_DOUBLES=1) that the app reads to route to test doubles instead. do not mock the app's main llm calls.
- if the evaluator is llm-as-judge and you're running many test cases: expect llm eval costs. budget accordingly and consider sampling test cases or caching evaluator results if the same output is scored more than once.
- if test cases time out: check if the app or llm calls are hanging. add a timeout wrapper to app calls (e.g., timeout_seconds=30). if timeouts are frequent, reduce dataset size or check for infinite loops in the app's code.
- if instrumentation patches conflict (e.g., two patches both override the same function): refactor instrumentation to a single patch that multiplexes test-controlled values by input. alternatively, apply patches in a specific order or use context managers to avoid overlaps.
- if evaluator scores are nearly all 0 or nearly all 1 (no variance): the evaluator is likely too strict or too loose, respectively. adjust thresholds or evaluator logic. llm-as-judge scores should vary across test cases; if they don't, the scoring prompt may be unclear.
- if the app's llm calls fail (auth, rate limits, network): the eval halts. ensure api credentials are valid and have quota. add retry logic with exponential backoff to the app if not already present. consider a fallback to a cheaper/faster llm for dev iteration, then run final evals on the target llm.
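the timeout wrapper and the retry-with-backoff advice above might look like the following, using only the standard library. both helpers are hypothetical; the default attempt count and delays are placeholders to tune, and the timeout version abandons the worker thread rather than killing it, so a truly hung call keeps running in the background.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_seconds, *args, **kwargs):
    """Run fn in a worker thread and raise TimeoutError if it exceeds the limit.

    The worker is not interrupted on timeout; it is only abandoned.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call exceeded {timeout_seconds}s") from None
    finally:
        executor.shutdown(wait=False)


def call_with_retry(fn, attempts=4, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn, retrying on the given exceptions with delays of base_delay * 2**n."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

narrowing `retry_on` to the provider's rate-limit and network exceptions avoids retrying on auth errors, which will not fix themselves. the `sleep` parameter exists so the backoff schedule can be checked without actually waiting.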

output contract

- pixie test report: human-readable output (stdout or file) showing:
  - test case id, app input, app output.
  - evaluator scores (float or dict).
  - pass/fail status per case (derived from score > threshold).
  - aggregate stats: total cases, passed, failed, avg score, p50/p95 latencies.
  - llm usage: total tokens (input/output) for app calls and eval calls, total cost (if prices are known).
- optional: test artifacts: save app outputs, evaluator details, and logs to a directory (e.g., results/eval_run_20240115_143022/) for debugging.
- exit code: 0 if all test cases pass, non-zero (e.g., 1) if any fail.
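to make the aggregate stats and exit code concrete, here is a sketch of how they could be derived from per-case results. the result fields (`score`, `latency_s`), the nearest-rank percentile method, and the 0.7 threshold are assumptions for illustration; pixie's real report format may differ.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct between 0 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results, threshold=0.7):
    """Aggregate per-case results into summary stats and a process exit code."""
    passed = sum(1 for r in results if r["score"] > threshold)
    latencies = [r["latency_s"] for r in results]
    return {
        "total": len(results),
        "passed": passed,
        "failed": len(results) - passed,
        "avg_score": mean(r["score"] for r in results),
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "exit_code": 0 if passed == len(results) else 1,
    }
```

a ci job would pass `summarize(results)["exit_code"]` to `sys.exit`, so that any failing case fails the build.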

outcome signal

- pixie test completes without errors. all test cases either pass (score > threshold) or fail with a clear reason (low score, app error, eval error, timeout).
- you can re-run the same test cases and get consistent results (modulo randomness in llm outputs, which evaluators should absorb).
- llm token usage and latencies are reasonable for the app's scale and acceptable as a baseline for future iterations.
- if you change the app's code (e.g., modify the prompt, add a new routing rule), re-run the eval and confirm scores improve or degrade as expected. the eval is now a regression test.
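using the eval as a regression test comes down to comparing two runs case by case. a minimal sketch, assuming each run has been reduced to a mapping from case id to a single score; `compare_runs` and the 0.05 tolerance (meant to absorb ordinary llm noise) are hypothetical choices.

```python
def compare_runs(baseline, current, tolerance=0.05):
    """Return per-case score deltas and the case ids that regressed beyond tolerance."""
    deltas = {
        case_id: current[case_id] - baseline[case_id]
        for case_id in baseline
        if case_id in current
    }
    regressed = sorted(case_id for case_id, delta in deltas.items() if delta < -tolerance)
    return deltas, regressed
```

cases present in only one run are skipped here; a stricter version could report them as added or removed.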