Eval-Driven Development for Python LLM Applications

You're building an automated evaluation pipeline that tests a Python-based AI application end-to-end, running it the same way a real user would, with real inputs, then scoring the outputs using evaluators and producing pass/fail results via pixie test.

What you're testing is the app itself: its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic. That's why you use evaluators (LLM-as-judge, similarity scores) instead of assertEqual, but the thing under test is the app's code, not the LLM.

During evaluation, the app's own code runs for real (routing, prompt assembly, LLM calls, response formatting); nothing is mocked or stubbed. But the data the app reads from external sources (databases, caches, third-party APIs, voice streams) is replaced with test-specified values via instrumentations. This means each test case controls exactly what data the app sees, while still exercising the full application code path.

Rule: The app's LLM calls must go to a real LLM. Do not replace, mock, stub, or intercept the LLM with a fake implementation. The LLM is the core value-generating component; replacing it makes the eval tautological (you control both inputs and outputs, so scores are meaningless). If the project's test suite contains LLM mocking patterns, those are for the project's own unit tests. Do NOT adopt them for the eval Runnable.

The deliverable is a working pixie test run with real scores: not a plan, not just instrumentation, not just a dataset. This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.

Before you start
by @clawhub
Added explicit inputs for instrumentation and evaluators, a structured procedure with step-by-step inputs/outputs, expanded decision points with API costs and conflict handling, a defined output contract with pixie report format and exit codes, and a clarified outcome signal as regression test capability.
---
name: eval-driven-dev
slug: eval-driven-dev
description: build an end-to-end evaluation pipeline for python llm applications using real llm calls and instrumented test data
source: skills.sh
original_author: github
---
Build an automated evaluation pipeline that tests your Python-based AI application end-to-end by running it the same way a real user would, then scoring outputs with evaluators and producing pass/fail results via pixie test. The app's own code path runs for real (routing, prompt assembly, LLM calls, response formatting), but external data sources (databases, caches, third-party APIs, voice streams) are replaced with test-controlled values via instrumentation. Your LLM calls must hit a real LLM, not a mock. The deliverable is a working pixie test run with real scores, not a plan.
[...] pip install pixie or your project's dependencies).
[...] OPENAI_API_KEY, ANTHROPIC_API_KEY) for the app's LLM calls and, if using LLM-as-judge, for the evaluator's LLM calls. Each may require different keys if using multiple providers.

Audit the app's request path: read the app's entry point (e.g., main handler function, API endpoint). Trace how it gathers context (databases, caches, APIs), assembles prompts, calls the LLM, and formats responses. Document each external dependency (database queries, API calls, config reads, file reads, etc.).
Design instrumentation: for each external dependency, write a wrapper or monkey-patch that redirects it to a test double. The test double returns test-controlled values keyed by (function, input_params). Do not instrument the LLM calls.
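
The test double described above can be sketched in plain Python. This is a minimal illustration, not pixie's instrumentation API: `CustomerDB`, `build_context`, and `DoubleRegistry` are hypothetical names invented for the example, and the patching uses the standard library's `unittest.mock`. The registry keys canned values by (function name, input params), and only the external data call is patched; the app code that consumes the data still runs.

```python
import json
from unittest import mock


class CustomerDB:
    """Hypothetical external dependency; the real one would hit a database."""

    def fetch_profile(self, customer_id):
        raise RuntimeError("real database must not be reached during an eval run")


def build_context(db, customer_id):
    """Hypothetical app code: gathers data that would later feed the prompt."""
    profile = db.fetch_profile(customer_id=customer_id)
    return f"Customer {profile['name']} is on the {profile['plan']} plan."


class DoubleRegistry:
    """Maps (function name, input params) to test-controlled return values."""

    def __init__(self, canned):
        self._canned = dict(canned)

    @staticmethod
    def key(function_name, **params):
        # Serialize params deterministically so equal inputs give equal keys.
        return (function_name, json.dumps(params, sort_keys=True))

    def lookup(self, function_name, **params):
        k = self.key(function_name, **params)
        if k not in self._canned:
            # Fail loudly: an unregistered call means the test case is incomplete.
            raise KeyError(f"no test value registered for {k}")
        return self._canned[k]


registry = DoubleRegistry({
    DoubleRegistry.key("fetch_profile", customer_id="c-1"): {"name": "Ada", "plan": "pro"},
})

db = CustomerDB()
# Patch only the data source; build_context itself is exercised unchanged.
with mock.patch.object(
    db,
    "fetch_profile",
    side_effect=lambda **params: registry.lookup("fetch_profile", **params),
):
    context = build_context(db, "c-1")

print(context)
```

Raising on an unregistered key is a deliberate choice here: a silent default would let a test case pass while the app saw data the case never specified.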
Create test dataset: write test cases as JSON, YAML, or Python dicts. Each case specifies (a) input to the app (user query, conversation history, request payload), (b) instrumentation values (what the database query returns, what the API returns), and (c) acceptance criteria (e.g., "response length between 50 and 200 tokens", "mentions product X", "similarity to golden response > 0.8").
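
One possible shape for such a case, written as Python dicts, is sketched below together with a deterministic check of the three example criteria. The field names (`input`, `instrumentation`, `criteria`, and the keys inside them) are assumptions made for illustration, not a schema that pixie defines. The thresholds are shrunk so the example stays short, the token count is a whitespace split rather than a model tokenizer, and `difflib` character similarity stands in for whatever similarity evaluator you actually choose.

```python
from difflib import SequenceMatcher

# Hypothetical dataset layout: one dict per test case.
EVAL_CASES = [
    {
        "input": {"user_query": "Which plan am I on?", "history": []},
        "instrumentation": {"fetch_profile": {"name": "Ada", "plan": "pro"}},
        "criteria": {
            "min_tokens": 3,      # illustrative; the text's example uses 50
            "max_tokens": 200,
            "must_mention": "pro",
            "golden_response": "You are on the pro plan.",
            "min_similarity": 0.8,
        },
    },
]


def check_case(output, criteria):
    """Return {criterion name: bool} for one app output."""
    n_tokens = len(output.split())  # rough proxy for a token count
    similarity = SequenceMatcher(
        None, output.lower(), criteria["golden_response"].lower()
    ).ratio()
    return {
        "length": criteria["min_tokens"] <= n_tokens <= criteria["max_tokens"],
        "mentions": criteria["must_mention"].lower() in output.lower(),
        "similarity": similarity > criteria["min_similarity"],
    }


criteria = EVAL_CASES[0]["criteria"]
good = check_case("You are on the pro plan.", criteria)
bad = check_case("No idea.", criteria)
print(good, bad)
```

Keeping each criterion as a separate boolean makes a failing case easier to debug than a single combined score.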
[...] tests/eval_cases.json).

Choose evaluators: decide what scores each output gets. For each test case, pick:
Write the pixie test: create a Python file (e.g., tests/test_eval.py) that:
[...] @pixie.eval, assert score > 0.7).

Run the pipeline: execute pixie test and collect results.
[...] USE_TEST_DOUBLES=1) that the app reads to route to test doubles instead. Do not mock the app's main LLM calls.
[...] timeout_seconds=30). If timeouts are frequent, reduce dataset size or check for infinite loops in the app's code.
[...] results/eval_run_20240115_143022/) for debugging.
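
The timeout and the timestamped results directory mentioned above can be approximated with the standard library alone. The sketch below is an assumption about how such a harness might look, not pixie's behavior: `run_with_timeout` and `make_run_dir` are hypothetical helpers, and the directory name follows the `eval_run_YYYYMMDD_HHMMSS` pattern implied by the example path.

```python
import concurrent.futures
import json
import tempfile
import time
from datetime import datetime
from pathlib import Path


def run_with_timeout(fn, payload, timeout_seconds=30):
    """Run fn(payload) in a worker thread and report a timeout instead of hanging."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, payload)
    try:
        return {"status": "ok", "output": future.result(timeout=timeout_seconds)}
    except concurrent.futures.TimeoutError:
        return {"status": "timeout", "output": None}
    finally:
        # Do not wait for a stuck worker. Python cannot kill the thread, so a
        # genuine infinite loop still needs fixing in the app's code.
        executor.shutdown(wait=False)


def make_run_dir(root, now=None):
    """Create results/eval_run_<timestamp>/ style directories under root."""
    now = now or datetime.now()
    run_dir = Path(root) / f"eval_run_{now:%Y%m%d_%H%M%S}"
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


with tempfile.TemporaryDirectory() as root:
    run_dir = make_run_dir(root, datetime(2024, 1, 15, 14, 30, 22))
    results = [
        run_with_timeout(lambda p: p.upper(), "hello", timeout_seconds=1),
        # Simulated slow app call: sleeps longer than the allowed budget.
        run_with_timeout(lambda p: time.sleep(0.5), "slow", timeout_seconds=0.05),
    ]
    (run_dir / "results.json").write_text(json.dumps(results, indent=2))
    run_dir_name = run_dir.name
    statuses = [r["status"] for r in results]

print(run_dir_name, statuses)
```

Recording a timeout as a result row, rather than raising, lets the rest of the dataset run so that one slow case does not hide the scores of the others.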