---
name: "autoresearch-agent"
description: "Autonomous experiment loop that optimizes any file by a measurable metric. Inspired by Karpathy's autoresearch. The agent edits a target file, runs a fixed evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when: user wants to optimize code speed, reduce bundle/image size, improve test pass rate, optimize prompts, improve content quality (headlines, copy, CTR), or run any measurable improvement loop. Requires: a target file, an evaluation command that outputs a metric, and a git repo."
license: MIT
metadata:
  version: 2.0.0
  author: Alireza Rezvani
  category: engineering
  updated: 2026-03-13
---
# Autoresearch Agent
> You sleep. The agent experiments. You wake up to results.
Autonomous experiment loop inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.
Not one guess — fifty measured attempts, compounding.
---
## Slash Commands
| Command | What it does |
|---------|-------------|
| `/ar:setup` | Set up a new experiment interactively |
| `/ar:run` | Run a single experiment iteration |
| `/ar:loop` | Start autonomous loop with configurable interval (10m, 1h, daily, weekly, monthly) |
| `/ar:status` | Show dashboard and results |
| `/ar:resume` | Resume a paused experiment |
---
## When This Skill Activates
Recognize these patterns from the user:
- "Make this faster / smaller / better"
- "Optimize [file] for [metric]"
- "Improve my [headlines / copy / prompts]"
- "Run experiments overnight"
- "I want to get [metric] from X to Y"
- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch
If the user describes a target file + a way to measure success → this skill applies.
---
## Setup
### First Time — Create the Experiment
Run the setup script. The user decides where experiments live:
**Project-level** (inside repo, git-tracked, shareable with team):
```bash
python scripts/setup_experiment.py \
--domain engineering \
--name api-speed \
--target src/api/search.py \
--eval "pytest bench.py --tb=no -q" \
--metric p50_ms \
--direction lower \
--scope project
```
**User-level** (personal, in `~/.autoresearch/`):
```bash
python scripts/setup_experiment.py \
--domain marketing \
--name medium-ctr \
--target content/titles.md \
--eval "python evaluate.py" \
--metric ctr_score \
--direction higher \
--evaluator llm_judge_content \
--scope user
```
The `--scope` flag determines where `.autoresearch/` lives:
- `project` (default) → `.autoresearch/` in the repo root. Experiment definitions are git-tracked. Results are gitignored.
- `user` → `~/.autoresearch/` in the home directory. Everything is personal.
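As a rough illustration of the two scopes, the sketch below resolves an experiment directory from a scope name. The function `experiment_dir` and its `repo_root` parameter are hypothetical names used only for this example; the real `setup_experiment.py` may resolve paths differently.

```python
from pathlib import Path


def experiment_dir(scope: str, domain: str, name: str, repo_root: Path) -> Path:
    """Return the directory an experiment would live in for a given scope."""
    if scope == "project":
        base = repo_root / ".autoresearch"      # inside the repo
    elif scope == "user":
        base = Path.home() / ".autoresearch"    # personal, outside the repo
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return base / domain / name


# Hypothetical repo location, for illustration only
print(experiment_dir("project", "engineering", "api-speed", Path("/repo")))
print(experiment_dir("user", "marketing", "medium-ctr", Path("/repo")))
```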
### What Setup Creates
```
.autoresearch/
├── config.yaml                     ← Global settings
├── .gitignore                      ← Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
    ├── program.md                  ← Objectives, constraints, strategy
    ├── config.cfg                  ← Target, eval cmd, metric, direction
    ├── results.tsv                 ← Experiment log (gitignored)
    └── evaluate.py                 ← Evaluation script (if --evaluator used)
```
**results.tsv columns:** `commit | metric | status | description`
- `commit` — short git hash
- `metric` — float value or "N/A" for crashes
- `status` — keep | discard | crash
- `description` — what changed or why it crashed
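To make the log format concrete, here is a minimal sketch that parses a `results.tsv` and finds the best kept metric. It assumes the file is tab-separated with a header row naming the four columns above; the sample rows and the helper names `load_results` and `best_kept` are made up for illustration.

```python
import csv
import io
from typing import Dict, List, Optional

# Hypothetical sample; a real file lives in the experiment directory.
SAMPLE = (
    "commit\tmetric\tstatus\tdescription\n"
    "a1b2c3d\t800.0\tkeep\tbaseline\n"
    "b2c3d4e\tN/A\tcrash\tmissing import\n"
    "c3d4e5f\t640.5\tkeep\tcache parsed query\n"
    "d4e5f6a\t700.2\tdiscard\tswitch to regex tokenizer\n"
)


def load_results(text: str) -> List[Dict[str, str]]:
    """Parse results.tsv text into a list of row dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def best_kept(rows: List[Dict[str, str]], direction: str) -> Optional[float]:
    """Return the best metric among 'keep' rows, or None if there are none."""
    kept = [float(row["metric"]) for row in rows if row["status"] == "keep"]
    if not kept:
        return None
    return min(kept) if direction == "lower" else max(kept)


rows = load_results(SAMPLE)
print(best_kept(rows, "lower"))   # 640.5
```

Crash rows carry "N/A" instead of a number, so the sketch only converts the metric for rows whose status is `keep`.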
### Domains
| Domain | Use Cases |
|--------|-----------|
| `engineering` | Code speed, memory, bundle size, test pass rate, build time |
| `marketing` | Headlines, social copy, email subjects, ad copy, engagement |
| `content` | Article structure, SEO descriptions, readability, CTR |
| `prompts` | System prompts, chatbot tone, agent instructions |
| `custom` | Anything else with a measurable metric |
### If `program.md` Already Exists
The user may have written their own `program.md`. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.
---
## Agent Protocol
You are the loop. The scripts handle setup and evaluation — you handle the creative work.
### Before Starting
1. Read `.autoresearch/{domain}/{name}/config.cfg` to get:
   - `target`: the file you edit
   - `evaluate_cmd`: the command that measures your changes
   - `metric`: the metric name to look for in eval output
   - `metric_direction`: whether "lower" or "higher" is better
   - `time_budget_minutes`: max time per evaluation
2. Read `program.md` for strategy, constraints, and what you can/cannot change
3. Read `results.tsv` for experiment history (columns: commit, metric, status, description)
4. Checkout the experiment branch: `git checkout autoresearch/{domain}/{name}`
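The layout of `config.cfg` is not shown here, so the following is only a sketch under a loud assumption: that it is an INI-style file readable by Python's `configparser`, with the five keys listed above under a section called `experiment`. The section name, the sample values, and `load_config` are all hypothetical.

```python
import configparser

# Hypothetical contents; the real config.cfg layout may differ.
SAMPLE_CFG = """
[experiment]
target = src/api/search.py
evaluate_cmd = pytest bench.py --tb=no -q
metric = p50_ms
metric_direction = lower
time_budget_minutes = 5
"""

REQUIRED = ("target", "evaluate_cmd", "metric", "metric_direction", "time_budget_minutes")


def load_config(text: str) -> dict:
    """Parse the config text and check that every expected key is present."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["experiment"]          # assumed section name
    missing = [key for key in REQUIRED if key not in section]
    if missing:
        raise KeyError(f"config.cfg is missing: {', '.join(missing)}")
    cfg = dict(section)
    if cfg["metric_direction"] not in ("lower", "higher"):
        raise ValueError("metric_direction must be 'lower' or 'higher'")
    cfg["time_budget_minutes"] = float(cfg["time_budget_minutes"])
    return cfg


config = load_config(SAMPLE_CFG)
print(config["target"], config["metric_direction"])
```

Failing fast on a missing key or an unknown direction matches the rule that the metric direction must be known before the loop starts.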
### Each Iteration
1. Review results.tsv — what worked? What failed? What hasn't been tried?
2. Decide ONE change to the target file. One variable per experiment.
3. Edit the target file
4. Commit: `git add {target} && git commit -m "experiment: {description}"`
5. Evaluate: `python scripts/run_experiment.py --experiment {domain}/{name} --single`
6. Read the output — it prints KEEP, DISCARD, or CRASH with the metric value
7. Go to step 1
### What the Script Handles (you don't)
- Running the eval command with timeout
- Parsing the metric from eval output
- Comparing to previous best
- Reverting the commit on failure (`git reset --hard HEAD~1`)
- Logging the result to results.tsv
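The comparison step can be pictured with a small sketch. It is not the code in `run_experiment.py`; the function `decide` is hypothetical, and it assumes a strict improvement is required, so a tie with the previous best counts as a discard.

```python
from typing import Optional


def decide(metric: Optional[float], best: Optional[float], direction: str) -> str:
    """Classify one run as 'keep', 'discard', or 'crash'."""
    if metric is None:
        return "crash"       # eval failed or printed no metric
    if best is None:
        return "keep"        # first successful run becomes the baseline
    improved = metric < best if direction == "lower" else metric > best
    return "keep" if improved else "discard"


print(decide(185.0, 210.0, "lower"))   # keep
print(decide(7.9, 8.4, "higher"))      # discard
print(decide(None, 8.4, "higher"))     # crash
```

Whether a tie should be kept is a design choice: the simplicity criterion in the rules suggests an equal metric with simpler code can still be a win, which a purely numeric check like this one cannot see.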
### Starting an Experiment
```bash
# Single iteration (the agent calls this repeatedly)
python scripts/run_experiment.py --experiment engineering/api-speed --single
# Dry run (test setup before starting)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
```
### Strategy Escalation
- Runs 1-5: Low-hanging fruit (obvious improvements, simple optimizations)
- Runs 6-15: Systematic exploration (vary one parameter at a time)
- Runs 16-30: Structural changes (algorithm swaps, architecture shifts)
- Runs 31+: Radical experiments (completely different approaches)
- If no improvement in 20+ runs: update program.md Strategy section
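The escalation schedule can be written down as a lookup. This is only an illustrative sketch: `strategy_phase` is a hypothetical helper, and it treats run 30 as the last structural run.

```python
def strategy_phase(run_number: int) -> str:
    """Map a 1-based run number to the escalation phase listed above."""
    if run_number < 1:
        raise ValueError("run numbers start at 1")
    if run_number <= 5:
        return "low-hanging fruit"
    if run_number <= 15:
        return "systematic exploration"
    if run_number <= 30:
        return "structural changes"
    return "radical experiments"


for run in (1, 6, 16, 31):
    print(run, strategy_phase(run))
```

The phases are a guide for choosing the next idea, not a hard gate; nothing stops an early run from trying a structural change if the history points that way.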
### Self-Improvement
After every 10 experiments, review results.tsv for patterns. Update the
Strategy section of program.md with what you learned (e.g., "caching changes
consistently improve by 5-10%", "refactoring attempts never improve the metric").
Future iterations benefit from this accumulated knowledge.
### Stopping
- Run until interrupted by the user, context limit reached, or goal in program.md is met
- Before stopping: ensure results.tsv is up to date
- On context limit: the next session can resume — results.tsv and git log persist
### Rules
- **One change per experiment.** Don't change 5 things at once. You won't know what worked.
- **Simplicity criterion.** A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code while getting the same results is the best outcome.
- **Never modify the evaluator.** `evaluate.py` is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
- **Timeout.** If a run exceeds 2.5× the time budget, kill it and treat as crash.
- **Crash handling.** If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", move on. 5 consecutive crashes → pause and alert.
- **No new dependencies.** Only use what's already available in the project.
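The timeout rule can be sketched with the standard library alone. The helper `run_with_budget` is hypothetical, and treating a non-zero exit code as a crash is an assumption of this sketch rather than documented behavior of `run_experiment.py`. The two child commands are stand-ins for a real eval command.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_budget(cmd: List[str], budget_seconds: float) -> Optional[str]:
    """Run cmd; return stdout, or None if it crashed or exceeded 2.5x the budget."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=2.5 * budget_seconds
        )
    except subprocess.TimeoutExpired:
        return None          # subprocess.run kills the child before raising
    if done.returncode != 0:
        return None          # assumption: non-zero exit counts as a crash
    return done.stdout


fast = [sys.executable, "-c", "print('p50_ms: 12.5')"]
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_budget(fast, budget_seconds=10))
print(run_with_budget(slow, budget_seconds=0.2))
```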
---
## Evaluators
Ready-to-use evaluation scripts. Copied into the experiment directory during setup with `--evaluator`.
### Free Evaluators (no API cost)
| Evaluator | Metric | Use Case |
|-----------|--------|----------|
| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |
### LLM Judge Evaluators (uses your subscription)
| Evaluator | Metric | Use Case |
|-----------|--------|----------|
| `llm_judge_content` | `ctr_score` 0-10 (higher) | Headlines, titles, descriptions |
| `llm_judge_prompt` | `quality_score` 0-100 (higher) | System prompts, agent instructions |
| `llm_judge_copy` | `engagement_score` 0-10 (higher) | Social posts, ad copy, emails |
LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside `evaluate.py` — the agent cannot modify it. This prevents the agent from gaming its own evaluator.
The user's existing subscription covers the cost, subject to that plan's usage limits:
- Claude Code Max → Claude calls for evaluation
- Codex CLI (ChatGPT Pro) → Codex calls
- Gemini CLI (free tier) → free evaluation calls, within the free-tier quota
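How the lock on `evaluate.py` is enforced is not described here. One possible check, sketched below under that caveat, is to record a SHA-256 fingerprint of the evaluator at setup time and compare it before each run. The `fingerprint` helper and the temporary file are illustrative only.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    evaluator = Path(tmp) / "evaluate.py"
    evaluator.write_text("print('ctr_score: 7.5')\n")
    locked = fingerprint(evaluator)                     # record once, at setup time

    unchanged = fingerprint(evaluator) == locked        # nothing touched yet
    evaluator.write_text("print('ctr_score: 10')\n")    # simulated tampering
    tampered = fingerprint(evaluator) != locked

print(unchanged, tampered)   # True True
```

A mismatch would be the signal for the hard stop described in the rules, since every earlier metric was produced by a different evaluator.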
### Custom Evaluators
If no built-in evaluator fits, the user writes their own `evaluate.py`. Only requirement: it must print `metric_name: value` to stdout.
```python
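#!/usr/bin/env python3
# My custom evaluator. DO NOT MODIFY after experiment starts
import json
import subprocess

def parse_score(stdout: str) -> float:
    # Assumes my-benchmark prints a JSON object with a "score" field
    return float(json.loads(stdout)["score"])

result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True, check=True)
# Parse and output in the required `metric_name: value` format
print(f"my_metric: {parse_score(result.stdout)}")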
```
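On the reading side, the `metric_name: value` contract can be parsed with a regular expression. The sketch below is not the parser used by `run_experiment.py`; `parse_metric` is a hypothetical helper, and taking the last matching line when the metric is printed more than once is an assumption.

```python
import re
from typing import Optional


def parse_metric(stdout: str, metric_name: str) -> Optional[float]:
    """Return the value from the last `metric_name: value` line, or None if absent."""
    pattern = re.compile(
        rf"^\s*{re.escape(metric_name)}\s*:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$",
        re.MULTILINE,
    )
    matches = pattern.findall(stdout)
    return float(matches[-1]) if matches else None


output = "warming up...\np50_ms: 185.4\ndone\n"
print(parse_metric(output, "p50_ms"))    # 185.4
print(parse_metric(output, "peak_mb"))   # None
```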
---
## Viewing Results
```bash
# Single experiment
python scripts/log_results.py --experiment engineering/api-speed
# All experiments in a domain
python scripts/log_results.py --domain engineering
# Cross-experiment dashboard
python scripts/log_results.py --dashboard
# Export formats
python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
python scripts/log_results.py --dashboard --format markdown --output dashboard.md
```
### Dashboard Output
```
DOMAIN       EXPERIMENT    RUNS  KEPT  BEST     Δ FROM START  STATUS
engineering  api-speed       47    14  185ms    -76.9%        active
engineering  bundle-size     23     8  412KB    -58.3%        paused
marketing    medium-ctr      31    11  8.4/10   +68.0%        active
prompts      support-tone    15     6  82/100   +46.4%        done
```
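The "Δ FROM START" column reads as a relative change from the first measured value to the best one. A minimal sketch of that arithmetic follows; `delta_from_start` is a hypothetical helper, and the starting values are made up to be consistent with the sample rows, since the dashboard does not show them.

```python
def delta_from_start(start: float, best: float) -> str:
    """Format the relative change from the first metric to the best one."""
    if start == 0:
        raise ValueError("start metric must be non-zero")
    return f"{(best - start) / start * 100:+.1f}%"


# Hypothetical starting values chosen to match the sample dashboard rows
print(delta_from_start(800.0, 185.0))   # -76.9%
print(delta_from_start(5.0, 8.4))       # +68.0%
```

The sign carries the direction: a negative change is good for a "lower" metric such as latency, and a positive change is good for a "higher" metric such as a judge score.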
### Export Formats
- **TSV** — default, tab-separated (compatible with spreadsheets)
- **CSV** — comma-separated, with proper quoting
- **Markdown** — formatted table, readable in GitHub/docs
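For a sense of what the Markdown export involves, here is a small sketch that turns TSV text into a Markdown table. It is not the implementation in `log_results.py`; `tsv_to_markdown` and the sample rows are hypothetical. Pipes inside cells are escaped so they do not break the table.

```python
import csv
import io
from typing import List

# Hypothetical sample in the results.tsv column order
SAMPLE_TSV = (
    "commit\tmetric\tstatus\tdescription\n"
    "a1b2c3d\t185.0\tkeep\tcache parsed query\n"
    "b2c3d4e\tN/A\tcrash\tmissing import\n"
)


def tsv_to_markdown(tsv_text: str) -> str:
    """Convert tab-separated text with a header row into a Markdown table."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ""

    def line(cells: List[str]) -> str:
        return "| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |"

    header, body = rows[0], rows[1:]
    separator = "|" + "|".join(" --- " for _ in header) + "|"
    return "\n".join([line(header), separator] + [line(row) for row in body])


print(tsv_to_markdown(SAMPLE_TSV))
```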
---
## Proactive Triggers
Flag these without being asked:
- **No evaluation command works** → Test it before starting the loop. Run once, verify output.
- **Target file not in git** → `git init && git add . && git commit -m 'initial'` first.
- **Metric direction unclear** → Ask: is lower or higher better? Must know before starting.
- **Time budget too short** → If eval takes longer than budget, every run crashes.
- **Agent modifying evaluate.py** → Hard stop. This invalidates all comparisons.
- **5 consecutive crashes** → Pause the loop. Alert the user. Don't keep burning cycles.
- **No improvement in 20+ runs** → Suggest changing strategy in program.md or trying a different approach.
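Two of these triggers depend only on the status column of `results.tsv`, so they can be sketched as plain list scans. The helper names are hypothetical, and "no improvement" is interpreted here as the number of runs since the last `keep`.

```python
from typing import List


def consecutive_crashes(statuses: List[str]) -> int:
    """Count crash rows at the end of the history."""
    count = 0
    for status in reversed(statuses):
        if status != "crash":
            break
        count += 1
    return count


def runs_since_last_keep(statuses: List[str]) -> int:
    """Count rows logged after the most recent 'keep'."""
    count = 0
    for status in reversed(statuses):
        if status == "keep":
            break
        count += 1
    return count


def triggers(statuses: List[str]) -> List[str]:
    """Return the alerts that apply to the current history."""
    alerts = []
    if consecutive_crashes(statuses) >= 5:
        alerts.append("pause: 5 consecutive crashes")
    if runs_since_last_keep(statuses) >= 20:
        alerts.append("suggest: revisit strategy in program.md")
    return alerts


history = ["keep"] + ["discard"] * 3 + ["crash"] * 5
print(triggers(history))   # ['pause: 5 consecutive crashes']
```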
---
## Installation
### One-liner (any tool)
```bash
git clone https://github.com/alirezarezvani/claude-skills.git
cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
```
### Multi-tool install
```bash
# Pick one --tool value: codex, gemini, cursor, windsurf, or openclaw
./scripts/convert.sh --skill autoresearch-agent --tool codex
```
### OpenClaw
```bash
clawhub install cs-autoresearch-agent
```
---
## Related Skills
- **self-improving-agent** — improves an agent's own memory/rules over time. NOT for structured experiment loops.
- **senior-ml-engineer** — ML architecture decisions. Complementary — use for initial design, then autoresearch for optimization.
- **tdd-guide** — test-driven development. Complementary — tests can be the evaluation function.
- **skill-security-auditor** — audit skills before publishing. NOT for optimization loops.
---
## Intent
Use this skill when you want to optimize any file (code, prompts, content, config) by a measurable metric without manual guessing. The agent autonomously edits the target file, evaluates changes against a baseline, commits improvements via git, reverts failures, and repeats. It works for code speed, bundle size, test pass rate, prompt quality, content CTR, or any domain with a quantifiable metric and an evaluation command. Activate this when you say "make this faster," "optimize [file] for [metric]," "improve my headlines," or "run experiments overnight." It requires a git repo, a target file, and an evaluation command that outputs a metric value.
---
## Inputs
- **Target file.** Examples: `src/api/search.py`, `content/titles.md`, `prompts/system.txt`.
- **Evaluation command.** Must print `metric_name: value` to stdout. Examples: `pytest bench.py --tb=no -q`, `python evaluate.py`, `npm run build --stats`.
- **Metric name.** Examples: `p50_ms`, `size_bytes`, `pass_rate`, `ctr_score`.
- **Strategy file (optional).** If the user provides `.autoresearch/{domain}/{name}/program.md`, the agent reads it. If not provided, the agent asks for clarity on what can/cannot change.
- **Evaluator.** `evaluate.py`.
- **Scope.** "project" stores in `.autoresearch/` in repo root (git-tracked, shareable). "user" stores in `~/.autoresearch/` (personal, isolated).
- **Git.** Commands used: `git checkout`, `git add`, `git commit`, `git reset --hard`. Run `git init && git add . && git commit -m 'initial'` if the repo is new.
---
## Procedure
### Setup Steps
- Project scope: create `.autoresearch/{domain}/{experiment_name}/` in the git repo root.
- User scope: create `~/.autoresearch/{domain}/{experiment_name}/`.
- Write the `results.tsv` header: `commit | metric | status | description`. No rows yet.
- Copy the chosen `evaluate.py` (benchmark_speed, llm_judge_content, etc.) into the experiment dir. Mark it as read-only (`chmod 444` on Unix).
- Create the experiment branch: `git checkout -b autoresearch/{domain}/{name}`. Commit the empty results.tsv and config files to this branch.
### Iteration Steps
Input for each iteration: the experiment name (domain/name) and the user's implicit request to run once (`--single` flag) or as a continuous loop (`--loop` flag).
1. **Load config and history:**
   - Read `.autoresearch/{domain}/{name}/config.cfg` into memory (target, evaluate_cmd, metric, metric_direction, time_budget).
   - Read `.autoresearch/{domain}/{name}/program.md` for strategy and constraints.
   - Read `.autoresearch/{domain}/{name}/results.tsv` into memory. Parse rows as a list of (commit_hash, metric_value, status, description).
2. **Ensure on correct branch:** `git checkout autoresearch/{domain}/{name}`. Verify HEAD matches the latest commit hash in results.tsv (if results.tsv is non-empty).
3. **Decide next change:** review the history and pick ONE change to the target file.
4. **Edit target_file:** make the single change. Do not modify evaluate.py or results.tsv.
5. **Commit change:** `git add {target_file} && git commit -m "experiment: {description of change}"`. Capture the commit hash.
6. **Run evaluation:**
   - Parse the output for the `metric: value` pattern. Extract the value as a float. If the metric is not found in the output, log "crash (metric not found)", log metric as "N/A", skip to step 10.
7. **Compare to baseline** (if results.tsv has any previous "keep" rows): check the new value against the previous best in the metric direction.
8. **Handle status:**
   - Discard: `git reset --hard HEAD~1`. Log to results.tsv (this row records the discarded attempt, metric, and description for learning).
   - Crash: `git reset --hard HEAD~1`. Log to results.tsv. Count consecutive crashes (crashes in last 5 iterations). If 5+ consecutive crashes, halt and alert the user with the message "5 consecutive crashes detected. Pause and review strategy in program.md or try a different approach."
9. **Log result:** append a row to results.tsv: `{commit_hash} | {metric_value} | {status} | {description}`.
10. **Self-improvement check:** if the total number of rows in results.tsv is a multiple of 10 (i.e., every 10 experiments), scan results.tsv for patterns. Examples: "all cache-related changes keep", "refactoring attempts never help metric", "simple parameter tweaks most reliable". Append findings to program.md under a "Learned Patterns" section. This informs future iterations.
11. **Decide next action.**
Results are available through `log_results.py` with flags like `--experiment`, `--domain`, `--dashboard`, `--format`, `--output`.
---
## Decision Points
- **If evaluate_cmd fails on baseline:** halt setup. Output the error, then ask the user to debug the eval command offline and retry setup.
- **If metric_direction is not specified:** ask explicitly: "Is lower or higher better for {metric}?" There is no default. An answer is required before the loop starts.
- **If target_file is not in git:** halt before the first commit. Output: "target_file must be in a git repo. Run git init && git add . && git commit -m 'initial' first."
- **If results.tsv shows no improvement in the last 20 runs:** suggest (do not force) reviewing and updating the Strategy section of program.md. Offer to continue or pivot to a new approach.
- **If the agent attempts to modify evaluate.py:** hard stop. Output: "evaluate.py is locked. Modifying it invalidates all comparisons. Revert and try a different target_file change." Do not commit the change.
- **If 5 consecutive crashes occur:** halt the loop. Alert the user: "5 consecutive crashes. Likely a fundamental issue with the target or eval setup. Pause, debug, resume when ready." Provide recent crash descriptions to help diagnose.
- **If eval takes longer than time_budget_minutes:** log a timeout crash on the first few runs. Proactively, during setup, measure eval time on the baseline. If the baseline eval is slower than the time budget, suggest increasing it.
- **If an `llm_judge_*` evaluator is called but the API key is missing or expired:** the eval crashes with an auth error. The agent logs the crash and alerts the user to refresh the API key in the env var.
- **If the experiment branch conflicts with an existing branch:** warn the user: "Branch autoresearch/{domain}/{name} already exists and may have diverged. Reset to main and restart, or resume existing experiment?" The user chooses.
- **If results.tsv is corrupted or unreadable:** the agent falls back to an empty history, logs a warning, and continues from run 1.
---
## Output Contract
- Experiment directory: `.autoresearch/{domain}/{name}/` (project scope) or `~/.autoresearch/{domain}/{name}/` (user scope).
- `results.tsv` header: `commit | metric | status | description`. Initially empty (header only).
- Git branch `autoresearch/{domain}/{name}` created and checked out. Initial commit recorded.
- Result row format: `{git_commit_hash} | {metric_value_or_NA} | {keep|discard|crash} | {human_description}`.
---
## Outcome Signal
Success criteria for the user to confirm the skill worked:
- **Setup succeeded:** the `.autoresearch/{domain}/{name}/` directory exists, config.cfg and program.md are readable, the results.tsv header is present, and there were no errors during setup.
- **First iteration ran:** results.tsv has at least one data row (not just the header). The status is "keep", "discard", or "crash", and metric_value is numeric or "N/A".
- **Improvements compound:** after 10+ iterations, best_metric has moved in the metric_direction (lower for latency, higher for quality). Ideally 5-10% improvement per 10 runs.
- **Git history is clean:** `git log autoresearch/{domain}/{name}` shows only kept commits. Discarded and crashed runs do not pollute history.
- **Loop continues autonomously:** if the `--loop` flag is active, the agent wakes every interval (10m, 1h, daily, etc.) and runs a new iteration without user intervention.
- **Self-improvement visible:** after 20+ iterations, the program.md "Learned Patterns" section contains actionable insights (e.g., "parameter X always helps", "avoid structural changes, simple tweaks work better").
- **No manual intervention needed:** the agent handles crashes, reverts, and logging without the user restarting or debugging each run.
- **Results exportable and readable:** `python scripts/log_results.py --experiment {domain}/{name} --format markdown` produces a human-friendly table. The dashboard view shows progress across multiple experiments.
- **Resumable state:** if the process stops (context limit, user pause), the next session resumes from the results.tsv state. No runs are lost or repeated.
---
Credits: original author Alireza Rezvani, inspired by Karpathy's autoresearch. Enriched for Implexa quality standards.