Autoresearch Agent

Autonomous experiment loop that optimizes any file by a measurable metric. Inspired by Karpathy's autoresearch. The agent edits a target file, runs a fixed e...

SkillRank score: 7.8 / 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

autoresearch-agent runs autonomous experiment loops that iteratively edit a target file, measure outcomes against a fixed metric, and retain only improvements via git. supports engineering (speed, size, tests), marketing (copy, ctr), and custom domains with built-in and user-defined evaluators.

structure: 9.0
trigger phrases: 8.0
procedure: 8.0
edge cases: 7.0
documentation: 8.0

original SKILL.md from clawhub

---
name: "autoresearch-agent"
description: "Autonomous experiment loop that optimizes any file by a measurable metric. Inspired by Karpathy's autoresearch. The agent edits a target file, runs a fixed evaluation, keeps improvements (git commit), discards failures (git reset), and loops indefinitely. Use when: user wants to optimize code speed, reduce bundle/image size, improve test pass rate, optimize prompts, improve content quality (headlines, copy, CTR), or run any measurable improvement loop. Requires: a target file, an evaluation command that outputs a metric, and a git repo."
license: MIT
metadata:
  version: 2.0.0
  author: Alireza Rezvani
  category: engineering
  updated: 2026-03-13
---

# Autoresearch Agent

> You sleep. The agent experiments. You wake up to results.

Autonomous experiment loop inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.

Not one guess — fifty measured attempts, compounding.

---

## Slash Commands

| Command | What it does |
|---------|-------------|
| `/ar:setup` | Set up a new experiment interactively |
| `/ar:run` | Run a single experiment iteration |
| `/ar:loop` | Start autonomous loop with configurable interval (10m, 1h, daily, weekly, monthly) |
| `/ar:status` | Show dashboard and results |
| `/ar:resume` | Resume a paused experiment |

---

## When This Skill Activates

Recognize these patterns from the user:

- "Make this faster / smaller / better"
- "Optimize [file] for [metric]"
- "Improve my [headlines / copy / prompts]"
- "Run experiments overnight"
- "I want to get [metric] from X to Y"
- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch

If the user describes a target file + a way to measure success → this skill applies.

---

## Setup

### First Time — Create the Experiment

Run the setup script. The user decides where experiments live:

**Project-level** (inside repo, git-tracked, shareable with team):
```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "pytest bench.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --scope project
```

**User-level** (personal, in `~/.autoresearch/`):
```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content \
  --scope user
```

The `--scope` flag determines where `.autoresearch/` lives:
- `project` (default) → `.autoresearch/` in the repo root. Experiment definitions are git-tracked. Results are gitignored.
- `user` → `~/.autoresearch/` in the home directory. Everything is personal.
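
As a rough illustration of how the two scopes map to directories, here is a minimal sketch. The helper names `autoresearch_root` and `experiment_dir` are hypothetical and are not taken from `setup_experiment.py`; only the directory layout comes from the description above.

```python
from pathlib import Path


def autoresearch_root(scope: str, repo_root: Path) -> Path:
    """Return the directory that holds experiments for the given scope."""
    if scope == "project":
        return repo_root / ".autoresearch"      # lives inside the repo
    if scope == "user":
        return Path.home() / ".autoresearch"    # personal, outside the repo
    raise ValueError(f"unknown scope: {scope!r}")


def experiment_dir(scope: str, repo_root: Path, domain: str, name: str) -> Path:
    """Return the {domain}/{experiment-name} directory under the scope root."""
    return autoresearch_root(scope, repo_root) / domain / name


print(experiment_dir("project", Path("/repo"), "engineering", "api-speed"))
```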

### What Setup Creates

```
.autoresearch/
├── config.yaml                        ← Global settings
├── .gitignore                         ← Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
    ├── program.md                     ← Objectives, constraints, strategy
    ├── config.cfg                     ← Target, eval cmd, metric, direction
    ├── results.tsv                    ← Experiment log (gitignored)
    └── evaluate.py                    ← Evaluation script (if --evaluator used)
```

**results.tsv columns:** `commit | metric | status | description`
- `commit` — short git hash
- `metric` — float value or "N/A" for crashes
- `status` — keep | discard | crash
- `description` — what changed or why it crashed
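
To make the column layout concrete, the sketch below parses a results log and finds the best kept metric. It assumes the file is tab-separated with a header row naming the four columns; the sample rows and the helper names `load_results` and `best_kept` are made up for illustration.

```python
import csv
import io
from typing import Dict, List, Optional


def load_results(text: str) -> List[Dict[str, str]]:
    """Parse results.tsv content into one dict per experiment row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def best_kept(rows: List[Dict[str, str]], direction: str) -> Optional[float]:
    """Best metric among 'keep' rows, or None if nothing has been kept yet."""
    kept = [float(r["metric"]) for r in rows if r["status"] == "keep"]
    if not kept:
        return None
    return min(kept) if direction == "lower" else max(kept)


# Illustrative sample data, not real results.
sample = (
    "commit\tmetric\tstatus\tdescription\n"
    "a1b2c3d\t210.0\tkeep\tbaseline\n"
    "e4f5a6b\tN/A\tcrash\tmissing import\n"
    "c7d8e9f\t185.0\tkeep\tcache lookups\n"
)
rows = load_results(sample)
print(best_kept(rows, "lower"))  # 185.0
```

Crash rows carry "N/A" in the metric column, so the sketch filters on `status` before converting to float.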

### Domains

| Domain | Use Cases |
|--------|-----------|
| `engineering` | Code speed, memory, bundle size, test pass rate, build time |
| `marketing` | Headlines, social copy, email subjects, ad copy, engagement |
| `content` | Article structure, SEO descriptions, readability, CTR |
| `prompts` | System prompts, chatbot tone, agent instructions |
| `custom` | Anything else with a measurable metric |

### If `program.md` Already Exists

The user may have written their own `program.md`. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.

---

## Agent Protocol

You are the loop. The scripts handle setup and evaluation — you handle the creative work.

### Before Starting
1. Read `.autoresearch/{domain}/{name}/config.cfg` to get:
   - `target` — the file you edit
   - `evaluate_cmd` — the command that measures your changes
   - `metric` — the metric name to look for in eval output
   - `metric_direction` — "lower" or "higher" is better
   - `time_budget_minutes` — max time per evaluation
2. Read `program.md` for strategy, constraints, and what you can/cannot change
3. Read `results.tsv` for experiment history (columns: commit, metric, status, description)
4. Checkout the experiment branch: `git checkout autoresearch/{domain}/{name}`

### Each Iteration
1. Review results.tsv — what worked? What failed? What hasn't been tried?
2. Decide ONE change to the target file. One variable per experiment.
3. Edit the target file
4. Commit: `git add {target} && git commit -m "experiment: {description}"`
5. Evaluate: `python scripts/run_experiment.py --experiment {domain}/{name} --single`
6. Read the output — it prints KEEP, DISCARD, or CRASH with the metric value
7. Go to step 1

### What the Script Handles (you don't)
- Running the eval command with timeout
- Parsing the metric from eval output
- Comparing to previous best
- Reverting the commit on failure (`git reset --hard HEAD~1`)
- Logging the result to results.tsv
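
The comparison step can be pictured with a small sketch. This is not the code in `run_experiment.py`; the function `decide` is a hypothetical stand-in that assumes a missing metric means a crash and that the first successful run becomes the baseline.

```python
from typing import Optional


def decide(new_value: Optional[float], best: Optional[float], direction: str) -> str:
    """Return 'crash', 'keep', or 'discard' for a single run."""
    if new_value is None:      # eval failed or the metric was not found
        return "crash"
    if best is None:           # nothing kept yet: this run is the baseline
        return "keep"
    if direction == "lower":
        improved = new_value < best
    else:
        improved = new_value > best
    return "keep" if improved else "discard"


print(decide(185.0, 210.0, "lower"))   # keep
print(decide(7.9, 8.4, "higher"))      # discard
```

A tie counts as a discard in this sketch, since an equal value is not an improvement.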

### Starting an Experiment

```bash
# Single iteration (the agent calls this repeatedly)
python scripts/run_experiment.py --experiment engineering/api-speed --single

# Dry run (test setup before starting)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
```

### Strategy Escalation
- Runs 1-5: Low-hanging fruit (obvious improvements, simple optimizations)
- Runs 6-15: Systematic exploration (vary one parameter at a time)
- Runs 16-30: Structural changes (algorithm swaps, architecture shifts)
- Runs 31+: Radical experiments (completely different approaches)
- If no improvement in 20+ runs: update program.md Strategy section
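
One way to read the escalation schedule is as a lookup from run number to tier. The sketch below assumes run 30 still belongs to the structural tier; `strategy_tier` is a hypothetical helper, not part of the skill's scripts.

```python
def strategy_tier(run_number: int) -> str:
    """Map a 1-based run number to an escalation tier."""
    if run_number < 1:
        raise ValueError("run numbers start at 1")
    if run_number <= 5:
        return "low-hanging fruit"
    if run_number <= 15:
        return "systematic exploration"
    if run_number <= 30:
        return "structural changes"
    return "radical experiments"


for n in (1, 6, 16, 31):
    print(n, strategy_tier(n))
```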

### Self-Improvement
After every 10 experiments, review results.tsv for patterns. Update the
Strategy section of program.md with what you learned (e.g., "caching changes
consistently improve by 5-10%", "refactoring attempts never improve the metric").
Future iterations benefit from this accumulated knowledge.

### Stopping
- Run until interrupted by the user, context limit reached, or goal in program.md is met
- Before stopping: ensure results.tsv is up to date
- On context limit: the next session can resume — results.tsv and git log persist

### Rules

- **One change per experiment.** Don't change 5 things at once. You won't know what worked.
- **Simplicity criterion.** A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code while getting the same results is the best outcome.
- **Never modify the evaluator.** `evaluate.py` is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
- **Timeout.** If a run exceeds 2.5× the time budget, kill it and treat as crash.
- **Crash handling.** If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", move on. 5 consecutive crashes → pause and alert.
- **No new dependencies.** Only use what's already available in the project.
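
The crash-handling rule depends on counting an unbroken streak of crashes. A minimal sketch, assuming the statuses are read from results.tsv in chronological order (the helper names are hypothetical):

```python
from typing import Sequence


def consecutive_crashes(statuses: Sequence[str]) -> int:
    """Count crashes at the end of the history, stopping at the first non-crash."""
    count = 0
    for status in reversed(statuses):
        if status != "crash":
            break
        count += 1
    return count


def should_pause(statuses: Sequence[str], limit: int = 5) -> bool:
    """True once the trailing crash streak reaches the limit."""
    return consecutive_crashes(statuses) >= limit


history = ["keep", "crash", "crash", "crash", "crash", "crash"]
print(should_pause(history))  # True
```

A single keep or discard resets the streak, so scattered crashes across a long run do not trigger the pause.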

---

## Evaluators

Ready-to-use evaluation scripts. Copied into the experiment directory during setup with `--evaluator`.

### Free Evaluators (no API cost)

| Evaluator | Metric | Use Case |
|-----------|--------|----------|
| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |

### LLM Judge Evaluators (uses your subscription)

| Evaluator | Metric | Use Case |
|-----------|--------|----------|
| `llm_judge_content` | `ctr_score` 0-10 (higher) | Headlines, titles, descriptions |
| `llm_judge_prompt` | `quality_score` 0-100 (higher) | System prompts, agent instructions |
| `llm_judge_copy` | `engagement_score` 0-10 (higher) | Social posts, ad copy, emails |

LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside `evaluate.py` — the agent cannot modify it. This prevents the agent from gaming its own evaluator.

The user's existing subscription covers the cost, subject to each plan's usage limits:
- Claude Code Max → Claude calls for evaluation
- Codex CLI (ChatGPT Pro) → Codex calls
- Gemini CLI (free tier) → free evaluation calls

### Custom Evaluators

If no built-in evaluator fits, the user writes their own `evaluate.py`. Only requirement: it must print `metric_name: value` to stdout.

```python
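#!/usr/bin/env python3
# My custom evaluator: DO NOT MODIFY after experiment starts
import json
import subprocess

def parse_score(raw: str) -> float:
    # Assumption: the benchmark prints JSON with a top-level "score" field.
    return float(json.loads(raw)["score"])

# check=True makes a failing benchmark exit non-zero, so the run is logged as a crash
result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True, check=True)
print(f"my_metric: {parse_score(result.stdout)}")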
```

---

## Viewing Results

```bash
# Single experiment
python scripts/log_results.py --experiment engineering/api-speed

# All experiments in a domain
python scripts/log_results.py --domain engineering

# Cross-experiment dashboard
python scripts/log_results.py --dashboard

# Export formats
python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
python scripts/log_results.py --dashboard --format markdown --output dashboard.md
```

### Dashboard Output

```
DOMAIN          EXPERIMENT          RUNS  KEPT  BEST         Δ FROM START  STATUS
engineering     api-speed            47    14   185ms        -76.9%        active
engineering     bundle-size          23     8   412KB        -58.3%        paused
marketing       medium-ctr           31    11   8.4/10       +68.0%        active
prompts         support-tone         15     6   82/100       +46.4%        done
```
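
The "Δ FROM START" column is presumably a percent change between the first recorded metric and the current best. A minimal sketch under that assumption, with made-up numbers rather than the dashboard's own data:

```python
def delta_from_start(start: float, best: float) -> str:
    """Signed percent change from the starting metric to the best metric."""
    if start == 0:
        return "n/a"  # percent change is undefined for a zero baseline
    return f"{(best - start) / start * 100:+.1f}%"


print(delta_from_start(200.0, 150.0))  # -25.0%
print(delta_from_start(5.0, 8.0))      # +60.0%
```

The sign follows the metric, not the direction of improvement, so a "lower is better" metric shows a negative delta when it improves.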

### Export Formats

- **TSV** — default, tab-separated (compatible with spreadsheets)
- **CSV** — comma-separated, with proper quoting
- **Markdown** — formatted table, readable in GitHub/docs

---

## Proactive Triggers

Flag these without being asked:

- **No evaluation command works** → Test it before starting the loop. Run once, verify output.
- **Target file not in git** → `git init && git add . && git commit -m 'initial'` first.
- **Metric direction unclear** → Ask: is lower or higher better? Must know before starting.
- **Time budget too short** → If eval takes longer than budget, every run crashes.
- **Agent modifying evaluate.py** → Hard stop. This invalidates all comparisons.
- **5 consecutive crashes** → Pause the loop. Alert the user. Don't keep burning cycles.
- **No improvement in 20+ runs** → Suggest changing strategy in program.md or trying a different approach.

---

## Installation

### One-liner (any tool)
```bash
git clone https://github.com/alirezarezvani/claude-skills.git
cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
```

### Multi-tool install
```bash
# Pass one tool per invocation: codex, gemini, cursor, windsurf, or openclaw
./scripts/convert.sh --skill autoresearch-agent --tool codex
```

### OpenClaw
```bash
clawhub install cs-autoresearch-agent
```

---

## Related Skills

- **self-improving-agent** — improves an agent's own memory/rules over time. NOT for structured experiment loops.
- **senior-ml-engineer** — ML architecture decisions. Complementary — use for initial design, then autoresearch for optimization.
- **tdd-guide** — test-driven development. Complementary — tests can be the evaluation function.
- **skill-security-auditor** — audit skills before publishing. NOT for optimization loops.

added explicit intent, inputs with external connection details, decision points for error cases and branching logic, output contract specifying file formats and structure, and outcome signal defining success criteria. preserved original procedure faithfully and added edge case handling for auth expiry, network timeouts, and evaluation failures.

Autoresearch Agent

you sleep. the agent experiments. you wake up to results.

autonomous experiment loop inspired by Karpathy's autoresearch. the agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.

not one guess: fifty measured attempts, compounding.

intent

use this skill when you want to optimize any file (code, prompts, content, config) by a measurable metric without manual guessing. the agent autonomously edits the target file, evaluates changes against a baseline, commits improvements via git, reverts failures, and repeats. works for code speed, bundle size, test pass rate, prompt quality, content CTR, or any domain with a quantifiable metric and an evaluation command. activate this when you say "make this faster," "optimize [file] for [metric]," "improve my headlines," or "run experiments overnight." requires a git repo, a target file, and an evaluation command that outputs a metric value.

inputs

required

target_file (string, filepath): the file the agent will edit. must be tracked in git. examples: src/api/search.py, content/titles.md, prompts/system.txt.
evaluate_cmd (string, shell command): the command that measures success. must output a metric value in format metric_name: value to stdout. examples: pytest bench.py --tb=no -q, python evaluate.py, npm run build --stats.
metric (string): the name of the metric to extract from eval output. examples: p50_ms, size_bytes, pass_rate, ctr_score.
metric_direction (enum: "lower" or "higher"): whether the metric improves by going down (lower) or up (higher). examples: lower for latency/size, higher for pass rate/quality score.
git_repo (filepath): root of a git repository. the agent will create branches, commit, and reset within this repo.

optional

program_md (string, filepath or text): human-readable objectives, constraints, and strategy. if file exists at .autoresearch/{domain}/{name}/program.md, the agent reads it. if not provided, the agent asks for clarity on what can/cannot change.
time_budget_minutes (integer, default 5): max minutes allowed per evaluation run. if eval exceeds 2.5x this budget, kill it and log as crash.
evaluator_type (enum: "benchmark_speed", "benchmark_size", "test_pass_rate", "build_speed", "memory_usage", "llm_judge_content", "llm_judge_prompt", "llm_judge_copy", "custom"): which evaluation template to use. if custom, user provides their own evaluate.py.
scope (enum: "project" or "user", default "project"): where to store experiment config and results. "project" stores .autoresearch/ in repo root (git-tracked, shareable). "user" stores in ~/.autoresearch/ (personal, isolated).
domain (string, default "custom"): category for organizing experiments. examples: engineering, marketing, content, prompts, custom.
experiment_name (string): human-readable name for this experiment. examples: api-speed, medium-ctr, bundle-shrink.

external connections

git_cli (required): local git binary (usually present). the agent runs git checkout, git add, git commit, git reset --hard.
python_runtime (required): python 3.8+. the agent runs evaluation scripts and helper utilities.
llm_subscription (conditional, if using llm_judge_* evaluators): existing CLI tool subscription (Claude, Codex, Gemini) that user already runs. evaluation calls the same tool, no additional cost beyond user's current plan. requires cli tool configured with api key in env var (ANTHROPIC_API_KEY for Claude, OPENAI_API_KEY for Codex, GOOGLE_API_KEY for Gemini).

edge cases and validation

target_file not in git: agent will fail on first commit. pre-requisite: run git init && git add . && git commit -m 'initial' if repo is new.
evaluate_cmd fails on baseline: evaluation will crash. agent validates this before starting loop. proactive trigger: test eval command manually once before starting.
metric not found in eval output: agent logs "N/A" and treats run as crash if metric is missing.
network timeout on eval: if eval command makes external requests (api calls, llm judgments) and times out, treated as crash after time_budget_minutes × 2.5.
llm subscription expired or rate-limited: llm_judge_* evaluators will fail. agent logs crash. manual fix: refresh api key in env var or upgrade subscription.
git branch conflict: if experiment branch already exists and has diverged from main, agent warns before proceeding.

procedure

phase 1: setup (one time)

input: user provides target_file, evaluate_cmd, metric, metric_direction, domain, experiment_name, scope, optional program_md.
validate evaluate_cmd: run it once on baseline target_file in a sandbox. capture stdout. check metric value is parseable. if eval fails or metric not found, halt and ask user to fix evaluate_cmd.
create directory structure:
- if scope == "project", create .autoresearch/{domain}/{experiment_name}/ in git repo root.
- if scope == "user", create ~/.autoresearch/{domain}/{experiment_name}/.
- no subdirectories are needed: all files sit flat in the experiment directory.
write config.cfg: save target, evaluate_cmd, metric, metric_direction, time_budget_minutes as key-value pairs.
write or read program.md: if user provided program_md content or file path, use it. else create template with sections: objectives, constraints, strategy, notes. ask user to fill in objectives and constraints if not provided.
write results.tsv header: columns: commit | metric | status | description. no rows yet.
copy evaluator script (if --evaluator flag used): copy ready-made evaluate.py (benchmark_speed, llm_judge_content, etc.) into experiment dir. mark as read-only (chmod 444 on unix).
create git branch: git checkout -b autoresearch/{domain}/{name}. commit the config files to this branch (under project scope, results.tsv is gitignored and stays untracked).
output: print setup complete, directory path, first iteration instructions.
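
the config-writing step might look like the sketch below, assuming the ini-style option. the section name `experiment` and the function `render_config` are hypothetical; only the five keys come from the text above.

```python
import configparser
import io


def render_config(target: str, evaluate_cmd: str, metric: str,
                  metric_direction: str, time_budget_minutes: int = 5) -> str:
    """Render config.cfg content as ini-style key-value pairs."""
    if metric_direction not in ("lower", "higher"):
        raise ValueError("metric_direction must be 'lower' or 'higher'")
    # interpolation=None keeps '%' characters in shell commands literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser["experiment"] = {
        "target": target,
        "evaluate_cmd": evaluate_cmd,
        "metric": metric,
        "metric_direction": metric_direction,
        "time_budget_minutes": str(time_budget_minutes),
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(render_config("src/api/search.py", "pytest bench.py --tb=no -q", "p50_ms", "lower"))
```

rejecting an unknown direction at setup time mirrors the rule that the loop must not start until lower-or-higher is settled.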

phase 2: experiment loop (agent runs repeatedly, each call is one iteration)

input for each iteration: experiment name (domain/name), user's implicit request to run once (--single flag) or continuous loop (--loop flag).
load config and history:
- read .autoresearch/{domain}/{name}/config.cfg into memory (target, evaluate_cmd, metric, metric_direction, time_budget).
- read .autoresearch/{domain}/{name}/program.md for strategy and constraints.
- read .autoresearch/{domain}/{name}/results.tsv into memory. parse rows as list of (commit_hash, metric_value, status, description).
ensure on correct branch: git checkout autoresearch/{domain}/{name}. verify HEAD matches latest commit hash in results.tsv (if results.tsv is non-empty).
decide next change:
- review results.tsv: which changes kept? which crashed? which directions (code style, algorithm, parameter) haven't been tried?
- apply strategy escalation: runs 1-5 are obvious wins (caching, removing debug prints), runs 6-15 are systematic (vary one hyperparameter at a time), runs 16-30 are structural (swap algorithm), runs 31+ are radical (rewrite subsystem).
- check simplicity criterion: prefer small improvement with simple code over big improvement with complex code.
- decide one atomic change to target_file (one variable, one function, one config line).
- output to user: what you're about to try and why.
edit target_file: make the single change. do not modify evaluate.py or results.tsv.
commit change: git add {target_file} && git commit -m "experiment: {description of change}". capture commit hash.
run evaluation:
- execute evaluate_cmd in a subprocess with a timeout of time_budget_minutes × 2.5 minutes (that is, time_budget_minutes × 150 seconds).
- capture stdout and stderr.
- if subprocess exceeds timeout, kill it, log "crash (timeout)" in status, log metric as "N/A", skip the comparison and go to the handle status step.
- if subprocess returns non-zero exit code, log "crash ({error type})" in status, log metric as "N/A", skip the comparison and go to the handle status step.
- if subprocess succeeds, parse stdout for metric: value pattern. extract value as float. if metric not found in output, log "crash (metric not found)", log metric as "N/A", skip the comparison and go to the handle status step.
compare to baseline (if results.tsv has any previous "keep" rows):
- find best previous metric value among rows where status == "keep".
- if metric_direction == "lower", new_value < best_value is improvement. if metric_direction == "higher", new_value > best_value is improvement.
- if improvement (or first run with no previous keeps), set status to "keep". else set status to "discard".
handle status:
- if status == "keep": leave commit in place. log to results.tsv.
- if status == "discard": git reset --hard HEAD~1. log to results.tsv (this row records the discarded attempt, metric, and description for learning).
- if status == "crash": git reset --hard HEAD~1. log to results.tsv. count consecutive crashes (crashes in last 5 iterations). if 5+ consecutive crashes, halt and alert user with message "5 consecutive crashes detected. pause and review strategy in program.md or try different approach."
log result: append row to results.tsv: {commit_hash} | {metric_value} | {status} | {description}.
self-improvement check: if total rows in results.tsv is multiple of 10 (i.e., every 10 experiments), scan results.tsv for patterns. examples: "all cache-related changes keep", "refactoring attempts never help metric", "simple parameter tweaks most reliable". append findings to program.md under "Learned Patterns" section. this informs future iterations.
decide next action:

if --single flag: stop here, return results.tsv row and summary to user.
if --loop flag: sleep for configurable interval (10m, 1h, daily, weekly, monthly). return to the load config and history step.
if user interrupted: halt, ensure results.tsv saved, commit any pending changes.
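
the run-evaluation step above can be sketched as follows. this is an illustration under stated assumptions, not the code in run_experiment.py: the command is passed as an argument list rather than a shell string, the metric line is assumed to be `name: number` on its own line, and `parse_metric` and `run_eval` are hypothetical names.

```python
import re
import subprocess
import sys
from typing import List, Optional, Tuple


def parse_metric(stdout: str, metric: str) -> Optional[float]:
    """Find a line of the form 'metric: value' and return the value as a float."""
    pattern = rf"^{re.escape(metric)}:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
    match = re.search(pattern, stdout, re.MULTILINE)
    return float(match.group(1)) if match else None


def run_eval(cmd: List[str], metric: str,
             time_budget_minutes: float) -> Tuple[str, Optional[float]]:
    """Run the eval command once and classify the outcome."""
    timeout_seconds = time_budget_minutes * 60 * 2.5   # kill at 2.5x the budget
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return "crash (timeout)", None
    if proc.returncode != 0:
        return f"crash (exit {proc.returncode})", None
    value = parse_metric(proc.stdout, metric)
    if value is None:
        return "crash (metric not found)", None
    return "ok", value


# Stand-in eval command that just prints a metric line.
cmd = [sys.executable, "-c", "print('p50_ms: 12.5')"]
print(run_eval(cmd, "p50_ms", time_budget_minutes=1))
```

an "ok" result still has to be compared against the best kept value before it becomes keep or discard.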

phase 3: viewing and exporting results

input: user requests log_results.py with flags like --experiment, --domain, --dashboard, --format, --output.
read results.tsv for specified experiment(s).
parse and aggregate: for each experiment, compute: run count, kept count, best metric, delta from start, status (active/paused/done).
format output:
- if --format tsv: output raw results.tsv.
- if --format csv: convert to csv, proper quoting.
- if --format markdown: render as markdown table with headers.
- if --dashboard: show summary table across all experiments or domain.
output to stdout or file (if --output specified).
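
for the markdown export, a minimal sketch of rendering rows as a table. the function `to_markdown` is hypothetical and the sample row is invented; log_results.py may format its output differently.

```python
from typing import Dict, List

HEADERS = ["commit", "metric", "status", "description"]


def to_markdown(rows: List[Dict[str, str]]) -> str:
    """Render result rows as a markdown table with the results.tsv columns."""
    lines = [
        "| " + " | ".join(HEADERS) + " |",
        "|" + "|".join(["---"] * len(HEADERS)) + "|",
    ]
    for row in rows:
        # Escape pipes so a description cannot break the table layout.
        cells = [str(row[h]).replace("|", "\\|") for h in HEADERS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


rows = [{"commit": "a1b2c3d", "metric": "185.0", "status": "keep",
         "description": "cache lookups"}]
print(to_markdown(rows))
```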

decision points

if evaluate_cmd fails on baseline: halt setup. output error, ask user to debug eval command offline and retry setup.
if metric_direction not specified: ask explicitly: "is lower or higher better for {metric}?" no default. must have answer before loop starts.
if target_file not in git: halt before first commit. output: "target_file must be in a git repo. run git init && git add . && git commit -m 'initial' first."
if results.tsv shows no improvement in last 20 runs: suggest (do not force) reviewing and updating strategy section of program.md. offer to continue or pivot to new approach.
if agent attempts to modify evaluate.py: hard stop. output: "evaluate.py is locked. modifying it invalidates all comparisons. revert and try different target_file change." do not commit the change.
if 5 consecutive crashes occur: halt loop. alert user: "5 consecutive crashes. likely fundamental issue with target or eval setup. pause, debug, resume when ready." provide recent crash descriptions to help diagnose.
if eval takes longer than time_budget_minutes: log timeout crash on first few runs. proactive: on setup, measure eval time on baseline. if baseline eval is slower than time_budget, suggest increasing time_budget.
if llm_judge_* evaluator called but api key missing or expired: eval crashes with auth error. agent logs crash. alert user to refresh api key in env var.
if experiment branch conflicts with existing branch: warn user: "branch autoresearch/{domain}/{name} already exists and may have diverged. reset to main and restart, or resume existing experiment?" user chooses.
if results.tsv is corrupted or unreadable: agent falls back to empty history, logs warning, continues from run 1.
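
the "no improvement in last 20 runs" check could be approximated by counting runs since the most recent keep. a minimal sketch, assuming statuses are listed oldest first; `runs_since_last_keep` and `is_stalled` are hypothetical names.

```python
from typing import Sequence


def runs_since_last_keep(statuses: Sequence[str]) -> int:
    """Number of runs after the most recent 'keep' (all runs if none was kept)."""
    count = 0
    for status in reversed(statuses):
        if status == "keep":
            break
        count += 1
    return count


def is_stalled(statuses: Sequence[str], threshold: int = 20) -> bool:
    """True when at least `threshold` runs have passed without a keep."""
    return runs_since_last_keep(statuses) >= threshold


history = ["keep"] + ["discard"] * 18 + ["crash"] * 2
print(is_stalled(history))  # True
```

crashes count toward the stall here, since a crashed run did not improve the metric either.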

output contract

setup phase outputs

directory structure created at .autoresearch/{domain}/{name}/ (project scope) or ~/.autoresearch/{domain}/{name}/ (user scope).
config.cfg file contains keys: target, evaluate_cmd, metric, metric_direction, time_budget_minutes. format: ini-style or yaml.
program.md file contains human-readable sections: objectives, constraints, strategy, learned patterns (updated every 10 runs).
results.tsv file created with header row commit | metric | status | description. initially empty (header only).
evaluate.py (if applicable) copied into experiment dir, marked read-only.
git branch autoresearch/{domain}/{name} created and checked out. initial commit recorded.

per-iteration outputs

results.tsv appended with one row per iteration: {git_commit_hash} | {metric_value_or_NA} | {keep|discard|crash} | {human_description}.
git log shows one commit per kept experiment (discarded and crashed runs do not appear in log, only in results.tsv).
stdout from agent includes: change attempted, eval output snippet, status (keep/discard/crash), metric value, next strategy.

viewing outputs

single experiment view: tabular display of all runs, columns: iteration#, commit, metric, status, description. sortable by metric or date.
dashboard view: summary table across experiments, columns: domain, name, run_count, kept_count, best_metric, delta_from_start, status.
export formats: tsv (raw), csv (quoted), markdown (github-compatible table).

outcome signal

success criteria for user to confirm skill worked:

setup succeeded: .autoresearch/{domain}/{name}/ directory exists, config.cfg and program.md are readable, results.tsv header is present, no errors during setup.
first iteration ran: results.tsv has at least one data row (not just header). status is "keep", "discard", or "crash", metric_value is numeric or "N/A".
improvements compound: after 10+ iterations, best_metric has moved in the metric_direction (lower for latency, higher for quality). ideally 5-10% improvement per 10 runs.
git history is clean: git log autoresearch/{domain}/{name} shows only kept commits. discarded and crashed runs do not pollute history.
loop continues autonomously: if --loop flag active, agent wakes every interval (10m, 1h, daily, etc.) and runs new iteration without user intervention.
self-improvement visible: after 20+ iterations, program.md "Learned Patterns" section contains actionable insights (e.g., "parameter X always helps", "avoid structural changes, simple tweaks work better").
no manual intervention needed: agent handles crashes, reverts, and logging without user restarting or debugging each run.
results exportable and readable: python scripts/log_results.py --experiment {domain}/{name} --format markdown produces human-friendly table. dashboard view shows progress across multiple experiments.
resumable state: if process stops (context limit, user pause), next session resumes from results.tsv state. no runs are lost or repeated.

related skills

self-improving-agent: improves an agent's own memory and rules over time. not for structured experiment loops. complementary if autoresearch agent needs to refine its own strategy mid-loop.
senior-ml-engineer: ml architecture decisions and design review. complementary for initial system design, autoresearch optimizes after design is stable.
tdd-guide: test-driven development. complementary, test suite can serve as evaluate_cmd (metric: pass_rate).
skill-security-auditor: audit skills before publishing. orthogonal, not for optimization loops.

credits: original author Alireza Rezvani, inspired by Karpathy's autoresearch. enriched for Implexa quality standards.