Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when somet...
---
name: diagnose
description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when something is broken, throwing errors, failing tests, or performing poorly.
author: GenorTG
attribution: Adapted from the GSD Core project by TÂCHES / open-gsd (https://github.com/open-gsd/gsd-core)
---
# Diagnose
A discipline for hard bugs. **Skip phases only when explicitly justified.**
When exploring the codebase, use the project's domain glossary to get a clear mental model of the relevant modules, and check ADRs in the area you're touching.
## Phase 1 — Build a Feedback Loop
**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause. If you don't, no amount of staring at code will help.
Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.**
### Ways to construct one — try in roughly this order
1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e.
2. **Curl / HTTP script** against a running dev server.
3. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot.
4. **Browser snapshot** via Playwright / Puppeteer — drives the UI, asserts on DOM/console/network.
5. **Replay a captured trace.** Save a real request/payload/event log; replay it through the code path in isolation.
6. **Throwaway harness.** Minimal subset of the system that exercises the bug code path with a single call.
7. **Property / fuzz loop.** If the bug is "sometimes wrong output," run 1000 random inputs and look for the failure mode.
8. **Bisection harness.** Automate "boot at state X, check, repeat" so you can `git bisect run` it.
9. **Differential loop.** Run same input through old vs new version, diff outputs.
### Iterate on the loop itself
Once you have a loop, ask:
- Can I make it **faster**? (Skip unrelated init, narrow scope.)
- Can I make the **signal sharper**? (Assert on specific symptom, not "didn't crash.")
- Can I make it more **deterministic**? (Pin time, seed RNG, isolate filesystem.)
A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.
### Non-deterministic bugs
The goal is not a clean repro but a **higher reproduction rate**. Loop 100×, parallelise, add stress, narrow timing windows. A 50%-flake bug is debuggable; 1% is not — keep raising the rate.
### When you genuinely cannot build a loop
Stop and say so explicitly. List what you tried. Ask the user for: (a) access to the reproducing environment, (b) a captured artifact (HAR file, log dump, core dump), or (c) permission to add temporary production instrumentation. **Do not** proceed to hypothesise without a loop.
## Phase 2 — Reproduce
Run the loop. Watch the bug appear.
Confirm:
- [ ] The loop produces the failure mode the **user** described — not a different nearby failure
- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, at a high enough rate)
- [ ] You have captured the exact symptom so later phases can verify the fix
Do not proceed until you reproduce the bug.
## Phase 3 — Hypothesise
Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.
Each hypothesis must be **falsifiable**: state the prediction it makes.
> Format: "If \<X\> is the cause, then \<changing Y\> will make the bug disappear / \<changing Z\> will make it worse."
If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it.
**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"). Don't block on it — proceed if the user is AFK.
## Phase 4 — Instrument
Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.**
Tool preference:
1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs.
2. **Targeted logs** at the boundaries that distinguish hypotheses.
3. Never "log everything and grep."
**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Tagged logs die; untagged logs survive.
**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler), then bisect. Measure first, fix second.
## Phase 5 — Fix + Regression Test
Write the regression test **before the fix** — but only if there is a **correct seam** for it.
A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow, a regression test there gives false confidence.
If no correct seam exists, **that itself is the finding.** Flag it — the architecture is preventing the bug from being locked down.
If a correct seam exists:
1. Turn the minimised repro into a failing test at that seam.
2. Watch it fail.
3. Apply the fix.
4. Watch it pass.
5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario.
## Phase 6 — Cleanup + Post-Mortem
Required before declaring done:
- [ ] Original repro no longer reproduces (re-run the Phase 1 loop)
- [ ] Regression test passes (or absence of seam is documented)
- [ ] All `[DEBUG-...]` instrumentation removed (`grep` the prefix)
- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location)
- [ ] The correct hypothesis is stated in the commit / PR message — so the next debugger learns
**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers), consider running the `/improve-codebase-architecture` skill with the specifics.don't have the plugin yet? install it then click "run inline in claude" again.