skills.sh by @obra

systematic-debugging

Use when encountering any bug, test failure, or unexpected behavior, before proposing fixes

Systematic Debugging

Overview

Random fixes waste time and create new bugs. Quick patches mask underlying issues.

Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.

Violating the letter of this process is violating the spirit of debugging.

The Iron Law

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

If you haven't completed Phase 1, you cannot propose fixes.

When to Use
Use for ANY technical issue:

Test failures

Bugs in production

Unexpected behavior

Performance problems

Build failures

Integration issues

Use this ESPECIALLY when:

Under time pressure (emergencies make guessing tempting)

"Just one quick fix" seems obvious

You've already tried multiple fixes

Previous fix didn't work

You don't fully understand the issue

Don't skip when:

Issue seems simple (simple bugs have root causes too)

You're in a hurry (rushing guarantees rework)

Manager wants it fixed NOW (systematic is faster than thrashing)

The Four Phases

You MUST complete each phase before proceeding to the next.

Phase 1: Root Cause Investigation

BEFORE attempting ANY fix:

Read Error Messages Carefully

Don't skip past errors or warnings

They often contain the exact solution

Read stack traces completely

Note line numbers, file paths, error codes

Reproduce Consistently

Can you trigger it reliably?

What are the exact steps?

Does it happen every time?

If not reproducible → gather more data, don't guess
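One way to answer "does it happen every time?" is to run the suspected trigger repeatedly and count the failures. This is a minimal sketch, assuming the reproduction can be expressed as a single command that exits non-zero on failure. `repro_rate` is a hypothetical helper name, not part of this skill.

```shell
# repro_rate N CMD [ARGS...]
# Runs CMD N times and reports how many runs failed (non-zero exit).
# Hypothetical helper for illustration; substitute your own reproduction command.
repro_rate() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Output is discarded here; rerun a failing case by hand to read its errors.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}
```

For example, `repro_rate 20 ./run_test.sh`, where `./run_test.sh` stands in for your own reproduction command. If every run fails, the trigger is reliable. If only some fail, the issue is intermittent and needs more data before any hypothesis.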

Check Recent Changes

What changed that could cause this?

Git diff, recent commits

New dependencies, config changes

Environmental differences

Gather Evidence in Multi-Component Systems

WHEN system has multiple components (CI → build → signing, API → service → database):

BEFORE proposing fixes, add diagnostic instrumentation:

For EACH component boundary:
  - Log what data enters component
  - Log what data exits component
  - Verify environment/config propagation
  - Check state at each layer

Run once to gather evidence showing WHERE it breaks
THEN analyze evidence to identify failing component
THEN investigate that specific component

Example (multi-layer system):

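# Layer 1: Workflow
# Report only whether the secret is set; never print its value into the logs.
echo "=== Secrets available in workflow: ==="
if [ -n "${IDENTITY:-}" ]; then echo "IDENTITY: SET"; else echo "IDENTITY: UNSET"; fi

# Layer 2: Build script
# Check that the variable reached this process's environment, again without echoing it.
echo "=== Env vars in build script: ==="
env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"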

This reveals: Which layer fails (secrets → workflow ✓, workflow → build ✗)

Trace Data Flow

WHEN error is deep in call stack:

See root-cause-tracing.md in this directory for the complete backward tracing technique.

Quick version:

Where does bad value originate?

What called this with bad value?

Keep tracing up until you find the source

Fix at source, not at symptom

Phase 2: Pattern Analysis

Find the pattern before fixing:

Find Working Examples

Locate similar working code in same codebase

What works that's similar to what's broken?

Compare Against References

If implementing pattern, read reference implementation COMPLETELY

Don't skim - read every line

Understand the pattern fully before applying

Identify Differences

What's different between working and broken?

List every difference, however small

Don't assume "that can't matter"
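Listing every difference is easier when a tool does the comparison. This is a minimal sketch, assuming the working and broken variants exist as two files (config files, saved environment dumps, or source files). `list_differences` is a hypothetical helper name.

```shell
# list_differences WORKING BROKEN
# Prints every differing line between two files in unified diff format.
# diff exits with 1 when the files differ, so that status is treated as
# "differences found" rather than as an error.
list_differences() {
    status=0
    diff -u "$1" "$2" || status=$?
    if [ "$status" -gt 1 ]; then
        echo "diff could not compare the files" >&2
        return "$status"
    fi
    return 0
}
```

Lines prefixed with `-` come from the working file and lines prefixed with `+` from the broken one. Read the whole output rather than stopping at the first plausible culprit.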

Understand Dependencies

What other components does this need?

What settings, config, environment?

What assumptions does it make?

Phase 3: Hypothesis and Testing

Scientific method:

Form Single Hypothesis

State clearly: "I think X is the root cause because Y"

Write it down

Be specific, not vague

Test Minimally

Make the SMALLEST possible change to test hypothesis

One variable at a time

Don't fix multiple things at once
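As an illustration of changing one variable at a time, the sketch below reruns a command with exactly one environment variable overridden and everything else untouched. It assumes the hypothesis concerns an environment variable. `try_one_change` and the names in the usage note are hypothetical.

```shell
# try_one_change NAME VALUE CMD [ARGS...]
# Runs CMD with exactly one environment variable overridden, so any change in
# outcome can be attributed to that variable alone.
try_one_change() {
    name=$1
    value=$2
    shift 2
    if env "$name=$value" "$@" >/dev/null 2>&1; then
        echo "PASS with $name=$value"
    else
        echo "FAIL with $name=$value"
    fi
}
```

For example, `try_one_change TZ UTC ./run_test.sh` tests the hypothesis "the failure depends on the time zone" without touching code or any other setting. `./run_test.sh` is a placeholder for the previously failing command.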

Verify Before Continuing

Did it work? Yes → Phase 4

Didn't work? Form NEW hypothesis

DON'T add more fixes on top

When You Don't Know

Say "I don't understand X"

Don't pretend to know

Ask for help

Research more

Phase 4: Implementation

Fix the root cause, not the symptom:

Create Failing Test Case

Simplest possible reproduction

Automated test if possible

One-off test script if no framework

MUST have before fixing

Use the superpowers:test-driven-development skill for writing proper failing tests
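When no test framework is available, a one-off script can still serve as the failing test. This is a minimal sketch of an assertion helper for such a script. `assert_eq` is a hypothetical name, and the referenced test-driven-development skill may prescribe a different structure.

```shell
# assert_eq EXPECTED ACTUAL LABEL
# Prints a PASS/FAIL line and returns non-zero on mismatch, so the script's
# exit status can gate a commit or a CI job.
assert_eq() {
    if [ "$1" = "$2" ]; then
        echo "PASS: $3"
    else
        echo "FAIL: $3 (expected '$1', got '$2')"
        return 1
    fi
}
```

A reproduction then becomes one line, for example `assert_eq "a-b" "$(slugify "a b")" "spaces become hyphens"`, where `slugify` is a placeholder for the code under investigation. Run it before the fix to confirm it fails for the expected reason, then again afterwards.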

Implement Single Fix

Address the root cause identified

ONE change at a time

No "while I'm here" improvements

No bundled refactoring

Verify Fix

Test passes now?

No other tests broken?

Issue actually resolved?

If Fix Doesn't Work

STOP

Count: How many fixes have you tried?

If < 3: Return to Phase 1, re-analyze with new information

If ≥ 3: STOP and question the architecture (see "If 3+ Fixes Failed: Question Architecture" below)

DON'T attempt Fix #4 without architectural discussion

If 3+ Fixes Failed: Question Architecture

Pattern indicating architectural problem:

Each fix reveals new shared state/coupling/problem in different place

Fixes require "massive refactoring" to implement

Each fix creates new symptoms elsewhere

STOP and question fundamentals:

Is this pattern fundamentally sound?

Are we "sticking with it through sheer inertia"?

Should we refactor architecture vs. continue fixing symptoms?

Discuss with your human partner before attempting more fixes

This is NOT a failed hypothesis - this is a wrong architecture.

Red Flags - STOP and Follow Process

If you catch yourself thinking:

"Quick fix for now, investigate later"

"Just try changing X and see if it works"

"Add multiple changes, run tests"

"Skip the test, I'll manually verify"

"It's probably X, let me fix that"

"I don't fully understand but this might work"

"Pattern says X but I'll adapt it differently"

"Here are the main problems: [lists fixes without investigation]"

Proposing solutions before tracing data flow

"One more fix attempt" (when already tried 2+)

Each fix reveals new problem in different place

ALL of these mean: STOP. Return to Phase 1.

If 3+ fixes failed: Question the architecture (see "If 3+ Fixes Failed: Question Architecture" under Phase 4)

Your Human Partner's Signals You're Doing It Wrong

Watch for these redirections:

"Is that not happening?" - You assumed without verifying

"Will it show us...?" - You should have added evidence gathering

"Stop guessing" - You're proposing fixes without understanding

"Ultra-think this" - Question fundamentals, not just symptoms

"We're stuck?" (frustrated) - Your approach isn't working

When you see these: STOP. Return to Phase 1.

Common Rationalizations

| Excuse | Reality |
|---|---|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |

Quick Reference

| Phase | Key Activities | Success Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, test minimally | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Bug resolved, tests pass |

When Process Reveals "No Root Cause"

If systematic investigation reveals issue is truly environmental, timing-dependent, or external:

You've completed the process

Document what you investigated

Implement appropriate handling (retry, timeout, error message)

Add monitoring/logging for future investigation

But: 95% of "no root cause" cases are incomplete investigation.
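Where investigation really does point to an external or timing-dependent cause, the retry handling mentioned above might look like the following. This is a sketch under stated assumptions: the operation is safe to repeat, and `retry` is a hypothetical helper name. Each failed attempt is logged so the intermittent behavior leaves evidence for later investigation.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS is reached, sleeping between tries.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        # Log every failure to stderr; silent retries hide the underlying problem.
        echo "attempt $attempt/$max failed: $*" >&2
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

The attempt limit is deliberate: an unbounded retry would mask a real defect, which is exactly the kind of symptom fix this process is meant to prevent.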

Supporting Techniques

These techniques are part of systematic debugging and available in this directory:

root-cause-tracing.md - Trace bugs backward through call stack to find original trigger

defense-in-depth.md - Add validation at multiple layers after finding root cause

condition-based-waiting.md - Replace arbitrary timeouts with condition polling
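The contents of condition-based-waiting.md are not reproduced here, so the following is only a generic sketch of the idea named in its description: poll for a condition with a deadline instead of sleeping for a fixed, guessed duration. `wait_for` is a hypothetical helper name.

```shell
# wait_for TIMEOUT_SECONDS CMD [ARGS...]
# Polls CMD once per second until it succeeds or TIMEOUT_SECONDS have elapsed.
wait_for() {
    timeout=$1
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

For example, `wait_for 30 test -f /tmp/server.ready` returns as soon as the file appears and fails loudly if it never does. The path is a placeholder for whatever signals readiness in your system.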

Related skills:

superpowers:test-driven-development - For creating failing test case (Phase 4, Step 1)

superpowers:verification-before-completion - Verify fix worked before claiming success

Real-World Impact

From debugging sessions:

Systematic approach: 15-30 minutes to fix

Random fixes approach: 2-3 hours of thrashing

First-time fix rate: 95% vs 40%

New bugs introduced: Near zero vs common

restructured original into implexa six-component format, made implicit decision logic and multi-layer debugging explicit, added edge cases for timing-dependent issues and environmental problems, preserved all procedural content and author intent while applying tech-bro voice rules.

intent

systematic debugging eliminates random fix-and-check cycles that waste time and introduce new bugs. root cause investigation always comes before any fix attempt. this skill enforces a disciplined four-phase process (investigate, analyze, hypothesize, implement) that forces you to understand what broke and why before touching code. use it for any technical issue: test failures, production bugs, unexpected behavior, performance problems, build failures, integration issues. it's especially critical under time pressure, when fixes seem obvious, or when previous attempts failed.

inputs

context needed:

error messages, stack traces, log output
reproduction steps or failing test case
git history and recent changes (commits, dependencies, config)
system architecture documentation (component boundaries, data flow)
access to affected systems for testing and instrumentation
understanding of your own codebase layout and conventions

external connections:

git (local repo access, recent commit history)
test framework (pytest, jest, mocha, etc.)
logging/monitoring system if available (datadog, cloudwatch, etc.)
version control platform (github, gitlab, etc.) for diff/history
issue tracker if tracking the bug (jira, linear, github issues)

edge cases to anticipate:

issue is timing-dependent or intermittent (race conditions, timeouts)
issue only reproduces in specific environments (staging vs. prod, specific OS)
issue involves external services that are flaky or unavailable
multi-component system with complex data flow between layers
insufficient logging/instrumentation to trace the problem
previous developer deleted context or didn't document assumptions

procedure

phase 1: root cause investigation

step 1: read error messages completely

input: error message, stack trace, warning output
do not skim or skip past any line
note exact line numbers, file paths, error codes
output: annotated error message with key details highlighted

step 2: reproduce the issue consistently

input: issue description, suspected trigger conditions
execute exact reproduction steps multiple times
verify it happens every time or identify conditions when it does/doesn't occur
if not reproducible, gather more data (logs, metrics, user reports) instead of guessing
output: documented reproduction steps with frequency (always, intermittent, one-time)

step 3: check recent changes

input: git history, recent commits, dependency updates, config changes
run git diff and git log to find what changed
review new dependencies, config file changes, environment differences
correlate timing of changes with issue first appearance
output: list of suspect changes with timestamps
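one possible way to produce that list of suspect changes with timestamps is a thin wrapper around `git log`. this is a sketch, assuming the project is a git repository and the issue's first appearance can be bounded by a date. `recent_changes` is a hypothetical helper name.

```shell
# recent_changes SINCE [PATH...]
# Lists commits newer than SINCE (any date git understands, e.g. "3 days ago"),
# one header line per commit (short hash, ISO date, subject) followed by the
# files that commit touched. Optional PATH arguments narrow the search.
recent_changes() {
    since=$1
    shift
    git log --since="$since" --date=iso \
        --pretty=format:'%h %ad %s' --name-only -- "$@"
}
```

for example, `recent_changes "3 days ago" config/` shows only commits that touched the `config/` directory (a placeholder path). follow up on a suspect with `git show <hash>` to read the full diff.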

step 4: gather diagnostic evidence in multi-component systems

input: system architecture, component boundaries, data flow diagram
when system has multiple layers/services (workflow → build → signing, api → service → database), instrument each boundary before analyzing
for each component boundary, add logging that shows: data entering component, data exiting component, environment/config state, errors at that layer
run once to collect evidence showing which exact component fails
output: evidence log showing data flow through each layer with success/failure markers
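the evidence log with success/failure markers could be produced by a small helper like the one below. this is a sketch, assuming each boundary can be probed with a command that exits zero when the layer looks healthy. `check_layer` is a hypothetical name, and the probes in the comments reuse the `IDENTITY` variable from the example hypothesis in step 10.

```shell
# check_layer LABEL CMD [ARGS...]
# Runs one probe and prints a single OK/FAIL line for the evidence log.
# Probe output is discarded so that secrets never end up in the log.
check_layer() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK $label"
    else
        echo "FAIL $label"
    fi
}

# Example probes, one per boundary (adapt to your own layers):
#   check_layer "workflow: IDENTITY set" test -n "${IDENTITY:-}"
#   check_layer "build: IDENTITY exported" sh -c 'env | grep -q "^IDENTITY="'
```

run every probe once, then read the log top to bottom: the first FAIL line marks the boundary to investigate.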

step 5: trace data flow backward from symptom to source

input: failing component identified, error output, code call stack
use root-cause-tracing.md technique if error is deep in call stack
ask: where does the bad value originate? what called this with bad value? keep tracing up the call chain
output: traced chain from symptom back to original source with bad data

phase 2: pattern analysis

step 6: find working examples in same codebase

input: broken code, codebase, similar working implementations
locate similar code that works correctly
compare structure, variable names, logic flow
output: side-by-side comparison of working vs. broken code

step 7: compare against reference implementation

input: official pattern, documentation, library examples
read the reference completely, don't skim
understand every line, all assumptions, all dependencies
output: annotated reference with notes on what your code missed

step 8: identify all differences between working and broken

input: working example, broken code, reference implementation
list every difference, no matter how small
don't assume differences don't matter
categorize: logic differences, config differences, environment differences, dependency differences
output: exhaustive difference list with severity assessment

step 9: understand all dependencies

input: broken component, codebase documentation
identify all other components this needs
map required config, environment variables, external services
verify assumptions about state and initialization
output: dependency map with required state

phase 3: hypothesis and testing

step 10: form single explicit hypothesis

input: root cause investigation evidence, pattern analysis
state clearly in writing: "i think X is the root cause because Y"
be specific (not "something is wrong" but "IDENTITY env var is not propagated from workflow to build step")
output: written hypothesis statement

step 11: test hypothesis with minimal change

input: hypothesis, affected code, test framework
make the smallest possible change to test the theory
change one variable at a time
run the test that previously failed
output: test result (pass or fail)

step 12: verify result before proceeding

input: test result, original failure condition
if test passes: move to phase 4
if test fails: document why this hypothesis was wrong, return to step 10 (form new hypothesis)
do not add more fixes on top of a failed fix
output: confirmation of hypothesis validity or statement of new hypothesis

phase 4: implementation

step 13: create failing test case

input: reproduction steps, test framework
write the simplest possible test that reproduces the issue
use automated test if possible, one-off test script if no framework available
must have a failing test before attempting any fix
reference superpowers:test-driven-development for proper test structure
output: failing test in version control

step 14: implement single fix addressing root cause

input: identified root cause, failing test
implement one change that addresses the root cause
one change at a time, no bundled fixes
resist "while i'm here" improvements and refactoring
output: code change with git commit message explaining why

step 15: verify fix works completely

input: modified code, test suite
run the failing test, confirm it passes
run full test suite, confirm no other tests broke
verify original issue actually resolved in real system
output: green test results, confirmed issue resolution

step 16: count failed fix attempts

input: history of attempted fixes, current result
if this is fix attempt 1 or 2 and it didn't work: return to phase 1 with new information
if this is fix attempt 3 or more and still failing: move to step 17
output: decision to continue debugging or escalate

step 17: question architecture (only after 3+ failed fixes)

input: failed fixes, system design, component interactions
evaluate if pattern is fundamentally sound or stuck due to inertia
identify if fixes are creating new symptoms in different places
identify if all fixes require massive refactoring to implement
discuss architectural concerns with your human partner before attempting fix #4
output: architectural assessment and decision to refactor vs. continue patching

decision points

if issue is not reproducible:

do not guess at fix
gather more data: logs, metrics, user reports, system state
return to phase 1 with better evidence

if issue is in multi-component system:

add instrumentation at each boundary before analyzing
trace data flow through each layer to find breaking point
don't assume "it's probably the API" without evidence

if error is deep in call stack:

use backward tracing technique (root-cause-tracing.md)
don't fix the symptom at the bottom of stack
trace back to find original bad input

if you catch yourself proposing fixes without completing phase 1:

stop immediately
common rationalization: "issue seems simple" or "i'm in a hurry"
return to root cause investigation
systematic debugging is faster than thrashing

if you've already tried 2 fixes and neither worked:

don't attempt fix #3 without returning to phase 1
the issue may have changed or you may have misunderstood root cause
new information from failed attempts must be re-analyzed

if 3 or more fixes have failed:

stop fixing
this signals architectural problem, not hypothesis problem
discuss with human partner before attempting more fixes
may require refactoring instead of patching

if investigation reveals issue is environmental or timing-dependent:

document exactly what you investigated
you've completed the process correctly
implement appropriate handling: retry logic, timeouts, error messages
add monitoring/logging for future investigation
note: 95% of "no root cause found" cases are incomplete investigation

if your human partner uses redirect language:

"is that not happening?" → you assumed without verifying, return to phase 1
"will it show us...?" → you need more evidence gathering
"stop guessing" → you're proposing fixes without understanding
"ultra-think this" → question fundamentals, not just symptoms
"we're stuck?" (frustrated) → your approach isn't working, restart from phase 1

output contract

successful debugging produces:

documented root cause: written explanation of what was broken and why, with evidence
failing test case: automated test that reproduces the issue (or test script), stored in version control
minimal fix: single code change addressing root cause (not symptom), with git commit explaining the change
passing tests: original failing test now passes, all other tests still pass
verified resolution: confirmation that original issue is actually fixed in real system (not just in test)
optional architectural notes: if 3+ fixes failed, document architectural concerns and refactoring discussion

if any of these are missing, the debugging process is incomplete.

outcome signal

you know systematic debugging worked when:

you can clearly explain what was broken and why (you understand root cause, not just symptom)
the failing test passes and stays passing
all other tests continue passing (no new breakage)
the original issue is actually resolved (not masked)
the fix is minimal and doesn't require massive refactoring
you didn't need more than 2-3 focused fix attempts
you can point to the exact change (git commit) that fixed it
team can reproduce and understand your fix without asking questions

anti-signals (you're doing it wrong):

you have a theory but haven't written a failing test
you've tried 3 or more fixes and the issue still occurs
you fixed the symptom but don't understand root cause
you added multiple changes at once and don't know which one fixed it
you skipped phases 1-3 and jumped to implementing
you're proposing a fix without completing root cause investigation
you're under time pressure and saying "just try this quick thing"