Item: experiment-designer
Rating: 7.2
Author: Implexa

Use when planning product experiments, writing testable hypotheses, estimating sample size, prioritizing tests, or interpreting A/B outcomes with practical…

SKILL.md

Experiment Designer

Design, prioritize, and evaluate product experiments with clear hypotheses and defensible decisions.

When To Use

Use this skill for:

A/B and multivariate experiment planning

Hypothesis writing and success criteria definition

Sample size and minimum detectable effect planning

Experiment prioritization with ICE scoring

Reading statistical output for product decisions

Core Workflow

Write hypothesis in If/Then/Because format

If we change [intervention]

Then [metric] will change by [expected direction/magnitude]

Because [behavioral mechanism]

Define metrics before running test

Primary metric: single decision metric

Guardrail metrics: quality/risk protection

Secondary metrics: diagnostics only

Estimate sample size

Baseline conversion or baseline mean

Minimum detectable effect (MDE)

Significance level (alpha) and power

Use:

python3 scripts/sample_size_calculator.py --baseline-rate 0.12 --mde 0.02 --mde-type absolute

Prioritize experiments with ICE

Impact: potential upside

Confidence: evidence quality

Ease: cost/speed/complexity

ICE Score = (Impact * Confidence * Ease) / 10

Launch with stopping rules

Decide fixed sample size or fixed duration in advance

Avoid repeated peeking without proper method

Monitor guardrails continuously

Interpret results

Statistical significance is not business significance

Compare point estimate + confidence interval to decision threshold

Investigate novelty effects and segment heterogeneity

Hypothesis Quality Checklist

 Contains explicit intervention and audience

 Specifies measurable metric change

 States plausible causal reason

 Includes expected minimum effect

 Defines failure condition

Common Experiment Pitfalls

Underpowered tests leading to false negatives

Running too many simultaneous changes without isolation

Changing targeting or implementation mid-test

Stopping early on random spikes

Ignoring sample ratio mismatch and instrumentation drift

Declaring success from p-value without effect-size context

Statistical Interpretation Guardrails

p-value < alpha indicates evidence against null, not guaranteed truth.

Confidence interval crossing zero/no-effect means uncertain directional claim.

Wide intervals imply low precision even when significant.

Use practical significance thresholds tied to business impact.

See:

references/experiment-playbook.md

references/statistics-reference.md

Tooling

scripts/sample_size_calculator.py

Computes required sample size (per variant and total) from:

baseline rate

MDE (absolute or relative)

significance level (alpha)

statistical power

Example:

python3 scripts/sample_size_calculator.py \
  --baseline-rate 0.10 \
  --mde 0.015 \
  --mde-type absolute \
  --alpha 0.05 \
  --power 0.8

experiment-designer

SKILL.md

related skills