Golang Benchmark

Golang benchmarking, profiling, and performance measurement. Use when writing, running, or comparing Go benchmarks, profiling hot paths with pprof, interpret...

SkillRank score ↗

8.2/ 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

golang-benchmark covers the full measurement workflow from writing benchmarks with b.Loop() and memory tracking, running with statistical rigor via -count and -benchtime flags, profiling with pprof/trace, and comparing results with benchstat. includes ci regression detection and production metric correlation.

structure

9.0

trigger phrases

8.0

procedure

9.0

edge cases

7.0

documentation

8.0

original SKILL.md from clawhub

---
name: golang-benchmark
description: "Golang benchmarking, profiling, and performance measurement. Use when writing, running, or comparing Go benchmarks, profiling hot paths with pprof, interpreting CPU/memory/trace profiles, analyzing results with benchstat, setting up CI benchmark regression detection, or investigating production performance with Prometheus runtime metrics. Also use when the developer needs deep analysis on a specific performance indicator - this skill provides the measurement methodology, while `samber/cc-skills-golang@golang-performance` provides the optimization patterns."
user-invocable: true
license: MIT
compatibility: Designed for Claude Code or similar AI coding agents, and for projects using Golang.
metadata:
  author: samber
  version: "1.2.4"
  openclaw:
    emoji: "📊"
    homepage: https://github.com/samber/cc-skills-golang
    requires:
      bins:
        - go
        - benchstat
    install:
      - kind: go
        package: golang.org/x/perf/cmd/benchstat@latest
        bins: [benchstat]
allowed-tools: Read Edit Write Glob Grep Bash(go:*) Bash(golangci-lint:*) Bash(git:*) Agent WebFetch Bash(benchstat:*) Bash(benchdiff:*) Bash(cob:*) Bash(gobenchdata:*) Bash(curl:*) mcp__context7__resolve-library-id mcp__context7__query-docs WebSearch AskUserQuestion
---

**Persona:** You are a Go performance measurement engineer. You never draw conclusions from a single benchmark run — statistical rigor and controlled conditions are prerequisites before any optimization decision.

**Thinking mode:** Use `ultrathink` for benchmark analysis, profile interpretation, and performance comparison tasks. Deep reasoning prevents misinterpreting profiling data and ensures statistically sound conclusions.

**Dependencies:**
- benchstat: `go install golang.org/x/perf/cmd/benchstat@latest`

# Go Benchmarking & Performance Measurement

Performance improvement does not exist without measurement: if you can measure it, you can improve it.

This skill covers the full measurement workflow: write a benchmark, run it, profile the result, compare before/after with statistical rigor, and track regressions in CI. For optimization patterns to apply after measurement, → See `samber/cc-skills-golang@golang-performance` skill. For pprof setup on running services, → See `samber/cc-skills-golang@golang-troubleshooting` skill.

## Writing Benchmarks

### `b.Loop()` (Go 1.24+) — preferred

For Go 1.24+, prefer `b.Loop()` for new benchmarks. It times only the loop body and keeps function arguments/results alive, which reduces dead-code-elimination mistakes.

```go
func BenchmarkParse(b *testing.B) {
    data := loadFixture("large.json") // setup — excluded from timing
    for b.Loop() {
        Parse(data)  // compiler cannot eliminate this call
    }
}
```

Legacy `b.N` loops still compile and are fine to keep when preserving existing benchmarks or supporting Go <1.24. They are easier to get wrong: setup may need `b.ResetTimer()`, and results may need a sink if the compiler can eliminate the work. Go 1.26 fixed an earlier `b.Loop()` inlining limitation — benchmarks on 1.24–1.25 already benefit from `b.Loop()` but may miss inlining optimizations that 1.26 delivers.
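
A minimal sketch of the legacy pattern described above, for codebases that must support Go <1.24. `countFields` and its input are hypothetical stand-ins for the code under test. `testing.Benchmark` is used only so the sketch runs as a plain program; in a real project the function would be named `BenchmarkCountFields` and live in a `_test.go` file.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink is package-level so the compiler cannot prove the result is unused.
var sink int

// countFields is a hypothetical function under test.
func countFields(s string) int {
	return len(strings.Fields(s))
}

// benchmarkCountFields uses the legacy b.N loop.
func benchmarkCountFields(b *testing.B) {
	input := strings.Repeat("alpha beta gamma ", 1000) // setup work
	b.ResetTimer()                                      // exclude the setup above from the timing
	var n int
	for i := 0; i < b.N; i++ {
		n = countFields(input)
	}
	sink = n // publish the result so the loop body is not dead code
}

func main() {
	// testing.Benchmark runs a benchmark function outside of "go test".
	res := testing.Benchmark(benchmarkCountFields)
	fmt.Println(res.String())
}
```

With `b.Loop()` neither the `b.ResetTimer()` call nor the sink is needed, which is the main reason to prefer it on Go 1.24+.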

### Memory tracking

```go
func BenchmarkAlloc(b *testing.B) {
    b.ReportAllocs() // or run with -benchmem flag
    var sink []byte
    for b.Loop() {
        sink = make([]byte, 1024)
    }
    _ = sink
}
```

`b.ReportMetric()` adds custom metrics (e.g., throughput):

```go
b.ReportMetric(float64(totalBytes)/b.Elapsed().Seconds(), "bytes/s") // call once after the loop; b.Elapsed() (Go 1.20+) returns the measured time so far
```
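
For context, here is a sketch of where that call could sit in a complete benchmark. The payload size and the `benchmarkCopy` name are illustrative assumptions; the point is that the metric is reported once, after the loop has finished. As above, `testing.Benchmark` is used only to make the sketch runnable as a plain program.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// benchmarkCopy reports throughput as a custom metric.
// The 64 KiB payload is an arbitrary choice for illustration.
func benchmarkCopy(b *testing.B) {
	src := bytes.Repeat([]byte("x"), 64*1024)
	dst := make([]byte, len(src))
	var totalBytes int64
	for b.Loop() { // requires Go 1.24+
		totalBytes += int64(copy(dst, src))
	}
	// Report once, after the loop, using the measured elapsed time.
	b.ReportMetric(float64(totalBytes)/b.Elapsed().Seconds(), "bytes/s")
}

func main() {
	res := testing.Benchmark(benchmarkCopy)
	// Custom metrics are exposed through BenchmarkResult.Extra.
	fmt.Printf("%.0f bytes/s\n", res.Extra["bytes/s"])
}
```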

### Sub-benchmarks and table-driven

```go
func BenchmarkEncode(b *testing.B) {
    for _, size := range []int{64, 256, 4096} {
        b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
            data := make([]byte, size)
            for b.Loop() {
                Encode(data)
            }
        })
    }
}
```

## Running Benchmarks

```bash
go test -bench=BenchmarkEncode -benchmem -count=10 ./pkg/... | tee bench.txt
```

| Flag                   | Purpose                                   |
| ---------------------- | ----------------------------------------- |
| `-bench=.`             | Run all benchmarks (regexp filter)        |
| `-benchmem`            | Report allocations (B/op, allocs/op)      |
| `-count=10`            | Run 10 times for statistical significance |
| `-benchtime=3s`        | Minimum time per benchmark (default 1s)   |
| `-cpu=1,2,4`           | Run with different GOMAXPROCS values      |
| `-cpuprofile=cpu.prof` | Write CPU profile                         |
| `-memprofile=mem.prof` | Write memory profile                      |
| `-trace=trace.out`     | Write execution trace                     |

**Output format:** `BenchmarkEncode/size=64-8  5000000  230.5 ns/op  128 B/op  2 allocs/op` — the `-8` suffix is GOMAXPROCS, `ns/op` is time per operation, `B/op` is bytes allocated per op, `allocs/op` is heap allocation count per op.
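
To make the row layout concrete, the sketch below splits one result line into its parts. `BenchLine` and `parseBenchLine` are hypothetical helpers written for this illustration, not part of the Go toolchain, and the parser assumes the simple layout shown above (name, iteration count, then value/unit pairs).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// BenchLine holds the fields of one "go test -bench" result row.
type BenchLine struct {
	Name       string             // e.g. "BenchmarkEncode/size=64"
	Procs      int                // GOMAXPROCS suffix, e.g. 8 (1 when the suffix is absent)
	Iterations int64              // number of iterations that were run
	Metrics    map[string]float64 // unit -> value, e.g. "ns/op" -> 230.5
}

// parseBenchLine parses a single result row.
func parseBenchLine(line string) (BenchLine, error) {
	fields := strings.Fields(line)
	// Expect: name, iterations, then one or more (value, unit) pairs.
	if len(fields) < 4 || len(fields)%2 != 0 {
		return BenchLine{}, fmt.Errorf("unexpected field count %d", len(fields))
	}
	bl := BenchLine{Name: fields[0], Procs: 1, Metrics: map[string]float64{}}
	// Treat a trailing "-<int>" as the GOMAXPROCS suffix.
	if i := strings.LastIndex(fields[0], "-"); i >= 0 {
		if p, err := strconv.Atoi(fields[0][i+1:]); err == nil {
			bl.Name, bl.Procs = fields[0][:i], p
		}
	}
	n, err := strconv.ParseInt(fields[1], 10, 64)
	if err != nil {
		return BenchLine{}, fmt.Errorf("iterations: %w", err)
	}
	bl.Iterations = n
	for i := 2; i+1 < len(fields); i += 2 {
		v, err := strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return BenchLine{}, fmt.Errorf("metric %q: %w", fields[i+1], err)
		}
		bl.Metrics[fields[i+1]] = v
	}
	return bl, nil
}

func main() {
	line := "BenchmarkEncode/size=64-8  5000000  230.5 ns/op  128 B/op  2 allocs/op"
	bl, err := parseBenchLine(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s procs=%d iters=%d ns/op=%v\n", bl.Name, bl.Procs, bl.Iterations, bl.Metrics["ns/op"])
}
```

For real comparisons, feed the raw output to benchstat rather than parsing it by hand; the sketch is only meant to show what each column means.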

## Documenting Results in Commits

Paste benchstat output in the commit body when the change has a measurable performance impact. This documents _why_ an optimization was made, prevents future readers from reverting it, and lets reviewers verify the claim without re-running benchmarks.

Commit format:

```
perf(parser): reduce Parse allocations 50% with sync.Pool

Replace per-call []byte allocation with a pooled buffer.

goos: linux / goarch: amd64 / cpu: AMD Ryzen 9 5950X
          │    old     │              new               │
          │  sec/op    │  sec/op     vs base            │
Parse-32    4.592µ ± 2%  3.041µ ± 1%  -33.78% (p=0.000 n=10)

          │   old    │             new              │
          │   B/op   │   B/op     vs base           │
Parse-32   1.024Ki ± 0%  0.512Ki ± 0%  -50.00% (p=0.000 n=10)

          │    old    │             new             │
          │ allocs/op │ allocs/op  vs base          │
Parse-32   12.00 ± 0%   6.000 ± 0%  -50.00% (p=0.000 n=10)
```

**Rules:**

- Only include benchmarks directly affected by the change — strip unrelated rows
- Never paste results with `~` (no statistical significance) — the improvement cannot be claimed
- Include the hardware context line (`goos/goarch/cpu`) so results are reproducible
- Use `perf(scope):` commit type for performance-only changes

## Profiling from Benchmarks

Generate profiles directly from benchmark runs — no HTTP server needed:

```bash
# CPU profile
go test -bench=BenchmarkParse -cpuprofile=cpu.prof ./pkg/parser
go tool pprof cpu.prof

# Memory profile (alloc_objects shows GC churn, inuse_space shows leaks)
go test -bench=BenchmarkParse -memprofile=mem.prof ./pkg/parser
go tool pprof -alloc_objects mem.prof

# Execution trace
go test -bench=BenchmarkParse -trace=trace.out ./pkg/parser
go tool trace trace.out
```

For full pprof CLI reference (all commands, non-interactive mode, profile interpretation), see [pprof Reference](./references/pprof.md). For execution trace interpretation, see [Trace Reference](./references/trace.md). For statistical comparison, see [benchstat Reference](./references/benchstat.md).

## Reference Files

- **[pprof Reference](./references/pprof.md)** — Interactive and non-interactive analysis of CPU, memory, and goroutine profiles. Full CLI commands, profile types (CPU vs alloc_objects vs inuse_space), web UI navigation, and interpretation patterns. Use this to dive deep into _where_ time and memory are being spent in your code.

- **[benchstat Reference](./references/benchstat.md)** — Statistical comparison of benchmark runs with rigorous confidence intervals and p-value tests. Covers output reading, filtering old benchmarks, interleaving results for visual clarity, and regression detection. Use this when you need to prove a change made a meaningful performance difference, not just a lucky run.

- **[Trace Reference](./references/trace.md)** — Execution tracer for understanding _when_ and _why_ code runs. Visualizes goroutine scheduling, garbage collection phases, network blocking, and custom span annotations. Use this when pprof (which shows _where_ CPU goes) isn't enough — you need to see the timeline of what happened.

- **[Diagnostic Tools](./references/tools.md)** — Quick reference for ancillary tools: fieldalignment (struct padding waste), GODEBUG (runtime logging flags), fgprof (wall-clock profiles covering both on-CPU and off-CPU time), race detector (concurrency bugs), and others. Use this when you have a specific symptom and need a focused diagnostic — don't reach for pprof if a simpler tool already answers your question.

- **[Compiler Analysis](./references/compiler-analysis.md)** — Low-level compiler optimization insights: escape analysis (when values move to the heap), inlining decisions (which function calls are eliminated), SSA dump (intermediate representation), and assembly output. Use this when benchmarks show allocations you didn't expect, or when you want to verify the compiler did what you intended.

- **[CI Regression Detection](./references/ci-regression.md)** — Automated performance regression gating in CI pipelines. Covers three tools (benchdiff for quick PR comparisons, cob for strict threshold-based gating, gobenchdata for long-term trend dashboards), noisy neighbor mitigation strategies (why cloud CI benchmarks vary 5-10% even on quiet machines), and self-hosted runner tuning to make benchmarks reproducible. Use this when you want to ensure pull requests don't silently slow down your codebase — detecting regressions early prevents shipping performance debt.

- **[Investigation Session](./references/investigation-session.md)** — Production performance troubleshooting workflow combining Prometheus runtime metrics (heap size, GC frequency, goroutine counts), PromQL queries to correlate metrics with code changes, runtime configuration flags (GODEBUG env vars to enable GC logging), and cost warnings (when you're hitting performance tax). Use this when local benchmarks look good but real traffic behaves differently.

- **[Prometheus Go Metrics Reference](./references/prometheus-go-metrics.md)** — Complete listing of Go runtime metrics actually exposed as Prometheus metrics by `prometheus/client_golang`. Covers 30 default metrics, 40+ optional metrics (Go 1.17+), process metrics, and common PromQL queries. Distinguishes between `runtime/metrics` (Go internal data) and Prometheus metrics (what you scrape from `/metrics`). Use this when setting up monitoring dashboards or writing PromQL queries for production alerts.

## Cross-References

- → See `samber/cc-skills-golang@golang-performance` skill for optimization patterns to apply after measuring ("if X bottleneck, apply Y")
- → See `samber/cc-skills-golang@golang-troubleshooting` skill for pprof setup on running services (enable, secure, capture), Delve debugger, GODEBUG flags, root cause methodology
- → See `samber/cc-skills-golang@golang-observability` skill for everyday always-on monitoring, continuous profiling (Pyroscope), distributed tracing (OpenTelemetry)
- → See `samber/cc-skills-golang@golang-testing` skill for general testing practices
- → See `samber/cc-skills@promql-cli` skill for querying Prometheus runtime metrics in production to validate benchmark findings


restructured original content into implexa's six-component format, added explicit decision branches for go version selection and ci hardware concerns, documented inputs including prometheus setup, clarified edge cases like statistical insignificance and noisy cloud runners, preserved author attribution and all original examples.

Intent

This skill covers the full performance measurement workflow in Go: write benchmarks, run them with statistical rigor, profile CPU/memory/goroutine behavior with pprof, compare results before/after using benchstat, detect regressions in CI pipelines, and correlate production metrics with code changes using Prometheus. Use this skill when you need to measure where time and memory are spent, prove that an optimization actually works, or investigate why production performance diverges from local benchmarks. For optimization patterns to apply after measurement, see samber/cc-skills-golang@golang-performance. For pprof setup on running services, see samber/cc-skills-golang@golang-troubleshooting.

Inputs

Go toolchain: go binary (1.24+ preferred for b.Loop(), but 1.20+ supported)
benchstat binary: golang.org/x/perf/cmd/benchstat@latest for statistical comparison
Optional profiling tools: graphviz (for pprof web UI), curl (if pulling metrics from live services)
CI environment: GitHub Actions, GitLab CI, or self-hosted runners with stable hardware for regression detection
Prometheus setup (production only): Go runtime metrics endpoint at /metrics with prometheus/client_golang wired into your HTTP server
Hardware context: CPU model, core count, OS, and architecture (captured in benchmark output; affects profile interpretation)

Procedure

Step 1: Write a Benchmark

Input: Source file with target function to measure (e.g., pkg/parser/parser.go). Output: Test file with benchmark function (e.g., pkg/parser/parser_test.go).

Create a function named BenchmarkXxx in a *_test.go file.
For Go 1.24+, use b.Loop() to wrap the code being measured. Setup code (loading fixtures, allocating buffers) goes outside the loop.
For Go <1.24, use the for i := 0; i < b.N; i++ pattern. Call b.ResetTimer() after expensive setup if needed.
To measure allocations, add b.ReportAllocs() in the function body or use the -benchmem flag at runtime.
For custom metrics (e.g., throughput), call b.ReportMetric(value, unit) once after the loop, so the reported value covers the whole run and the call itself is not timed.
For parameterized benchmarks, use b.Run(name, func(b *testing.B) { ... }) to create sub-benchmarks with different inputs.

Example:

func BenchmarkParse(b *testing.B) {
    data := loadFixture("large.json")
    b.ReportAllocs()
    for b.Loop() {
        Parse(data)
    }
}

func BenchmarkParseVariations(b *testing.B) {
    for _, size := range []int{64, 256, 4096} {
        b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
            data := make([]byte, size)
            for b.Loop() {
                Parse(data)
            }
        })
    }
}

Step 2: Run Benchmarks Locally

Input: Benchmark function(s) from Step 1, target package path. Output: Plain text benchmark results saved to file (e.g., bench.txt).

Run go test -bench=<BenchmarkName or .> -benchmem -count=10 ./pkg/... | tee bench.txt to execute all matching benchmarks 10 times.
Use -benchtime=3s (or higher) if the default 1 second per benchmark is too short for statistical significance.
Use -cpu=1,2,4,8 to measure how the code scales with GOMAXPROCS if concurrency is relevant.
Run on a quiet machine: close other apps, disable frequency scaling if possible, and verify the CPU governor is set to performance mode (on Linux, where the cpufreq driver exposes it: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor should print performance).
Capture the output to a file for later comparison with benchstat.

Example output:

BenchmarkParse-8                  5000000       230.5 ns/op        128 B/op        2 allocs/op
BenchmarkParseVariations/size=64-8  10000000       105.2 ns/op       64 B/op         1 allocs/op
BenchmarkParseVariations/size=256-8  2000000       598.3 ns/op       256 B/op        1 allocs/op

Step 3: Profile Benchmarks (Optional but Recommended)

Input: Benchmark function(s), target package path, desired profile type (CPU, memory, trace). Output: Profile file (.prof or .out extension).

For CPU profiling, run go test -bench=<BenchmarkName> -cpuprofile=cpu.prof ./pkg/parser. Pass a single package path: go test rejects profile flags when the pattern matches more than one package.
For memory profiling (heap allocations), run go test -bench=<BenchmarkName> -memprofile=mem.prof ./pkg/parser.
For execution tracing (goroutine scheduling, GC events), run go test -bench=<BenchmarkName> -trace=trace.out ./pkg/parser.
Open the profile interactively with go tool pprof <profile-file> and use pprof commands (top, list, graph, web) to identify hot paths.
For memory profiles, distinguish between -alloc_objects (count of all objects allocated over the run, including those already freed; shows GC churn) and -inuse_space (bytes still live when the profile was written; helps detect leaks).
For traces, run go tool trace trace.out to open the web UI and visualize goroutine execution timeline, GC pauses, and blocking events.

Example:

go test -bench=BenchmarkParse -cpuprofile=cpu.prof ./pkg/parser
go tool pprof cpu.prof
# (in pprof) top10  # shows top 10 functions by CPU time
# (in pprof) list Parse  # shows annotated source of Parse function
# (in pprof) web  # generates a call graph (requires graphviz)

Step 4: Compare Results with benchstat

Input: Two or more benchmark result files (e.g., old.txt, new.txt). Output: Statistical comparison table showing p-value, confidence interval, and percentage change.

Run benchmarks on the old version and save to old.txt.
Apply your optimization, run benchmarks again, save to new.txt.
Run benchstat old.txt new.txt to compute confidence intervals and p-values.
Read the output: if ~ appears in place of the percentage, the change is not statistically significant (noise). A percentage followed by a low p-value such as (p=0.000 n=10) means the result is significant at alpha=0.05.
Only commit optimizations where benchstat shows p < 0.05 (not ~).
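
The arithmetic behind the delta column can be sketched as follows. This is only the percentage calculation on summary values; benchstat itself additionally computes confidence intervals and a significance test, which this sketch does not attempt. The sample slice is hypothetical, and the 4.592 and 3.041 figures are the sec/op values from the commit example in the original SKILL.md.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the middle value of xs (or the mean of the two middle
// values for an even count). It returns 0 for an empty slice.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...) // copy so the caller's slice is not reordered
	sort.Float64s(s)
	n := len(s)
	switch {
	case n == 0:
		return 0
	case n%2 == 1:
		return s[n/2]
	default:
		return (s[n/2-1] + s[n/2]) / 2
	}
}

// percentDelta returns the relative change from oldV to newV in percent.
// Negative means the new value is smaller (faster, or fewer allocations).
func percentDelta(oldV, newV float64) float64 {
	return (newV - oldV) / oldV * 100
}

func main() {
	// Hypothetical per-run samples in microseconds.
	runs := []float64{4.61, 4.58, 4.59, 4.60, 4.57}
	fmt.Printf("median of samples: %.2f\n", median(runs))

	// Summary values from the commit example: 4.592µs -> 3.041µs.
	fmt.Printf("%+.2f%%\n", percentDelta(4.592, 3.041)) // prints -33.78%
}
```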

Example output:

name                  old ns/op   new ns/op   delta
BenchmarkParse-8       230.5 ± 2%  152.3 ± 1%  -33.93% (p=0.000 n=10)

name                  old B/op    new B/op    delta
BenchmarkParse-8       128 ± 0%    64 ± 0%    -50.00% (p=0.000 n=10)

name                  old allocs  new allocs  delta
BenchmarkParse-8       2.00 ± 0%   1.00 ± 0%  -50.00% (p=0.000 n=10)

Step 5: Document Results in Commit

Input: benchstat output, commit message, code change. Output: Commit with performance impact documented.

Paste the benchstat output (sec/op, B/op, allocs/op) into the commit body.
Include the hardware context line at the top (e.g., goos: linux / goarch: amd64 / cpu: AMD Ryzen 9 5950X).
Strip rows that are unrelated to the change (only include affected benchmarks).
Only include results where benchstat shows statistical significance (no ~ symbol).
Use perf(scope): as the commit type for performance-only changes.

Example commit:

perf(parser): reduce Parse allocations 50% with sync.Pool

Replace per-call []byte allocation with a pooled buffer.

goos: linux / goarch: amd64 / cpu: AMD Ryzen 9 5950X
          │    old     │              new               │
          │  sec/op    │  sec/op     vs base            │
Parse-32    4.592µ ± 2%  3.041µ ± 1%  -33.78% (p=0.000 n=10)

          │   old    │             new              │
          │   B/op   │   B/op     vs base           │
Parse-32   1.024Ki ± 0%  0.512Ki ± 0%  -50.00% (p=0.000 n=10)

Step 6: Set Up CI Regression Detection (Optional)

Input: Benchmarks from previous steps, CI pipeline configuration (GitHub Actions, GitLab CI, etc.), baseline benchmark results. Output: CI job that gates pull requests if performance regresses beyond threshold.

Store baseline benchmark results (e.g., benchmarks/main.txt) in the repository.
In your CI pipeline, run benchmarks on the PR branch and save to a temp file.
Use benchdiff for quick visual PR comparisons (shows delta without strict thresholds).
Use cob (continuous benchmarking) for strict threshold-based gating (e.g., fail if latency increases >5%).
Use gobenchdata to append results to a JSON file and build long-term trend dashboards.
Account for noisy neighbor effects: cloud CI runners vary 5-10% even on quiet machines. Set thresholds conservatively (e.g., >10% regression to fail).
For reproducible benchmarks, use self-hosted runners or tune cloud runners: disable CPU frequency scaling, pin to isolated cores, and warm up the CPU before benchmarking.

Example GitHub Actions workflow:

- name: Run benchmarks
  run: go test -bench=. -benchmem -count=5 ./... | tee new.txt

- name: Compare with baseline
  run: benchstat benchmarks/main.txt new.txt | tee benchstat.txt

- name: Fail if regression >= 10%
  run: |
    # A regression is a positive delta; two or more integer digits means >= 10%.
    if grep -E '\+[0-9]{2,}\.[0-9]+%' benchstat.txt; then
      echo "Performance regression detected"
      exit 1
    fi

Step 7: Investigate Production Performance (Optional)

Input: Prometheus endpoint at /metrics, PromQL queries, runtime metrics. Output: Correlation between code changes and production behavior.

Ensure Go runtime metrics are exposed via prometheus/client_golang in your HTTP server.
Query Prometheus for key metrics: go_gc_duration_seconds_sum (cumulative GC pause time), go_goroutines (active goroutines), go_memstats_heap_alloc_bytes (heap bytes in use).
Use PromQL to correlate metric spikes with code deployments: rate(go_gc_duration_seconds_count[5m]) tracks GC frequency over time, and rate(go_gc_duration_seconds_sum[5m]) tracks time spent in GC pauses.
Enable GODEBUG flags on production (e.g., GODEBUG=gctrace=1) to log GC events to stderr and correlate with latency increases.
Cross-reference local benchmarks with production metrics to identify why production differs (e.g., larger dataset sizes, different CPU topology, different garbage collection pressure).

Example PromQL:

# GC frequency (higher = more pressure on heap)
rate(go_gc_duration_seconds_count[5m])

# Heap size (sustained growth that never falls back after GC = possible memory leak)
go_memstats_heap_alloc_bytes

# Goroutine count (spike = possible goroutine leak)
go_goroutines

Decision Points

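If the function call is trivially cheap (nanoseconds), then increase -benchtime to 5-10 seconds for statistical stability, and rely on b.Loop() (or a result sink with b.N loops) to keep the compiler from eliminating the measured work; a longer -benchtime does not prevent compiler optimizations.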

If using Go <1.24, then use for i := 0; i < b.N; i++ loops and add b.ResetTimer() after expensive setup; otherwise use b.Loop().

If the benchmark result shows allocation where none is expected, then run with go test -gcflags="-m" to see escape analysis output and verify the compiler isn't moving stack values to the heap unnecessarily.
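
A small sketch of the kind of difference escape analysis makes, measured with testing.AllocsPerRun. The point type and the two constructors are hypothetical; the package-level sinks are there so the compiler must keep the results, and the pointer stored in a global is what forces the heap allocation.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// Package-level sinks keep results alive across calls.
var (
	ptrSink *point
	intSink int
)

// newPoint returns a pointer. Because the caller stores it in a global,
// the struct escapes and must be heap-allocated.
func newPoint(x, y int) *point { return &point{x, y} }

// makePoint returns the struct by value, so it can stay on the stack.
func makePoint(x, y int) point { return point{x, y} }

// measure reports average allocations per call for both variants.
func measure() (ptrAllocs, valAllocs float64) {
	ptrAllocs = testing.AllocsPerRun(1000, func() {
		ptrSink = newPoint(1, 2)
	})
	valAllocs = testing.AllocsPerRun(1000, func() {
		p := makePoint(1, 2)
		intSink = p.x + p.y
	})
	return ptrAllocs, valAllocs
}

func main() {
	ptrAllocs, valAllocs := measure()
	fmt.Printf("pointer: %.0f allocs/op, value: %.0f allocs/op\n", ptrAllocs, valAllocs)
}
```

Building the same file with the -gcflags="-m" option mentioned above should print the compiler's escape decisions for newPoint, which is the way to confirm why an allocation appears.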

If benchstat output shows ~ (not statistically significant), then do not claim the optimization works; re-run with higher -count (e.g., -count=20) or longer -benchtime to reduce noise before drawing conclusions.

If profiling reveals a hot path in third-party code, then before optimizing, check the third-party library's GitHub issues or file a bug (the maintainer may already have a fix in development).

If running benchmarks on shared cloud CI hardware (AWS, GCP, Azure), then expect 5-10% variance between runs even on "quiet" machines; set regression thresholds to >10% to avoid false positives.

If production metrics diverge from local benchmarks (e.g., GC frequency is higher in prod), then enable GODEBUG=gctrace=1 on a canary instance to log GC events and correlate with code changes, heap size, or workload differences.

If tracing reveals goroutine blocking on locks or I/O, then use the golang-troubleshooting skill for Delve debugger setup to inspect goroutine state at runtime.

Output Contract

Benchmark runs produce a plain-text file (e.g., bench.txt) with rows in the format:

BenchmarkName-<GOMAXPROCS> <iterations> <time-per-op> <memory-per-op> <allocs-per-op>

Profile files are binary outputs:

CPU profile: cpu.prof (input to go tool pprof)
Memory profile: mem.prof (input to go tool pprof)
Execution trace: trace.out (input to go tool trace)

benchstat output is a plain-text table with three columns per metric (old, new, delta with p-value and sample size):

name            old ns/op   new ns/op   delta
BenchmarkXxx-8  100 ± 5%    90 ± 3%     -10.00% (p=0.000 n=10)

Commit message includes benchstat table in body with hardware context (goos/goarch/cpu line) and only statistically significant results (no ~ rows).

CI regression job outputs a pass/fail status and optionally a benchstat diff URL or artifact for review.

Outcome Signal

Benchmarks run successfully when go test -bench=. -benchmem ./... completes without errors and outputs a readable table with ns/op, B/op, and allocs/op columns.
Profiling works when go tool pprof <profile-file> opens interactively and top10 shows function names and CPU/memory percentages (not zero values).
Statistical comparison is valid when benchstat output shows p-values (e.g., p=0.000) for significant results and ~ for noise; only then is the optimization claim defensible.
Commit documentation is complete when the commit message includes the benchstat table, hardware context line, and only rows where p < 0.05 (statistically significant).
CI regression gating is active when a pull request is rejected or flagged for review if benchmark latency or memory increases beyond the configured threshold (e.g., >10%).
Production alignment is confirmed when Prometheus metrics (GC frequency, heap size, goroutine count) track similarly to local benchmark behavior after a code change is deployed, or divergence is explained by workload differences (larger data, higher concurrency).