---
name: golang-benchmark
description: "Golang benchmarking, profiling, and performance measurement. Use when writing, running, or comparing Go benchmarks, profiling hot paths with pprof, interpreting CPU/memory/trace profiles, analyzing results with benchstat, setting up CI benchmark regression detection, or investigating production performance with Prometheus runtime metrics. Also use when the developer needs deep analysis on a specific performance indicator - this skill provides the measurement methodology, while `samber/cc-skills-golang@golang-performance` provides the optimization patterns."
user-invocable: true
license: MIT
compatibility: Designed for Claude Code or similar AI coding agents, and for projects using Golang.
metadata:
  author: samber
  version: "1.2.3"
  openclaw:
    emoji: "📊"
    homepage: https://github.com/samber/cc-skills-golang
    requires:
      bins:
        - go
        - benchstat
    install:
      - kind: go
        package: golang.org/x/perf/cmd/benchstat@latest
        bins: [benchstat]
allowed-tools: Read Edit Write Glob Grep Bash(go:*) Bash(golangci-lint:*) Bash(git:*) Agent WebFetch Bash(benchstat:*) Bash(benchdiff:*) Bash(cob:*) Bash(gobenchdata:*) Bash(curl:*) mcp__context7__resolve-library-id mcp__context7__query-docs WebSearch AskUserQuestion
---
**Persona:** You are a Go performance measurement engineer. You never draw conclusions from a single benchmark run — statistical rigor and controlled conditions are prerequisites before any optimization decision.
**Thinking mode:** Use `ultrathink` for benchmark analysis, profile interpretation, and performance comparison tasks. Deep reasoning prevents misinterpreting profiling data and ensures statistically sound conclusions.
# Go Benchmarking & Performance Measurement
Performance improvement does not exist without measurement: if you can measure it, you can improve it.
This skill covers the full measurement workflow: write a benchmark, run it, profile the result, compare before/after with statistical rigor, and track regressions in CI. For optimization patterns to apply after measurement, → See `samber/cc-skills-golang@golang-performance` skill. For pprof setup on running services, → See `samber/cc-skills-golang@golang-troubleshooting` skill.
## Writing Benchmarks
### `b.Loop()` (Go 1.24+) — preferred
For Go 1.24+, prefer `b.Loop()` for new benchmarks. It times only the loop body and keeps function arguments/results alive, which reduces dead-code-elimination mistakes.
```go
func BenchmarkParse(b *testing.B) {
	data := loadFixture("large.json") // setup, excluded from timing
	for b.Loop() {
		Parse(data) // compiler cannot eliminate this call
	}
}
```
Legacy `b.N` loops still compile and are fine to keep when preserving existing benchmarks or supporting Go <1.24. They are easier to get wrong: setup may need `b.ResetTimer()`, and results may need a sink if the compiler can eliminate the work. Go 1.26 fixed an earlier `b.Loop()` inlining limitation — benchmarks on 1.24–1.25 already benefit from `b.Loop()` but may miss inlining optimizations that 1.26 delivers.
### Memory tracking
```go
func BenchmarkAlloc(b *testing.B) {
	b.ReportAllocs() // or run with -benchmem flag
	var sink []byte
	for b.Loop() {
		sink = make([]byte, 1024)
	}
	_ = sink
}
```
`b.ReportMetric()` adds custom metrics (e.g., throughput):
```go
b.ReportMetric(float64(totalBytes)/b.Elapsed().Seconds(), "bytes/s") // call once after the loop; b.Elapsed() returns the measured benchmark time
```
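A fuller sketch of a custom throughput metric, under stated assumptions: the 64 KiB payload and the SHA-256 workload are arbitrary stand-ins, `benchmarkHashThroughput` is a hypothetical name, and Go 1.24+ is assumed for `b.Loop()`. The metric is reported once, after the loop has finished.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// benchmarkHashThroughput reports a custom "bytes/s" metric next to ns/op.
func benchmarkHashThroughput(b *testing.B) {
	payload := make([]byte, 64*1024) // hypothetical 64 KiB input
	var totalBytes int64
	for b.Loop() {
		sum := sha256.Sum256(payload)
		_ = sum
		totalBytes += int64(len(payload))
	}
	// Report after the loop: b.Elapsed() now holds the total measured time.
	b.ReportMetric(float64(totalBytes)/b.Elapsed().Seconds(), "bytes/s")
}

func main() {
	res := testing.Benchmark(benchmarkHashThroughput)
	// Custom metrics are exposed through BenchmarkResult.Extra.
	fmt.Printf("%.0f bytes/s over %d iterations\n", res.Extra["bytes/s"], res.N)
}
```

For plain byte throughput, `b.SetBytes` is a simpler alternative that makes the framework print MB/s itself; `b.ReportMetric` is the general tool when the unit is something else.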
### Sub-benchmarks and table-driven
```go
func BenchmarkEncode(b *testing.B) {
	for _, size := range []int{64, 256, 4096} {
		b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
			data := make([]byte, size)
			for b.Loop() {
				Encode(data)
			}
		})
	}
}
```
## Running Benchmarks
```bash
go test -bench=BenchmarkEncode -benchmem -count=10 ./pkg/... | tee bench.txt
```
| Flag | Purpose |
| ---------------------- | ----------------------------------------- |
| `-bench=.` | Run all benchmarks (regexp filter) |
| `-benchmem` | Report allocations (B/op, allocs/op) |
| `-count=10` | Run 10 times for statistical significance |
| `-benchtime=3s` | Minimum time per benchmark (default 1s) |
| `-cpu=1,2,4` | Run with different GOMAXPROCS values |
| `-cpuprofile=cpu.prof` | Write CPU profile |
| `-memprofile=mem.prof` | Write memory profile |
| `-trace=trace.out` | Write execution trace |
**Output format:** `BenchmarkEncode/size=64-8 5000000 230.5 ns/op 128 B/op 2 allocs/op` — the `-8` suffix is GOMAXPROCS, `ns/op` is time per operation, `B/op` is bytes allocated per op, `allocs/op` is heap allocation count per op.
## Documenting Results in Commits
Paste benchstat output in the commit body when the change has a measurable performance impact. This documents _why_ an optimization was made, prevents future readers from reverting it, and lets reviewers verify the claim without re-running benchmarks.
Commit format:
```
perf(parser): reduce Parse allocations 50% with sync.Pool

Replace per-call []byte allocation with a pooled buffer.

goos: linux / goarch: amd64 / cpu: AMD Ryzen 9 5950X
│ old │ new │
│ sec/op │ sec/op vs base │
Parse-32 4.592µ ± 2% 3.041µ ± 1% -33.78% (p=0.000 n=10)
│ old │ new │
│ B/op │ B/op vs base │
Parse-32 1.024Ki ± 0% 0.512Ki ± 0% -50.00% (p=0.000 n=10)
│ old │ new │
│ allocs/op │ allocs/op vs base │
Parse-32 12.00 ± 0% 6.000 ± 0% -50.00% (p=0.000 n=10)
```
**Rules:**
- Only include benchmarks directly affected by the change — strip unrelated rows
- Never paste results with `~` (no statistical significance) — the improvement cannot be claimed
- Include the hardware context line (`goos/goarch/cpu`) so results are reproducible
- Use `perf(scope):` commit type for performance-only changes
## Profiling from Benchmarks
Generate profiles directly from benchmark runs — no HTTP server needed:
```bash
# CPU profile
go test -bench=BenchmarkParse -cpuprofile=cpu.prof ./pkg/parser
go tool pprof cpu.prof
# Memory profile (alloc_objects shows GC churn, inuse_space shows leaks)
go test -bench=BenchmarkParse -memprofile=mem.prof ./pkg/parser
go tool pprof -alloc_objects mem.prof
# Execution trace
go test -bench=BenchmarkParse -trace=trace.out ./pkg/parser
go tool trace trace.out
```
For full pprof CLI reference (all commands, non-interactive mode, profile interpretation), see [pprof Reference](./references/pprof.md). For execution trace interpretation, see [Trace Reference](./references/trace.md). For statistical comparison, see [benchstat Reference](./references/benchstat.md).
## Reference Files
- **[pprof Reference](./references/pprof.md)** — Interactive and non-interactive analysis of CPU, memory, and goroutine profiles. Full CLI commands, profile types (CPU vs `alloc_objects` vs `inuse_space`), web UI navigation, and interpretation patterns. Use this to dive deep into _where_ time and memory are being spent in your code.
- **[benchstat Reference](./references/benchstat.md)** — Statistical comparison of benchmark runs with rigorous confidence intervals and p-value tests. Covers output reading, filtering old benchmarks, interleaving results for visual clarity, and regression detection. Use this when you need to prove a change made a meaningful performance difference, not just a lucky run.
- **[Trace Reference](./references/trace.md)** — Execution tracer for understanding _when_ and _why_ code runs. Visualizes goroutine scheduling, garbage collection phases, network blocking, and custom span annotations. Use this when pprof (which shows _where_ CPU goes) isn't enough — you need to see the timeline of what happened.
- **[Diagnostic Tools](./references/tools.md)** — Quick reference for ancillary tools: fieldalignment (struct padding waste), GODEBUG (runtime logging flags), fgprof (wall-clock profiles covering both on-CPU and off-CPU time), race detector (concurrency bugs), and others. Use this when you have a specific symptom and need a focused diagnostic; don't reach for pprof if a simpler tool already answers your question.
- **[Compiler Analysis](./references/compiler-analysis.md)** — Low-level compiler optimization insights: escape analysis (when values move to the heap), inlining decisions (which function calls are eliminated), SSA dump (intermediate representation), and assembly output. Use this when benchmarks show allocations you didn't expect, or when you want to verify the compiler did what you intended.
- **[CI Regression Detection](./references/ci-regression.md)** — Automated performance regression gating in CI pipelines. Covers three tools (benchdiff for quick PR comparisons, cob for strict threshold-based gating, gobenchdata for long-term trend dashboards), noisy neighbor mitigation strategies (why cloud CI benchmarks vary 5-10% even on quiet machines), and self-hosted runner tuning to make benchmarks reproducible. Use this when you want to ensure pull requests don't silently slow down your codebase — detecting regressions early prevents shipping performance debt.
- **[Investigation Session](./references/investigation-session.md)** — Production performance troubleshooting workflow combining Prometheus runtime metrics (heap size, GC frequency, goroutine counts), PromQL queries to correlate metrics with code changes, runtime configuration flags (GODEBUG env vars to enable GC logging), and cost warnings (when a diagnostic itself imposes a performance tax). Use this when benchmarks look good but real production traffic behaves differently.
- **[Prometheus Go Metrics Reference](./references/prometheus-go-metrics.md)** — Complete listing of Go runtime metrics actually exposed as Prometheus metrics by `prometheus/client_golang`. Covers 30 default metrics, 40+ optional metrics (Go 1.17+), process metrics, and common PromQL queries. Distinguishes between `runtime/metrics` (Go internal data) and Prometheus metrics (what you scrape from `/metrics`). Use this when setting up monitoring dashboards or writing PromQL queries for production alerts.
## Cross-References
- → See `samber/cc-skills-golang@golang-performance` skill for optimization patterns to apply after measuring ("if X bottleneck, apply Y")
- → See `samber/cc-skills-golang@golang-troubleshooting` skill for pprof setup on running services (enable, secure, capture), Delve debugger, GODEBUG flags, root cause methodology
- → See `samber/cc-skills-golang@golang-observability` skill for everyday always-on monitoring, continuous profiling (Pyroscope), distributed tracing (OpenTelemetry)
- → See `samber/cc-skills-golang@golang-testing` skill for general testing practices
- → See `samber/cc-skills@promql-cli` skill for querying Prometheus runtime metrics in production to validate benchmark findings
This skill covers the full performance measurement workflow in Go: write benchmarks, run them with statistical rigor, profile CPU/memory/goroutine behavior with pprof, compare results before/after using benchstat, detect regressions in CI pipelines, and correlate production metrics with code changes using Prometheus. Use this skill when you need to measure where time and memory are spent, prove that an optimization actually works, or investigate why production performance diverges from local benchmarks. For optimization patterns to apply after measurement, see samber/cc-skills-golang@golang-performance. For pprof setup on running services, see samber/cc-skills-golang@golang-troubleshooting.
- `go` binary (1.24+ preferred for `b.Loop()`, but 1.20+ supported)
- `golang.org/x/perf/cmd/benchstat@latest` for statistical comparison
- `graphviz` (for pprof web UI), `curl` (if pulling metrics from live services)
- Prometheus endpoint at `/metrics` with `prometheus/client_golang` wired into your HTTP server

Input: Source file with target function to measure (e.g., `pkg/parser/parser.go`).
Output: Test file with benchmark function (e.g., pkg/parser/parser_test.go).
- Create a function named `BenchmarkXxx` in a `*_test.go` file.
- On Go 1.24+, use `b.Loop()` to wrap the code being measured. Setup code (loading fixtures, allocating buffers) goes outside the loop.
- On Go <1.24, use the legacy `for i := 0; i < b.N; i++` pattern. Call `b.ResetTimer()` after expensive setup if needed.
- To track allocations, call `b.ReportAllocs()` in the function body or use the `-benchmem` flag at runtime.
- To report a custom metric, call `b.ReportMetric(value, unit)` once after the loop.
- Use `b.Run(name, func(b *testing.B) { ... })` to create sub-benchmarks with different inputs.

Example:

```go
func BenchmarkParse(b *testing.B) {
	data := loadFixture("large.json")
	b.ReportAllocs()
	for b.Loop() {
		Parse(data)
	}
}

func BenchmarkParseVariations(b *testing.B) {
	for _, size := range []int{64, 256, 4096} {
		b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
			data := make([]byte, size)
			for b.Loop() {
				Encode(data)
			}
		})
	}
}
```
Input: Benchmark function(s) from Step 1, target package path.
Output: Plain text benchmark results saved to file (e.g., bench.txt).
- Run `go test -bench=<BenchmarkName or .> -benchmem -count=10 ./pkg/... | tee bench.txt` to execute all matching benchmarks 10 times.
- Add `-benchtime=3s` (or higher) if the default 1 second per benchmark is too short for statistical significance.
- Add `-cpu=1,2,4,8` to measure how the code scales with GOMAXPROCS if concurrency is relevant.
- Run on a quiet machine with a stable CPU frequency (e.g., inspect `grep cpu /proc/cpuinfo | head -1` and check scaling frequency).
- Save the output for later comparison with `benchstat`.

Example output:

```
BenchmarkParse-8                       5000000    230.5 ns/op    128 B/op    2 allocs/op
BenchmarkParseVariations/size=64-8    10000000    105.2 ns/op     64 B/op    1 allocs/op
BenchmarkParseVariations/size=256-8    2000000    598.3 ns/op    256 B/op    1 allocs/op
```
Input: Benchmark function(s), target package path, desired profile type (CPU, memory, trace).
Output: Profile file (.prof or .out extension).
- CPU profile: run `go test -bench=<BenchmarkName> -cpuprofile=cpu.prof ./pkg/parser`. Profile flags only work with a single package, so name the package instead of using `./pkg/...`.
- Memory profile: run `go test -bench=<BenchmarkName> -memprofile=mem.prof ./pkg/parser`.
- Execution trace: run `go test -bench=<BenchmarkName> -trace=trace.out ./pkg/parser`.
- Run `go tool pprof <profile-file>` and use pprof commands (`top`, `list`, `web`) to identify hot paths.
- For memory profiles, choose between `-alloc_objects` (total allocations, includes freed memory) and `-inuse_space` (memory still live when the profile was written, useful for detecting leaks).
- Run `go tool trace trace.out` to open the web UI and visualize the goroutine execution timeline, GC pauses, and blocking events.

Example:

```bash
go test -bench=BenchmarkParse -cpuprofile=cpu.prof ./pkg/parser
go tool pprof cpu.prof
# (in pprof) top10       # shows top 10 functions by CPU time
# (in pprof) list Parse  # shows annotated source of Parse function
# (in pprof) web         # generates a call graph (requires graphviz)
```
Input: Two or more benchmark result files (e.g., old.txt, new.txt).
Output: Statistical comparison table showing p-value, confidence interval, and percentage change.
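- Run the baseline benchmarks and save the output to `old.txt`.
- Apply the change, re-run the same benchmarks, and save the output to `new.txt`.
- Run `benchstat old.txt new.txt` to compute confidence intervals and p-values.
- If `~` appears in the delta column, the change is not statistically significant (noise). If a percentage appears with a p-value below 0.05, such as `(p=0.000 n=10)`, the result is significant at alpha=0.05.
- Claim an improvement only when `p < 0.05` (not `~`).

Example output (older benchstat layout; current releases print `sec/op` columns with a `vs base` delta):

```
name              old ns/op    new ns/op    delta
BenchmarkParse-8  230.5 ± 2%   152.3 ± 1%   -33.99% (p=0.000 n=10)

name              old B/op     new B/op     delta
BenchmarkParse-8  128 ± 0%     64 ± 0%      -50.00% (p=0.000 n=10)

name              old allocs   new allocs   delta
BenchmarkParse-8  2.00 ± 0%    1.00 ± 0%    -50.00% (p=0.000 n=10)
```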
Input: benchstat output, commit message, code change. Output: Commit with performance impact documented.
- Include the hardware context line (e.g., `goos: linux / goarch: amd64 / cpu: AMD Ryzen 9 5950X`).
- Include only statistically significant rows (no `~` symbol).
- Use `perf(scope):` as the commit type for performance-only changes.

Example commit:

```
perf(parser): reduce Parse allocations 50% with sync.Pool

Replace per-call []byte allocation with a pooled buffer.

goos: linux / goarch: amd64 / cpu: AMD Ryzen 9 5950X
│ old │ new │
│ sec/op │ sec/op vs base │
Parse-32 4.592µ ± 2% 3.041µ ± 1% -33.78% (p=0.000 n=10)
│ old │ new │
│ B/op │ B/op vs base │
Parse-32 1.024Ki ± 0% 0.512Ki ± 0% -50.00% (p=0.000 n=10)
```
Input: Benchmarks from previous steps, CI pipeline configuration (GitHub Actions, GitLab CI, etc.), baseline benchmark results. Output: CI job that gates pull requests if performance regresses beyond threshold.
- Store baseline benchmark results (e.g., `benchmarks/main.txt`) in the repository.
- Use `benchdiff` for quick visual PR comparisons (shows delta without strict thresholds).
- Use `cob` (continuous benchmarking) for strict threshold-based gating (e.g., fail if latency increases >5%).
- Use `gobenchdata` to append results to a JSON file and build long-term trend dashboards.

Example GitHub Actions workflow:

```yaml
- name: Run benchmarks
  run: go test -bench=. -benchmem -count=5 ./... | tee new.txt
- name: Compare with baseline
  run: benchstat benchmarks/main.txt new.txt | tee benchstat.txt
- name: Fail if regression >= 10%
  run: |
    # Regressions are positive deltas (e.g. +12.34%); negative deltas are improvements.
    # The pattern matches deltas of +10.00% or more on any metric.
    if grep -E '\+[0-9]{2,}\.[0-9]+%' benchstat.txt; then
      echo "Performance regression detected"
      exit 1
    fi
```
Input: Prometheus endpoint at /metrics, PromQL queries, runtime metrics.
Output: Correlation between code changes and production behavior.
- Expose runtime metrics by wiring `prometheus/client_golang` into your HTTP server.
- Watch the key metrics: `go_gc_duration_seconds_sum` (total GC pause time), `go_goroutines` (active goroutines), `go_memstats_heap_alloc_bytes` (heap size).
- Query `rate(go_gc_duration_seconds_count[5m])` to track GC frequency over time, and `rate(go_gc_duration_seconds_sum[5m])` to track time spent in GC pauses.
- Enable GC tracing (`GODEBUG=gctrace=1`) to log GC events to stderr and correlate with latency increases.

Example PromQL:

```promql
# GC frequency (higher = more pressure on heap)
rate(go_gc_duration_seconds_count[5m])
# Heap size (sustained growth = possible memory leak)
go_memstats_heap_alloc_bytes
# Goroutine count (spike = possible goroutine leak)
go_goroutines
```
- If the function call is trivially cheap (nanoseconds), then increase `-benchtime` to 5-10 seconds for statistical stability. A longer run does not stop the compiler from eliminating the call; use `b.Loop()` or a sink variable for that.
- If using Go <1.24, then use `for i := 0; i < b.N; i++` loops and add `b.ResetTimer()` after expensive setup; otherwise use `b.Loop()`.
- If the benchmark result shows allocation where none is expected, then run with `go test -gcflags="-m"` to see escape analysis output and verify the compiler isn't moving stack values to the heap unnecessarily.
- If benchstat output shows `~` (not statistically significant), then do not claim the optimization works; re-run with higher `-count` (e.g., `-count=20`) or longer `-benchtime` to reduce noise before drawing conclusions.
- If profiling reveals a hot path in third-party code, then before optimizing, check the third-party library's GitHub issues or file a bug (the maintainer may already have a fix in development).
- If running benchmarks on shared cloud CI hardware (AWS, GCP, Azure), then expect 5-10% variance between runs even on "quiet" machines; set regression thresholds to >10% to avoid false positives.
- If production metrics diverge from local benchmarks (e.g., GC frequency is higher in prod), then enable `GODEBUG=gctrace=1` on a canary instance to log GC events and correlate with code changes, heap size, or workload differences.
- If tracing reveals goroutine blocking on locks or I/O, then use the `golang-troubleshooting` skill for Delve debugger setup to inspect goroutine state at runtime.
Benchmark runs produce a plain-text file (e.g., `bench.txt`) with rows in the format:

```
BenchmarkName-<GOMAXPROCS> <iterations> <time-per-op> <memory-per-op> <allocs-per-op>
```

Profile files are binary outputs:

- `cpu.prof` (input to `go tool pprof`)
- `mem.prof` (input to `go tool pprof`)
- `trace.out` (input to `go tool trace`)

benchstat output is a plain-text table showing three columns per metric (old, new, delta with p-value and sample size):

```
name            old ns/op   new ns/op   delta
BenchmarkXxx-8  100 ± 5%    90 ± 3%     -10.00% (p=0.000 n=10)
```

Commit message includes benchstat table in body with hardware context (goos/goarch/cpu line) and only statistically significant results (no ~ rows).
CI regression job outputs a pass/fail status and optionally a benchstat diff URL or artifact for review.
- `go test -bench=. -benchmem ./...` completes without errors and outputs a readable table with ns/op, B/op, and allocs/op columns.
- `go tool pprof <profile-file>` opens interactively and `top10` shows function names and CPU/memory percentages (not zero values).
- benchstat output shows a percentage delta with a p-value (e.g., `p=0.000`) for significant results and `~` for noise; only then is the optimization claim defensible.
- Every improvement claimed in a commit has `p < 0.05` (statistically significant).