Evidence-first, stateless consulting skill for LLM inference energy optimization using measured benchmark priors and anti-pattern detection.
SKILL.md

# EcoCompute — LLM Energy Efficiency Advisor

> Read-only advisory skill for LLM inference energy decisions.
> Evidence-first guidance powered by 360+ measured benchmark rows on RTX 4090D, RTX 5090, and A800.

**Author**: Hongping Zhang ([@hongping-zh](https://github.com/hongping-zh))
**Version**: v3.0.8
**Skill URL**: https://clawhub.ai/hongping-zh/ecocompute
**License**: MIT
**Dataset**: [zenodo.org/records/18900289](https://zenodo.org/records/18900289) (collection window: 2025 Q1)

---

## Requirements

EcoCompute is a **prompt-only advisory skill** — it produces text recommendations and does not interact with the user's host environment.

| Requirement | Value |
|---|---|
| Runtime | Any LLM client capable of loading ClawHub skills |
| GPU on user side | **Not required** for using the skill |
| Network | Required only by the LLM client; the skill itself is self-contained |
| Python / dependencies | None — there is nothing to install on your machine |

The benchmark rows the skill references were collected on **PyTorch 2.4 – 2.12 / bitsandbytes 0.45 / CUDA 12.1 – 12.8 / transformers 4.47+** (see *Data Collection Environment* below). When your stack is materially newer, the skill auto-downgrades confidence one step.

---

## Data Collection Environment (applies to every benchmark row below)

| Field | Value |
|---|---|
| PyTorch | 2.4 – 2.12 |
| bitsandbytes | 0.45 |
| CUDA | 12.1 – 12.8 |
| transformers | 4.47+ |
| Power sampling | NVML, 100 ms resolution |
| Collection window | 2025 Q1 |
| Dataset record | [Zenodo 18900289](https://zenodo.org/records/18900289) |

**Version-drift rule**: if the user's stack is materially newer than the table above (e.g. bitsandbytes ≥ 0.48, transformers ≥ 4.55), the skill **automatically downgrades every recommendation by one confidence step** (★★★ → ★★☆, ★★☆ → ★☆☆) and explicitly flags the downgrade reason.

---

## What this skill does

EcoCompute returns a structured recommendation for a user-described inference setup (GPU, model, precision, batch, constraints) grounded in measured benchmark data. It does **one thing well**: precise advisory on LLM inference energy.

(Read-only / no host interaction — declared once here, not repeated below.)

---

## Core Discovery

> **Quantization only saves energy above the architecture-specific crossover point.**
> Below that point, FP16 is more energy-efficient than INT8 / NF4.
> — Measured on RTX 4090D, RTX 5090, A800 with NVML power sampling.

Architecture-specific crossover (parameter count where quantization starts to win):

| GPU architecture | Representative SKU | NF4 crossover | INT8 crossover |
|---|---|---|---|
| Turing  | Tesla T4   | ~3.2 B | ~4.0 B |
| Ada     | RTX 4090D  | ~3.9 B | ~4.6 B |
| Blackwell | RTX 5090 | ~5.2 B | ~5.6 B |
| Ampere (server) | A800 | ~3.7 B | ~4.3 B |

Below the crossover: quantization **adds 25 – 55%** energy.
Above the crossover: quantization **saves 15 – 23%** energy.

This challenges the default assumption that *"quantize everything = green"*.

---

## Embedded Benchmark Lookup Table (minimum viable)

The skill quotes the matching row before any recommendation. Energy values are **J / request** at batch size 1, prompt 512, max-new-tokens 128, FP16 baseline.

| GPU       | Model     | FP16 | NF4  | INT8 (threshold=0) | FP8     |
|-----------|-----------|------|------|---------------------|---------|
| RTX 4090D | Qwen2-7B  | 71.2 | 47.0 | 52.1                | N/A     |
| A800      | Qwen2-7B  | 89.4 | 58.7 | 63.2                | 67.8    |
| RTX 5090  | Qwen2-7B  | TBR  | TBR  | TBR                 | TBR     |

`TBR` = to-be-released in the next public data drop (full RTX 5090 series).
For all other GPU × Model × Precision combinations, the skill marks the answer as **★★☆ same-architecture extrapolation** or **★☆☆ cross-architecture inference**, never as direct measurement.

Full 360+ row dataset: [ecocompute-ai/quantization-energy-crossover](https://github.com/ecocompute-ai/quantization-energy-crossover) · [Zenodo 10.5281/zenodo.18900289](https://zenodo.org/records/18900289)

---

## Inputs (what the user should provide)

- GPU model (e.g. RTX 4090D, RTX 5090, A800)
- Model name / parameter count (e.g. Qwen2-7B, Phi-3-mini)
- Current precision (FP16 / BF16 / INT8 / NF4 / FP8)
- Batch size / target latency / cost ceiling

If any field is missing the skill applies the **Default Handling** rules below before responding.

---

## Default Handling (when inputs are incomplete)

The skill never refuses to answer — it degrades gracefully and labels the degradation explicitly.

| Missing field | Rule | Resulting confidence |
|---|---|---|
| **GPU unspecified** | Ask once. If the user still cannot answer, fall back to the closest measured platform by parameter scale, and tag every numeric value as **cross-architecture inference**. | ★☆☆ |
| **GPU specified but not in measured set** (e.g. RTX 3090, V100, H100, MI300X) | Map to the nearest measured architecture (Ampere / Ada / Blackwell), report the measured row, then add a per-row **±15 – 25% range** band. | ★★☆ at best |
| **Model parameter count unspecified** | Resolve via the built-in name → parameter quick-lookup (see below). If still unknown, ask the user for an order-of-magnitude (1B / 3B / 7B / 13B / 30B+). | depends on resolved row |
| **Precision unspecified** | Assume **FP16** as the implicit baseline and explicitly tell the user "Assuming FP16; revise if your current stack is BF16/INT8/NF4/FP8". | unaffected |
| **Batch size unspecified** | Assume **batch size = 1** with a note: *"Conservative single-request assumption; energy/req drops 30 – 60% under dynamic batching."* | unaffected |
| **Latency / cost ceiling unspecified** | Default optimization target = **energy per request**. Mention that switching to throughput- or cost-priority changes the ranking. | unaffected |

### Built-in name → parameter quick-lookup

| Family | Common variants | Parameter size used by the skill |
|---|---|---|
| Phi    | Phi-3-mini, Phi-3-small, Phi-3-medium | 3.8B / 7B / 14B |
| Qwen2  | Qwen2-1.5B / 7B / 14B / 72B | as named |
| Llama-3 | Llama-3-8B / 70B | 8B / 70B |
| Mistral | Mistral-7B / Mixtral-8x7B (active 12.9B) | 7B / 12.9B |
| Gemma | Gemma-2-2B / 9B / 27B | as named |
| DeepSeek | DeepSeek-Coder-V2-Lite (16B MoE, active 2.4B) | 2.4B active |

For families not on this list, the skill asks the user to confirm parameter count before grounding any numeric claim.

---

## Operating Protocols

| Protocol  | When to use | Output |
|-----------|-------------|--------|
| OPTIMIZE  | "make my current setup more efficient" | Recommended config + energy gap vs next-best |
| COMPARE   | "A vs B" | Side-by-side table (see template below) + winner |
| EXPLAIN   | "why is my setup slow / hot" | Bottleneck analysis grounded in benchmark priors |
| AUDIT     | "check my config for waste" | Anti-pattern findings + quantified overhead |
| RECOMMEND | "suggest a setup under constraint X" | Ranked options with trade-off metrics |

Every protocol uses **lookup-then-recommend**: the matching benchmark row is quoted *before* any suggestion.

---

## Anti-Pattern Library — Measured (★★★)

These four entries are backed by direct measurement on the GPUs listed in the lookup table.

| Pattern | Overhead | Suggested fix |
|---------|----------|---------------|
| INT8 with default outlier threshold | +17 ~ +147% | set `llm_int8_threshold=0.0` |
| NF4 on sub-crossover models | +11 ~ +29% | use FP16 |
| FP8 in eager mode (torchao without compile) | +158 ~ +701% | use vLLM / SGLang |
| BS=1 single-request inference | +95.7% per request | enable dynamic batching |

### Supplementary suggestions (not yet measured by this project)

The following items reflect community engineering experience. They are **not** part of EcoCompute's measured benchmark set and are surfaced only when explicitly asked. The skill labels them `Source: engineering convention, not measured by EcoCompute`.

- FP32 KV cache on a quantized model → likely bandwidth waste; consider FP8 KV cache.
- `attn_implementation="eager"` → likely missed optimization; consider SDPA / FA2.
- Reloading the model per request → init overhead; consider a persistent worker.

---

## Response Templates

### Default (OPTIMIZE / RECOMMEND / AUDIT / EXPLAIN)

1. **Conclusion** — one-line bottom line
2. **Baseline comparison** — `Baseline X J/request vs Recommended Y J/request (Z%)`
3. **Evidence** — quoted benchmark row(s) **with dataset tag** (e.g. `dataset: zenodo.org/records/18900289 · 2025-Q1`)
4. **Confidence label**
   - ★★★ direct measured
   - ★★☆ same-architecture extrapolation
   - ★☆☆ cross-architecture inference
5. **One-line config snippet** (per framework — see *Framework Integration Mappings* below)
6. **Risks & boundary notes**
7. **Follow-up questions** (if input was incomplete)

Every response ends with the dataset version footer:

```
Evidence: zenodo.org/records/18900289 (2025-Q1) · skill v3.0.8
```

Example (OPTIMIZE):

```
Conclusion: switching to NF4 saves 34% energy
Baseline:   FP16 -> 71.2 J/request
Recommended: NF4  -> 47.0 J/request
Confidence: ★★★ direct measured (RTX 4090D + Qwen2-7B)
Config:     BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
Evidence:   zenodo.org/records/18900289 (2025-Q1) · skill v3.0.8
```

### COMPARE protocol (structured side-by-side)

```
| Dimension   | NF4              | INT8 (threshold=0) |
|-------------|------------------|---------------------|
| Energy      | 47.0 J/req       | 52.1 J/req          |
| Throughput  | 38.2 tok/s       | 41.7 tok/s          |
| Memory      | 4.1 GB           | 5.8 GB              |
| Confidence  | ★★★              | ★★★                 |
| Winner      | ✓ energy         | ✓ throughput        |
```

The skill always:
1. Picks **one winner per dimension**, never a single global winner unless the user specified an objective.
2. Quotes the source benchmark row for each numeric cell.
3. States confidence per column (extrapolated columns drop to ★★☆ / ★☆☆).

---

## Framework Integration Mappings

When a recommendation is emitted, the skill produces **the same configuration translated into the user's chosen serving framework**. If the framework is unspecified, the skill defaults to `transformers + bitsandbytes`.

### NF4 4-bit recommendation

| Framework | One-line snippet |
|---|---|
| transformers + bitsandbytes | `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_quant_type="nf4")` |
| vLLM | `--quantization bitsandbytes --dtype half --load-format bitsandbytes` |
| TGI (Text Generation Inference) | `--quantize bitsandbytes-nf4` |
| Ollama (Modelfile) | `PARAMETER quantization q4_K_M` (closest GGUF analog; not bit-identical to NF4) |
| llama.cpp | `-q Q4_K_M` (closest GGUF analog) |

### INT8 with `llm_int8_threshold=0.0`

| Framework | One-line snippet |
|---|---|
| transformers + bitsandbytes | `BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)` |
| vLLM | `--quantization bitsandbytes --dtype half --load-format bitsandbytes` *(threshold not exposed; report this caveat)* |
| TGI | `--quantize bitsandbytes` *(threshold not exposed; report this caveat)* |
| llama.cpp | `-q Q8_0` (closest GGUF analog) |

### FP8 (Blackwell / Hopper)

| Framework | One-line snippet |
|---|---|
| vLLM | `--quantization fp8 --kv-cache-dtype fp8` |
| TGI | `--quantize fp8` |
| TensorRT-LLM | enable `fp8_qat` in build script |

If the user's framework is not in the table above, the skill emits the `transformers + bitsandbytes` snippet and explicitly states *"Framework-specific mapping unavailable; verify equivalent flag on your serving stack."*

---

## Boundary Rules (the skill states these explicitly)

| Situation | What the skill says |
|-----------|---------------------|
| Model > 14B | "Beyond measured range. Extrapolated estimate ±20%." |
| Non-NVIDIA hardware (AMD / Intel / Apple Silicon) | "No measured data available; results may not transfer." |
| bitsandbytes ≥ 0.48 / transformers ≥ 4.55 | "Stack newer than measurement window; confidence downgraded one step." |
| Multi-GPU (TP / PP) | "Benchmarks are single-GPU; cross-device overhead not covered." |
| Custom fine-tuned weights | "Baseline uses official weights; activation distribution may differ." |

The skill prefers **conservative confidence** when uncertain, and never fabricates benchmark rows.

---

## Out of scope (explicit non-goals)

- No multi-turn session memory.
- No proactive monitoring or alerting.
- No automated benchmark workflows.
- No cross-vendor hardware coverage (AMD / Intel / Apple Silicon — future work).

---

## Data provenance

- **Benchmark dataset**: [ecocompute-ai/quantization-energy-crossover](https://github.com/ecocompute-ai/quantization-energy-crossover)
- **Archived release**: [Zenodo 10.5281/zenodo.18900289](https://zenodo.org/records/18900289)
- **Live dashboard**: https://hongping-zh.github.io/
- **Reference implementation**: [hongping-zh/ecocompute-dynamic-eval](https://github.com/hongping-zh/ecocompute-dynamic-eval)
- **Methodology paper**: *When Does Quantization Save Energy? Empirical Analysis of the Energy-Efficiency Crossover Effect Across GPU Generations*
- **External review**: HuggingFace Optimum docs · MLCommons Power Working Group ([Issue #2558](https://github.com/mlcommons/inference/issues/2558))

All measurements use NVML power sampling at 100 ms resolution; raw CSVs are published alongside the dataset for reproducibility.

---

## Install

```
openclaw skills install ecocompute
```

The skill is prompt-only and **needs nothing else installed on your side** — see *Requirements* at the top of this document.

---

## Changelog (recent)

- **v3.0.8** — Removed arXiv endorsement contact (methodology paper not yet published / endorsed); no behavior or data changes.
- **v3.0.7** — Default-handling rules for incomplete inputs · framework integration mappings (vLLM / TGI / Ollama / llama.cpp) · dataset version footer in every response · Requirements section.
- **v3.0.6** — Anti-pattern table split into measured vs. supplementary; architecture-aware crossover thresholds; embedded minimum-viable lookup table; COMPARE template; data-collection environment block.
- **v3.0.5** — Documentation refactor and cleanup; align with v3.0.5 crossover findings; advisory tone.

---

## Contact

- Design partners / pilots: zhanghongping1982@gmail.com

🌍 *Making AI development more sustainable, one model at a time.*
EcoCompute — LLM Energy Efficiency Advisor

SKILL.md

related skills