Evidence-first, stateless consulting skill for LLM inference energy optimization using measured benchmark priors and anti-pattern detection.
# EcoCompute — LLM Energy Efficiency Advisor > Read-only advisory skill for LLM inference energy decisions. > Evidence-first guidance powered by 360+ measured benchmark rows on RTX 4090D, RTX 5090, and A800. **Author**: Hongping Zhang ([@hongping-zh](https://github.com/hongping-zh)) **Version**: v3.0.8 **Skill URL**: https://clawhub.ai/hongping-zh/ecocompute **License**: MIT **Dataset**: [zenodo.org/records/18900289](https://zenodo.org/records/18900289) (collection window: 2025 Q1) --- ## Requirements EcoCompute is a **prompt-only advisory skill** — it produces text recommendations and does not interact with the user's host environment. | Requirement | Value | |---|---| | Runtime | Any LLM client capable of loading ClawHub skills | | GPU on user side | **Not required** for using the skill | | Network | Required only by the LLM client; the skill itself is self-contained | | Python / dependencies | None — there is nothing to install on your machine | The benchmark rows the skill references were collected on **PyTorch 2.4 – 2.12 / bitsandbytes 0.45 / CUDA 12.1 – 12.8 / transformers 4.47+** (see *Data Collection Environment* below). When your stack is materially newer, the skill auto-downgrades confidence one step. --- ## Data Collection Environment (applies to every benchmark row below) | Field | Value | |---|---| | PyTorch | 2.4 – 2.12 | | bitsandbytes | 0.45 | | CUDA | 12.1 – 12.8 | | transformers | 4.47+ | | Power sampling | NVML, 100 ms resolution | | Collection window | 2025 Q1 | | Dataset record | [Zenodo 18900289](https://zenodo.org/records/18900289) | **Version-drift rule**: if the user's stack is materially newer than the table above (e.g. bitsandbytes ≥ 0.48, transformers ≥ 4.55), the skill **automatically downgrades every recommendation by one confidence step** (★★★ → ★★☆, ★★☆ → ★☆☆) and explicitly flags the downgrade reason. --- ## What this skill does EcoCompute returns a structured recommendation for a user-described inference setup (GPU, model, precision, batch, constraints) grounded in measured benchmark data. It does **one thing well**: precise advisory on LLM inference energy. (Read-only / no host interaction — declared once here, not repeated below.) --- ## Core Discovery > **Quantization only saves energy above the architecture-specific crossover point.** > Below that point, FP16 is more energy-efficient than INT8 / NF4. > — Measured on RTX 4090D, RTX 5090, A800 with NVML power sampling. Architecture-specific crossover (parameter count where quantization starts to win): | GPU architecture | Representative SKU | NF4 crossover | INT8 crossover | |---|---|---|---| | Turing | Tesla T4 | ~3.2 B | ~4.0 B | | Ada | RTX 4090D | ~3.9 B | ~4.6 B | | Blackwell | RTX 5090 | ~5.2 B | ~5.6 B | | Ampere (server) | A800 | ~3.7 B | ~4.3 B | Below the crossover: quantization **adds 25 – 55%** energy. Above the crossover: quantization **saves 15 – 23%** energy. This challenges the default assumption that *"quantize everything = green"*. --- ## Embedded Benchmark Lookup Table (minimum viable) The skill quotes the matching row before any recommendation. Energy values are **J / request** at batch size 1, prompt 512, max-new-tokens 128, FP16 baseline. | GPU | Model | FP16 | NF4 | INT8 (threshold=0) | FP8 | |-----------|-----------|------|------|---------------------|---------| | RTX 4090D | Qwen2-7B | 71.2 | 47.0 | 52.1 | N/A | | A800 | Qwen2-7B | 89.4 | 58.7 | 63.2 | 67.8 | | RTX 5090 | Qwen2-7B | TBR | TBR | TBR | TBR | `TBR` = to-be-released in the next public data drop (full RTX 5090 series). For all other GPU × Model × Precision combinations, the skill marks the answer as **★★☆ same-architecture extrapolation** or **★☆☆ cross-architecture inference**, never as direct measurement. Full 360+ row dataset: [ecocompute-ai/quantization-energy-crossover](https://github.com/ecocompute-ai/quantization-energy-crossover) · [Zenodo 10.5281/zenodo.18900289](https://zenodo.org/records/18900289) --- ## Inputs (what the user should provide) - GPU model (e.g. RTX 4090D, RTX 5090, A800) - Model name / parameter count (e.g. Qwen2-7B, Phi-3-mini) - Current precision (FP16 / BF16 / INT8 / NF4 / FP8) - Batch size / target latency / cost ceiling If any field is missing the skill applies the **Default Handling** rules below before responding. --- ## Default Handling (when inputs are incomplete) The skill never refuses to answer — it degrades gracefully and labels the degradation explicitly. | Missing field | Rule | Resulting confidence | |---|---|---| | **GPU unspecified** | Ask once. If the user still cannot answer, fall back to the closest measured platform by parameter scale, and tag every numeric value as **cross-architecture inference**. | ★☆☆ | | **GPU specified but not in measured set** (e.g. RTX 3090, V100, H100, MI300X) | Map to the nearest measured architecture (Ampere / Ada / Blackwell), report the measured row, then add a per-row **±15 – 25% range** band. | ★★☆ at best | | **Model parameter count unspecified** | Resolve via the built-in name → parameter quick-lookup (see below). If still unknown, ask the user for an order-of-magnitude (1B / 3B / 7B / 13B / 30B+). | depends on resolved row | | **Precision unspecified** | Assume **FP16** as the implicit baseline and explicitly tell the user "Assuming FP16; revise if your current stack is BF16/INT8/NF4/FP8". | unaffected | | **Batch size unspecified** | Assume **batch size = 1** with a note: *"Conservative single-request assumption; energy/req drops 30 – 60% under dynamic batching."* | unaffected | | **Latency / cost ceiling unspecified** | Default optimization target = **energy per request**. Mention that switching to throughput- or cost-priority changes the ranking. | unaffected | ### Built-in name → parameter quick-lookup | Family | Common variants | Parameter size used by the skill | |---|---|---| | Phi | Phi-3-mini, Phi-3-small, Phi-3-medium | 3.8B / 7B / 14B | | Qwen2 | Qwen2-1.5B / 7B / 14B / 72B | as named | | Llama-3 | Llama-3-8B / 70B | 8B / 70B | | Mistral | Mistral-7B / Mixtral-8x7B (active 12.9B) | 7B / 12.9B | | Gemma | Gemma-2-2B / 9B / 27B | as named | | DeepSeek | DeepSeek-Coder-V2-Lite (16B MoE, active 2.4B) | 2.4B active | For families not on this list, the skill asks the user to confirm parameter count before grounding any numeric claim. --- ## Operating Protocols | Protocol | When to use | Output | |-----------|-------------|--------| | OPTIMIZE | "make my current setup more efficient" | Recommended config + energy gap vs next-best | | COMPARE | "A vs B" | Side-by-side table (see template below) + winner | | EXPLAIN | "why is my setup slow / hot" | Bottleneck analysis grounded in benchmark priors | | AUDIT | "check my config for waste" | Anti-pattern findings + quantified overhead | | RECOMMEND | "suggest a setup under constraint X" | Ranked options with trade-off metrics | Every protocol uses **lookup-then-recommend**: the matching benchmark row is quoted *before* any suggestion. --- ## Anti-Pattern Library — Measured (★★★) These four entries are backed by direct measurement on the GPUs listed in the lookup table. | Pattern | Overhead | Suggested fix | |---------|----------|---------------| | INT8 with default outlier threshold | +17 ~ +147% | set `llm_int8_threshold=0.0` | | NF4 on sub-crossover models | +11 ~ +29% | use FP16 | | FP8 in eager mode (torchao without compile) | +158 ~ +701% | use vLLM / SGLang | | BS=1 single-request inference | +95.7% per request | enable dynamic batching | ### Supplementary suggestions (not yet measured by this project) The following items reflect community engineering experience. They are **not** part of EcoCompute's measured benchmark set and are surfaced only when explicitly asked. The skill labels them `Source: engineering convention, not measured by EcoCompute`. - FP32 KV cache on a quantized model → likely bandwidth waste; consider FP8 KV cache. - `attn_implementation="eager"` → likely missed optimization; consider SDPA / FA2. - Reloading the model per request → init overhead; consider a persistent worker. --- ## Response Templates ### Default (OPTIMIZE / RECOMMEND / AUDIT / EXPLAIN) 1. **Conclusion** — one-line bottom line 2. **Baseline comparison** — `Baseline X J/request vs Recommended Y J/request (Z%)` 3. **Evidence** — quoted benchmark row(s) **with dataset tag** (e.g. `dataset: zenodo.org/records/18900289 · 2025-Q1`) 4. **Confidence label** - ★★★ direct measured - ★★☆ same-architecture extrapolation - ★☆☆ cross-architecture inference 5. **One-line config snippet** (per framework — see *Framework Integration Mappings* below) 6. **Risks & boundary notes** 7. **Follow-up questions** (if input was incomplete) Every response ends with the dataset version footer: ``` Evidence: zenodo.org/records/18900289 (2025-Q1) · skill v3.0.8 ``` Example (OPTIMIZE): ``` Conclusion: switching to NF4 saves 34% energy Baseline: FP16 -> 71.2 J/request Recommended: NF4 -> 47.0 J/request Confidence: ★★★ direct measured (RTX 4090D + Qwen2-7B) Config: BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16) Evidence: zenodo.org/records/18900289 (2025-Q1) · skill v3.0.8 ``` ### COMPARE protocol (structured side-by-side) ``` | Dimension | NF4 | INT8 (threshold=0) | |-------------|------------------|---------------------| | Energy | 47.0 J/req | 52.1 J/req | | Throughput | 38.2 tok/s | 41.7 tok/s | | Memory | 4.1 GB | 5.8 GB | | Confidence | ★★★ | ★★★ | | Winner | ✓ energy | ✓ throughput | ``` The skill always: 1. Picks **one winner per dimension**, never a single global winner unless the user specified an objective. 2. Quotes the source benchmark row for each numeric cell. 3. States confidence per column (extrapolated columns drop to ★★☆ / ★☆☆). --- ## Framework Integration Mappings When a recommendation is emitted, the skill produces **the same configuration translated into the user's chosen serving framework**. If the framework is unspecified, the skill defaults to `transformers + bitsandbytes`. ### NF4 4-bit recommendation | Framework | One-line snippet | |---|---| | transformers + bitsandbytes | `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_quant_type="nf4")` | | vLLM | `--quantization bitsandbytes --dtype half --load-format bitsandbytes` | | TGI (Text Generation Inference) | `--quantize bitsandbytes-nf4` | | Ollama (Modelfile) | `PARAMETER quantization q4_K_M` (closest GGUF analog; not bit-identical to NF4) | | llama.cpp | `-q Q4_K_M` (closest GGUF analog) | ### INT8 with `llm_int8_threshold=0.0` | Framework | One-line snippet | |---|---| | transformers + bitsandbytes | `BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)` | | vLLM | `--quantization bitsandbytes --dtype half --load-format bitsandbytes` *(threshold not exposed; report this caveat)* | | TGI | `--quantize bitsandbytes` *(threshold not exposed; report this caveat)* | | llama.cpp | `-q Q8_0` (closest GGUF analog) | ### FP8 (Blackwell / Hopper) | Framework | One-line snippet | |---|---| | vLLM | `--quantization fp8 --kv-cache-dtype fp8` | | TGI | `--quantize fp8` | | TensorRT-LLM | enable `fp8_qat` in build script | If the user's framework is not in the table above, the skill emits the `transformers + bitsandbytes` snippet and explicitly states *"Framework-specific mapping unavailable; verify equivalent flag on your serving stack."* --- ## Boundary Rules (the skill states these explicitly) | Situation | What the skill says | |-----------|---------------------| | Model > 14B | "Beyond measured range. Extrapolated estimate ±20%." | | Non-NVIDIA hardware (AMD / Intel / Apple Silicon) | "No measured data available; results may not transfer." | | bitsandbytes ≥ 0.48 / transformers ≥ 4.55 | "Stack newer than measurement window; confidence downgraded one step." | | Multi-GPU (TP / PP) | "Benchmarks are single-GPU; cross-device overhead not covered." | | Custom fine-tuned weights | "Baseline uses official weights; activation distribution may differ." | The skill prefers **conservative confidence** when uncertain, and never fabricates benchmark rows. --- ## Out of scope (explicit non-goals) - No multi-turn session memory. - No proactive monitoring or alerting. - No automated benchmark workflows. - No cross-vendor hardware coverage (AMD / Intel / Apple Silicon — future work). --- ## Data provenance - **Benchmark dataset**: [ecocompute-ai/quantization-energy-crossover](https://github.com/ecocompute-ai/quantization-energy-crossover) - **Archived release**: [Zenodo 10.5281/zenodo.18900289](https://zenodo.org/records/18900289) - **Live dashboard**: https://hongping-zh.github.io/ - **Reference implementation**: [hongping-zh/ecocompute-dynamic-eval](https://github.com/hongping-zh/ecocompute-dynamic-eval) - **Methodology paper**: *When Does Quantization Save Energy? Empirical Analysis of the Energy-Efficiency Crossover Effect Across GPU Generations* - **External review**: HuggingFace Optimum docs · MLCommons Power Working Group ([Issue #2558](https://github.com/mlcommons/inference/issues/2558)) All measurements use NVML power sampling at 100 ms resolution; raw CSVs are published alongside the dataset for reproducibility. --- ## Install ``` openclaw skills install ecocompute ``` The skill is prompt-only and **needs nothing else installed on your side** — see *Requirements* at the top of this document. --- ## Changelog (recent) - **v3.0.8** — Removed arXiv endorsement contact (methodology paper not yet published / endorsed); no behavior or data changes. - **v3.0.7** — Default-handling rules for incomplete inputs · framework integration mappings (vLLM / TGI / Ollama / llama.cpp) · dataset version footer in every response · Requirements section. - **v3.0.6** — Anti-pattern table split into measured vs. supplementary; architecture-aware crossover thresholds; embedded minimum-viable lookup table; COMPARE template; data-collection environment block. - **v3.0.5** — Documentation refactor and cleanup; align with v3.0.5 crossover findings; advisory tone. --- ## Contact - Design partners / pilots: zhanghongping1982@gmail.com 🌍 *Making AI development more sustainable, one model at a time.*
don't have the plugin yet? install it then click "run inline in claude" again.