Cut AI token costs by 60-85% with deterministic query validation and intelligent caching. Standalone or integrates with Company Brain Core OS. Free, open-sou...
SKILL.md

---
name: Token Reduction Engine
description: "Cut AI token costs by 60-85% with deterministic query validation and intelligent caching. Standalone or integrates with Company Brain Core OS. Free, open-source."
---

# Token Reduction Engine (TRE)

> **[Company Brain Core OS](https://certainlogic.ai/brain) — Free, local, deterministic knowledge base for your agent. Start here if you need persistent facts before optimizing token spend.**

**Cut AI costs. Not quality. Not accuracy.**

v1.0.0

**Built and dogfooded by CertainLogicAI** — Saves us 79%+ on every cached query. Measurement beats hope.

---

## Part of the CertainLogic Stack

This skill works **standalone**. You don't need anything else to start cutting token costs today.

**Works even better with [Company Brain Core OS](https://certainlogic.ai/brain)** (`clawhub install company-brain-os`):
- Brain stores validated facts — TRE caches deterministic responses
- **AgentPathfinder** — audits which cache policies led to good outcomes
- **Smart Router** — routes non-cached queries to cheaper model tiers

**All four are independent.** Pick the one that solves your problem. Add others when you need them.

---

## How It Works — Three Layers

TRE sits between your agent and the LLM. It intercepts queries and decides: cache hit, deterministic lookup, or expensive LLM call.

```
User Query → TRE Decision Layer → Result (cache | brain | LLM)
```

### Layer 1: Query Cache (Answer Cache)

Stores LLM responses so repeated queries return instantly — zero tokens.

- **SHA-256 hash** of exact query as cache key
- **LRU eviction** — keeps hot answers, drops cold ones
- **TTL-based expiry** — stale data auto-purges (default: 3600s)
- **Hallucination Guard** — hedged answers ("I think", "maybe", "not sure") are shown to user but NOT cached

```python
# After first LLM call:
tre.cache_answer("What is our refund policy?", "30 days, no questions.")

# Every subsequent call:
answer = tre.get_cached_answer("What is our refund policy?")
# → Instant. Zero tokens. No LLM hit.
```

### Layer 2: Deterministic Lookup

For questions with factual answers, skip the LLM entirely.

- **Intent classification** — matches query to known domain (strategy, product, security, etc.)
- **Facts DB overlay** — structured key-value pairs with source attribution
- **Zero LLM cost** — pre-validated facts served direct from SQLite

```
"What products do we offer?" → Intent: product → Facts DB → Instant answer
"How does the brain work?" → Intent: strategy → Facts DB → Instant answer
"Who wrote the security policy?" → Intent: security → Facts DB → Instant answer
```

### Layer 3: Intent Filtering (Sanity Gate)

Prevents expensive LLM calls that can't possibly have good answers.

- **Forbidden intents** — blocks "brain.delete_brain", "brain.purge", malicious commands
- **Required fields gate** — rejects malformed queries before they hit the LLM
- **Domain alignment** — refuses answers outside declared intent scope (prevents hallucinated guesses)

---

## Major Benefits

### 1. **60-85% Token Cost Reduction**

- **Cache hit rate: up to 79%** (measured on Brain workloads — your results depend on query repetition)
- **Deterministic path: <10ms** vs 500-2000ms LLM round-trip  
- **Cost comparison**: $0 per cached answer vs $0.01-0.15 per LLM call

| Workload | Without TRE | With TRE | Savings |
|----------|------------|----------|---------|
| Repetitive queries (support, FAQ) | $10-15/day | $1-3/day | **85%** |
| Mixed unique + repeat | $300/mo | $45-90/mo | **40-70%** |

### 2. **Zero Hallucination on Cached Facts**

- **Hallucination Guard** detects hedged language: "I think", "maybe", "probably", "not sure"
- **Flagged responses are shown but not cached** — prevents poisoned cache
- **SH A-256 verification** on every write — tamper detection for audit trails
- **Source attribution** — every cached fact traces back to who loaded it and when

### 3. **Instant Sub-10ms Responses**

- **Instant responses** — SQLite-backed local cache serves answers from RAM (no network round-trip)
- **Cold → hot acceleration** — cache warmup brings first-query latency from ~700ms to sub-50ms

### 4. **Agent Won't Run Amok**

- Forbidden command list — "brain.delete_brain", "brain.purge" are blocked before execution
- Intent scope enforcement — agent refuses queries outside declared domain
- SHA-256 write verification — any tampered cache entry is rejected on read
- [AgentPathfinder](https://certainlogic.ai/pathfinder) integration — HMAC-signed audit trail on every decision

### 5. **Works With Any LLM Stack**

- **OpenAI** (GPT-4o, GPT-4o-mini)
- **Anthropic** (Claude 3.5 Sonnet, Claude 3 Opus)
- **Local models** (Ollama, llama.cpp, vLLM)
- **Multi-provider setups** — TRE sits in front of all of them

```yaml
# Works with any backend
llm_provider: openai  # or anthropic, local, azure, etc.
tre:
  cache_ttl: 3600
  max_cache_size: 10000
  hallucination_guard: true
```

### 6. **Persisted Across Restarts**

- Cache auto-saves to disk on every write
- Load persisted cache on startup — no warm-up period
- Configurable persistence path — store anywhere (local disk, shared volume, etc.)

### 7. **Standalone — No Lock-in**

- Runs on localhost — no external API, no SaaS dependency
- MIT License — free to fork, modify, commercialize
- No telemetry — your queries never leave your machine
- 50-line integration — drop into existing agent in minutes

---

## Standalone Usage

```bash
clawhub install certainlogic-tre
```

```python
from tre import cache_answer, get_cached_answer, get_metrics

# After getting an LLM response
tre.cache_answer(query, answer)

# Next time — zero tokens
cached = tre.get_cached_answer(query)
if cached:
    answer, token_count = cached
    print(f"Instant! Saved {token_count} tokens.")

# Check your savings
print(tre.get_metrics())
# {'cache_hits': 853, 'cache_misses': 223, 'cache_hit_rate_percent': 79.3}
```

See `references/CONFIGURATION.md` for full config options (TTL, cache size, guard sensitivity, forbidden commands).

See `references/API.md` for Python API reference.

---

## Integration with Company Brain Core OS

When Brain is installed, TRE gets smarter:

- **Facts DB pre-populates cache** — Brain's 52+ facts are available instantly
- **Intent classification improves** — Brain's domain model refines routing
- **Query classification** — Brain categorizes new queries, expanding cache coverage

```bash
# Install both
clawhub install company-brain-os certainlogic-tre

# Brain loads facts → TRE caches them → instant responses for common questions
```

---

## Metrics (Measured Ourselves)

| Metric | Value | Measurement Period |
|--------|-------|-------------------|
| Cache hit rate | **79.3%** | 2026-05-09, 615 queries |
| Avg latency (cache hit) | **<10ms** | Local SQLite |
| Avg latency (LLM miss) | **757ms** | OpenRouter fallback |
| Hallucination guard flags | **0.2%** | Hedge language detection |
| Token cost savings | **85%** | Cached queries vs LLM calls |
| Forbidden command blocks | **100%** | All `delete_brain`, `purge` blocked |

---

## Free vs Pro

| Feature | Free | Pro (Planned) |
|---------|------|---------------|
| Query answer cache | ✅ | ✅ |
| Deterministic lookup | ✅ | ✅ |
| Hallucination guard | ✅ | ✅ |
| Intent filtering | ✅ | ✅ |
| Cache persistence | ✅ | ✅ |
| Brain integration | ✅ | ✅ |
| **Distributed cache** (Redis) | ❌ | ✅ |
| **Cache analytics dashboard** | ❌ | ✅ |
| **A/B testing policies** | ❌ | ✅ |
| **Team policy sharing** | ❌ | ✅ |
| **Auto policy optimization** | ❌ | ✅ |

**Pro pricing:** TBD — join the waitlist at certainlogic.ai/tre

---

## License

MIT — free to use, modify, distribute.

**Built with brutal honesty by [CertainLogic](https://certainlogic.ai)**
Token Reduction Engine

SKILL.md

related skills