Cut AI token costs by 60-85% with deterministic query validation and intelligent caching. Standalone or integrates with Company Brain Core OS. Free, open-sou...
---
name: Token Reduction Engine
description: "Cut AI token costs by 60-85% with deterministic query validation and intelligent caching. Standalone or integrates with Company Brain Core OS. Free, open-source."
---
# Token Reduction Engine (TRE)
> **[Company Brain Core OS](https://certainlogic.ai/brain) — Free, local, deterministic knowledge base for your agent. Start here if you need persistent facts before optimizing token spend.**
**Cut AI costs. Not quality. Not accuracy.**
v1.0.0
**Built and dogfooded by CertainLogicAI** — Saves us 79%+ on every cached query. Measurement beats hope.
---
## Part of the CertainLogic Stack
This skill works **standalone**. You don't need anything else to start cutting token costs today.
**Works even better with [Company Brain Core OS](https://certainlogic.ai/brain)** (`clawhub install company-brain-os`):
- Brain stores validated facts — TRE caches deterministic responses
- **AgentPathfinder** — audits which cache policies led to good outcomes
- **Smart Router** — routes non-cached queries to cheaper model tiers
**All four are independent.** Pick the one that solves your problem. Add others when you need them.
---
## How It Works — Three Layers
TRE sits between your agent and the LLM. It intercepts queries and decides: cache hit, deterministic lookup, or expensive LLM call.
```
User Query → TRE Decision Layer → Result (cache | brain | LLM)
```
### Layer 1: Query Cache (Answer Cache)
Stores LLM responses so repeated queries return instantly — zero tokens.
- **SHA-256 hash** of exact query as cache key
- **LRU eviction** — keeps hot answers, drops cold ones
- **TTL-based expiry** — stale data auto-purges (default: 3600s)
- **Hallucination Guard** — hedged answers ("I think", "maybe", "not sure") are shown to user but NOT cached
```python
# After first LLM call:
tre.cache_answer("What is our refund policy?", "30 days, no questions.")
# Every subsequent call:
answer = tre.get_cached_answer("What is our refund policy?")
# → Instant. Zero tokens. No LLM hit.
```
### Layer 2: Deterministic Lookup
For questions with factual answers, skip the LLM entirely.
- **Intent classification** — matches query to known domain (strategy, product, security, etc.)
- **Facts DB overlay** — structured key-value pairs with source attribution
- **Zero LLM cost** — pre-validated facts served direct from SQLite
```
"What products do we offer?" → Intent: product → Facts DB → Instant answer
"How does the brain work?" → Intent: strategy → Facts DB → Instant answer
"Who wrote the security policy?" → Intent: security → Facts DB → Instant answer
```
### Layer 3: Intent Filtering (Sanity Gate)
Prevents expensive LLM calls that can't possibly have good answers.
- **Forbidden intents** — blocks "brain.delete_brain", "brain.purge", malicious commands
- **Required fields gate** — rejects malformed queries before they hit the LLM
- **Domain alignment** — refuses answers outside declared intent scope (prevents hallucinated guesses)
---
## Major Benefits
### 1. **60-85% Token Cost Reduction**
- **Cache hit rate: up to 79%** (measured on Brain workloads — your results depend on query repetition)
- **Deterministic path: <10ms** vs 500-2000ms LLM round-trip
- **Cost comparison**: $0 per cached answer vs $0.01-0.15 per LLM call
| Workload | Without TRE | With TRE | Savings |
|----------|------------|----------|---------|
| Repetitive queries (support, FAQ) | $10-15/day | $1-3/day | **85%** |
| Mixed unique + repeat | $300/mo | $45-90/mo | **40-70%** |
### 2. **Zero Hallucination on Cached Facts**
- **Hallucination Guard** detects hedged language: "I think", "maybe", "probably", "not sure"
- **Flagged responses are shown but not cached** — prevents poisoned cache
- **SH A-256 verification** on every write — tamper detection for audit trails
- **Source attribution** — every cached fact traces back to who loaded it and when
### 3. **Instant Sub-10ms Responses**
- **Instant responses** — SQLite-backed local cache serves answers from RAM (no network round-trip)
- **Cold → hot acceleration** — cache warmup brings first-query latency from ~700ms to sub-50ms
### 4. **Agent Won't Run Amok**
- Forbidden command list — "brain.delete_brain", "brain.purge" are blocked before execution
- Intent scope enforcement — agent refuses queries outside declared domain
- SHA-256 write verification — any tampered cache entry is rejected on read
- [AgentPathfinder](https://certainlogic.ai/pathfinder) integration — HMAC-signed audit trail on every decision
### 5. **Works With Any LLM Stack**
- **OpenAI** (GPT-4o, GPT-4o-mini)
- **Anthropic** (Claude 3.5 Sonnet, Claude 3 Opus)
- **Local models** (Ollama, llama.cpp, vLLM)
- **Multi-provider setups** — TRE sits in front of all of them
```yaml
# Works with any backend
llm_provider: openai # or anthropic, local, azure, etc.
tre:
cache_ttl: 3600
max_cache_size: 10000
hallucination_guard: true
```
### 6. **Persisted Across Restarts**
- Cache auto-saves to disk on every write
- Load persisted cache on startup — no warm-up period
- Configurable persistence path — store anywhere (local disk, shared volume, etc.)
### 7. **Standalone — No Lock-in**
- Runs on localhost — no external API, no SaaS dependency
- MIT License — free to fork, modify, commercialize
- No telemetry — your queries never leave your machine
- 50-line integration — drop into existing agent in minutes
---
## Standalone Usage
```bash
clawhub install certainlogic-tre
```
```python
from tre import cache_answer, get_cached_answer, get_metrics
# After getting an LLM response
tre.cache_answer(query, answer)
# Next time — zero tokens
cached = tre.get_cached_answer(query)
if cached:
answer, token_count = cached
print(f"Instant! Saved {token_count} tokens.")
# Check your savings
print(tre.get_metrics())
# {'cache_hits': 853, 'cache_misses': 223, 'cache_hit_rate_percent': 79.3}
```
See `references/CONFIGURATION.md` for full config options (TTL, cache size, guard sensitivity, forbidden commands).
See `references/API.md` for Python API reference.
---
## Integration with Company Brain Core OS
When Brain is installed, TRE gets smarter:
- **Facts DB pre-populates cache** — Brain's 52+ facts are available instantly
- **Intent classification improves** — Brain's domain model refines routing
- **Query classification** — Brain categorizes new queries, expanding cache coverage
```bash
# Install both
clawhub install company-brain-os certainlogic-tre
# Brain loads facts → TRE caches them → instant responses for common questions
```
---
## Metrics (Measured Ourselves)
| Metric | Value | Measurement Period |
|--------|-------|-------------------|
| Cache hit rate | **79.3%** | 2026-05-09, 615 queries |
| Avg latency (cache hit) | **<10ms** | Local SQLite |
| Avg latency (LLM miss) | **757ms** | OpenRouter fallback |
| Hallucination guard flags | **0.2%** | Hedge language detection |
| Token cost savings | **85%** | Cached queries vs LLM calls |
| Forbidden command blocks | **100%** | All `delete_brain`, `purge` blocked |
---
## Free vs Pro
| Feature | Free | Pro (Planned) |
|---------|------|---------------|
| Query answer cache | ✅ | ✅ |
| Deterministic lookup | ✅ | ✅ |
| Hallucination guard | ✅ | ✅ |
| Intent filtering | ✅ | ✅ |
| Cache persistence | ✅ | ✅ |
| Brain integration | ✅ | ✅ |
| **Distributed cache** (Redis) | ❌ | ✅ |
| **Cache analytics dashboard** | ❌ | ✅ |
| **A/B testing policies** | ❌ | ✅ |
| **Team policy sharing** | ❌ | ✅ |
| **Auto policy optimization** | ❌ | ✅ |
**Pro pricing:** TBD — join the waitlist at certainlogic.ai/tre
---
## License
MIT — free to use, modify, distribute.
**Built with brutal honesty by [CertainLogic](https://certainlogic.ai)**
don't have the plugin yet? install it then click "run inline in claude" again.