turboquant-pytorch

Name: turboquant-pytorch
Author: aradotso

PyTorch implementation of TurboQuant for LLM KV cache compression using two-stage vector quantization (random rotation + Lloyd-Max + QJL residual correction).

SKILL.md

TurboQuant PyTorch

Skill by ara.so — Daily 2026 Skills collection.

A from-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for compressing LLM KV caches. It achieves about 5x compression at 3 bits per coordinate with 99.5% attention fidelity, using two-stage vector quantization.

What It Does

TurboQuant compresses LLM key-value caches to 2–4 bits per coordinate:

Stage 1: Random orthogonal rotation + Lloyd-Max scalar quantization (MSE-optimal)

Stage 2: QJL (Quantized Johnson-Lindenstrauss) residual correction: a 1-bit sign projection of the Stage 1 residual that makes inner-product estimates unbiased

Result: attention scores remain accurate even when individual reconstructed vectors look quite different from the originals, because the algorithm is designed to preserve inner products rather than per-vector fidelity.
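The two stages can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the repository's code: the function names (`random_rotation`, `lloyd_max_levels`, `stage1`, `stage2_correction`) are hypothetical, the Lloyd-Max levels are fitted numerically on Gaussian samples rather than taken from a precomputed table, and the sign projection uses a dense Gaussian matrix with as many rows as the head dimension. The queries are deliberately correlated with the keys so that the shrinkage bias of Stage 1 alone is visible.

```python
import math
import torch

def random_rotation(d, gen):
    # QR of a Gaussian matrix yields a random orthogonal matrix.
    q, r = torch.linalg.qr(torch.randn(d, d, generator=gen))
    return q * torch.sign(torch.diagonal(r))

def lloyd_max_levels(bits, gen, iters=50, n=100_000):
    # Lloyd iterations on N(0, 1) samples: a numerical stand-in
    # for an analytic Lloyd-Max table (assumption of this sketch).
    s = torch.randn(n, generator=gen)
    levels = torch.quantile(s, torch.linspace(0, 1, 2**bits + 2)[1:-1])
    for _ in range(iters):
        idx = torch.bucketize(s, (levels[:-1] + levels[1:]) / 2)
        levels = torch.stack([s[idx == k].mean() for k in range(2**bits)])
    return levels

def stage1(x, rot, levels):
    # Rotate, snap each coordinate to its nearest level, rotate back.
    d = x.shape[-1]
    norm = x.norm(dim=-1, keepdim=True)
    y = (x / norm) @ rot.T * math.sqrt(d)  # coordinates roughly N(0, 1)
    idx = torch.bucketize(y, (levels[:-1] + levels[1:]) / 2)
    return (levels[idx] / math.sqrt(d)) @ rot * norm

def stage2_correction(q, resid, sketch):
    # Store only sign(S r) and ||r||, then estimate <q, r> from them.
    m = sketch.shape[0]
    signs = torch.sign(resid @ sketch.T)  # 1 bit per projection
    proj_q = q @ sketch.T                 # the query stays in full precision
    scale = math.sqrt(math.pi / 2) / m
    return scale * resid.norm(dim=-1) * (signs * proj_q).sum(-1)

gen = torch.Generator().manual_seed(0)
n, d = 2000, 128
keys = torch.randn(n, d, generator=gen)
queries = keys + torch.randn(n, d, generator=gen)  # correlated with keys

rot = random_rotation(d, gen)
levels = lloyd_max_levels(bits=2, gen=gen)
sketch = torch.randn(d, d, generator=gen)

keys_hat = stage1(keys, rot, levels)
true = (queries * keys).sum(-1)
base = (queries * keys_hat).sum(-1)
corrected = base + stage2_correction(queries, keys - keys_hat, sketch)

cos = torch.nn.functional.cosine_similarity(keys, keys_hat, dim=-1).mean()
print(f"mean cosine(key, reconstruction): {cos.item():.3f}")
print(f"stage 1 only, mean score / true:  {(base.mean() / true.mean()).item():.3f}")
print(f"with correction, mean score / true: {(corrected.mean() / true.mean()).item():.3f}")
```

In this sketch the Stage 1 scores are systematically shrunk relative to the true scores, while the corrected scores average out close to the true value even though the reconstructed keys themselves are visibly lossy. That is the sense in which the method targets inner products rather than vector fidelity.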

Compression ratios at 8K context on Qwen2.5-3B (289 MB FP16 baseline):

4-bit → 76 MB (3.8x)

3-bit → 58 MB (5.0x) ← practical sweet spot

2-bit → 40 MB (7.3x)
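As a quick check on these figures, the snippet below recomputes each ratio from the listed sizes and converts it to an effective number of bits per coordinate, assuming 16 bits per coordinate for the FP16 baseline. The variable names are illustrative. From the rounded sizes the 2-bit ratio comes out at 7.2x rather than 7.3x, which is presumably a rounding artifact of the listed megabyte values.

```python
FP16_BITS = 16
baseline_mb = 289
# Nominal bit width -> reported cache size in MB, taken from the list above.
reported = {4: 76, 3: 58, 2: 40}

summary = {}
for bits, size_mb in reported.items():
    ratio = baseline_mb / size_mb
    effective_bits = FP16_BITS / ratio  # bits actually stored per coordinate
    summary[bits] = (round(ratio, 1), round(effective_bits, 2))
    print(f"{bits}-bit: {ratio:.1f}x, about {effective_bits:.2f} bits per coordinate")
```

Each setting lands roughly 0.2 bits above its nominal width, which is why 3-bit gives about 5.0x rather than the ideal 16/3 ≈ 5.3x. The source does not itemize where that overhead comes from.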
