LLM token compression before inference
LLM token compression removes waste before the model spends compute on it. The goal is to preserve the evidence needed for the answer while cutting everything else.
What is LLM token compression?
LLM token compression reduces the number of tokens sent to a language model during inference while preserving the information needed to answer the current query.
In a typical RAG query, 60-80% of the retrieved context may be irrelevant to the specific question. Those tokens are still billed as input and still add latency.
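To make the cost of irrelevant context concrete, here is a rough back-of-the-envelope sketch. The context size (8,000 tokens per query) and the price ($3 per million input tokens) are hypothetical placeholders, not figures from this guide; substitute your own numbers.

```python
def wasted_cost(context_tokens: int, irrelevant_fraction: float, price_per_million: float) -> float:
    """Dollars spent on context tokens that do not help answer the query."""
    if not 0.0 <= irrelevant_fraction <= 1.0:
        raise ValueError("irrelevant_fraction must be between 0 and 1")
    return context_tokens * irrelevant_fraction * price_per_million / 1_000_000


# Hypothetical workload: 8,000 retrieved tokens per query at $3 per million input tokens.
for fraction in (0.6, 0.8):
    per_query = wasted_cost(8_000, fraction, 3.0)
    print(f"{fraction:.0%} irrelevant: ${per_query:.4f} wasted per query")
```

Under these assumed numbers, the waste is between one and two cents per query, which adds up quickly at high query volume.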
Token compression methods compared
| Method | Oracle Recall | Latency | Extra Cost |
|---|---|---|---|
| SuperCompress | 100% | ~60ms CPU | None |
| Summarization | 61% | ~63ms + LLM call | Full LLM cost |
| Truncation (FIFO) | 25% | ~57ms | None |
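The table does not define Oracle Recall or describe the benchmark setup. The sketch below assumes it means the fraction of answer-bearing passages that survive compression, and it assumes FIFO truncation drops the oldest passages first until the rest fit a token budget. Both definitions, the function names, and the sample passages are illustrative assumptions; the output is not meant to reproduce the numbers in the table.

```python
def truncate_fifo(passages: list[str], budget: int) -> list[str]:
    """Keep the most recent passages that fit the budget, dropping the oldest first."""
    kept: list[str] = []
    used = 0
    for passage in reversed(passages):
        cost = len(passage.split())  # whitespace words as a rough stand-in for model tokens
        if used + cost > budget:
            break
        kept.append(passage)
        used += cost
    kept.reverse()  # restore original order
    return kept


def oracle_recall(kept: list[str], evidence: set[str]) -> float:
    """Fraction of answer-bearing passages still present after compression."""
    if not evidence:
        return 1.0
    return len(evidence.intersection(kept)) / len(evidence)


passages = [
    "The invoice was issued on 3 March.",
    "Our office is closed on public holidays.",
    "Payment is due within 30 days of the invoice date.",
    "Parking is available behind the building.",
]
evidence = {passages[0], passages[2]}  # passages needed to answer "When is payment due?"
kept = truncate_fifo(passages, budget=16)
print(oracle_recall(kept, evidence))  # prints 0.5
```

Truncation here keeps the last two passages and loses the invoice date, because it decides by position rather than by relevance to the query.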
Integrating with your LLM pipeline
```shell
pip install supercompress
```

```python
from supercompress import Compressor

comp = Compressor()
# context: the retrieved text; query: the user's question
result = comp.compress(context, query)
```
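The snippet above covers only the compression call. As a minimal sketch of where that step sits in a pipeline, the function below takes the compressor and the model as plain callables. `answer_with_compression`, `keep_matching_lines`, and `echo_model` are hypothetical names invented for this example, and the toy compressor is a stand-in, not SuperCompress's algorithm. This guide does not document what `comp.compress` returns, so a real integration may need a small adapter to turn its result into a string.

```python
from typing import Callable


def answer_with_compression(
    context: str,
    query: str,
    compress: Callable[[str, str], str],
    call_model: Callable[[str], str],
) -> str:
    """Compress the context for this query, then send the shorter prompt to the model."""
    compressed = compress(context, query)
    prompt = f"Context:\n{compressed}\n\nQuestion: {query}"
    return call_model(prompt)


# Stand-ins so the sketch runs without any package or API key.
def keep_matching_lines(context: str, query: str) -> str:
    """Toy compressor: keep lines that share at least one word with the query."""
    query_words = set(query.lower().split())
    lines = [
        line for line in context.splitlines()
        if query_words & set(line.lower().split())
    ]
    return "\n".join(lines)


def echo_model(prompt: str) -> str:
    """Toy model: report how many words it received."""
    return f"received {len(prompt.split())} words"


context = "Invoices: payment is due within 30 days.\nVisitors can park behind the building."
print(answer_with_compression(context, "When is payment due?", keep_matching_lines, echo_model))
```

Because the model is just a callable that accepts a prompt, the same wrapper works whether `call_model` wraps a hosted API or a local model.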
Frequently asked questions
Does token compression require a model download?
No. SuperCompress uses a tiny CPU policy and does not require a model download or GPU.
Does it work with local models?
Yes. Compression is model-agnostic and works before any model call.
Try it yourself
Paste your long prompt into the playground, ask a question, and see what SuperCompress keeps and removes. Free, no signup needed.