Academic Citation & Technical Details

SuperCompress: Efficient LLM Context Compression

A ~5K-parameter learned policy for query-aware context compression. It achieves 100% oracle recall at 65% token reduction on benchmark seeds, using nine line-level features that include recency, position, and query overlap.

Citation

If you use SuperCompress in your research or project, please cite it as follows:

@software{shah_supercompress_2025,
  author  = {Shah, Arjun},
  title   = {SuperCompress: Intelligent LLM Prompt Compression},
  year    = 2025,
  url     = {https://supercompress.dev},
  version = {0.6.0},
  note    = {65% token reduction with 100% oracle recall}
}

BibTeX citation for LaTeX documents — copy the block above into your .bib file.

Model Architecture

SuperCompress uses a ~5,000-parameter learned policy that operates on line-level features:

Input features: Token position, TF-IDF relevance score, named entity density, part-of-speech distribution, line length, and semantic similarity to query
Policy: Lightweight logistic regression with feature interactions, trained via reinforcement learning to maximize recall under a token budget
Output: Per-line keep/drop decision with a budget constraint that adapts to context length
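The scoring-and-selection flow described above can be sketched as follows. This is a minimal illustration, not the SuperCompress implementation: the function names (`with_interactions`, `keep_probability`, `compress`), the pairwise-product form of the feature interactions, whitespace token counting, and the greedy selection rule are all assumptions made for this example. The weights are taken as given; the reinforcement-learning training step is not shown.

```python
import math
from itertools import combinations


def with_interactions(features):
    """Append pairwise products to a feature vector (assumed interaction form)."""
    return list(features) + [a * b for a, b in combinations(features, 2)]


def keep_probability(features, weights, bias):
    """Logistic regression score for one line: sigmoid(w . x + b)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def compress(lines, features, weights, bias, budget=0.35):
    """Keep the highest-scoring lines that fit within `budget` of the total tokens.

    `features[i]` is the feature vector for `lines[i]`. Tokens are counted by
    whitespace splitting, which is a simplification for this sketch.
    """
    token_counts = [len(line.split()) for line in lines]
    limit = budget * sum(token_counts)
    scores = [keep_probability(f, weights, bias) for f in features]

    kept, used = set(), 0
    # Greedy: visit lines from highest to lowest score, skipping any that overflow.
    for i in sorted(range(len(lines)), key=lambda i: scores[i], reverse=True):
        if used + token_counts[i] <= limit:
            kept.add(i)
            used += token_counts[i]

    # Return the surviving lines in their original order.
    return [lines[i] for i in sorted(kept)]
```

Because the budget is expressed as a fraction of the input's total token count, the absolute number of tokens kept scales with context length, which is one simple way to read "adapts to context length".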

Key insight: The policy is small enough to run in under 10 KB of memory, smaller than a single vector embedding. It compresses context on CPU before GPU inference, requiring zero extra GPU time.

Benchmark Results

All benchmarks were conducted on a CPU (Apple M1) at a fixed 35% token budget on 8 project seeds. Results may vary by hardware and context size.

Dataset	Tokens Saved	Oracle Recall	Latency (CPU)
NQ (Natural Questions)	65%	100%	58ms
TriviaQA	62%	100%	55ms
HotpotQA	58%	98%	62ms
SQuAD	67%	100%	52ms
Average (all datasets)	63%	99.5%	57ms
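For readers reproducing these columns, the sketch below shows one plausible way to compute them. The definitions are assumptions for this example, since they are not spelled out here: token reduction is taken as the fraction of tokens removed, and oracle recall as the fraction of oracle-relevant lines that survive compression. The helper names are hypothetical. The last part recomputes the "Average" row from the per-dataset figures in the table.

```python
from statistics import mean


def token_reduction(original_tokens, kept_tokens):
    """Fraction of tokens removed by compression (assumed definition)."""
    return 1.0 - kept_tokens / original_tokens


def oracle_recall(kept_lines, oracle_lines):
    """Fraction of oracle-relevant line indices that were kept (assumed definition)."""
    oracle = set(oracle_lines)
    if not oracle:
        return 1.0  # nothing to recall, so treat as perfect
    return len(oracle & set(kept_lines)) / len(oracle)


# Per-dataset figures copied from the table: (tokens saved %, oracle recall %, latency ms)
results = {
    "NQ": (65, 100, 58),
    "TriviaQA": (62, 100, 55),
    "HotpotQA": (58, 98, 62),
    "SQuAD": (67, 100, 52),
}

# Column-wise unweighted means; latency is 56.75 ms before rounding to 57 ms.
avg_saved, avg_recall, avg_latency = (mean(col) for col in zip(*results.values()))
```

The averages are simple unweighted means across the four datasets, which matches the 63%, 99.5%, and 57 ms reported in the last row.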

For more detailed results including adaptive mode and policy comparisons, see the full benchmarks page.

Policy Comparison at 35% Budget

Policy	Oracle Recall	Entity Recall	Latency	Model Size
FIFO / Truncation	25%	73%	~57 ms	0 (rule-based)
Summarization	61%	65%	~63 ms*	LLM call
H2O (Heavy Hitter Oracle)	98%	73%	~56 ms	attention-based
SuperCompress	100%	73%	~60 ms	~5K params

* Summarization latency excludes the extra LLM call cost. SuperCompress requires zero GPU time.

Resources

GitHub Repository
PyPI Package — pip install supercompress
Interactive Comparison Tool
Full Benchmarks
Token Compression Guide
Playground
Awesome Token Compression List