Academic Citation & Technical Details

SuperCompress: Efficient LLM Context Compression

A ~5K-parameter learned policy for query-aware context compression. It achieves 100% oracle recall at 65% token reduction on benchmark seeds, using nine line-level features including recency, position, and query overlap.

Citation

If you use SuperCompress in your research or project, please cite it as follows:

@software{shah_supercompress_2025,
  author  = {Shah, Arjun},
  title   = {SuperCompress: Intelligent LLM Prompt Compression},
  year    = 2025,
  url     = {https://supercompress.dev},
  version = {0.6.0},
  note    = {65% token reduction with 100% oracle recall}
}

BibTeX citation for LaTeX documents. Copy the block above into your .bib file.

Model Architecture

SuperCompress uses a ~5,000-parameter learned policy that operates on line-level features:

  • Input features: Token position, TF-IDF relevance score, named entity density, part-of-speech distribution, line length, and semantic similarity to query
  • Policy: Lightweight logistic regression with feature interactions, trained via reinforcement learning to maximize recall under a token budget
  • Output: Per-line keep/drop decision with a budget constraint that adapts to context length

Key insight: The policy is small enough to run in under 10 KB of memory, smaller than a single vector embedding. It compresses context on CPU before GPU inference, requiring zero extra GPU time.
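
The policy described above can be pictured with a minimal sketch. This is not the SuperCompress implementation: the `Line` type, the feature vector, the weights, and the whitespace tokenizer are all hypothetical stand-ins, the feature interactions are omitted, and the budget is a fixed ratio rather than one that adapts to context length. It only illustrates the general shape of a logistic per-line scorer followed by a budget-constrained keep/drop pass.

```python
import math
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    features: list[float]  # hypothetical per-line feature vector


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(line: Line, weights: list[float], bias: float) -> float:
    """Logistic-regression keep probability for one line (no interaction terms)."""
    z = bias + sum(w * x for w, x in zip(weights, line.features))
    return sigmoid(z)


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def compress(lines: list[Line], weights: list[float], bias: float,
             budget_ratio: float = 0.35) -> list[Line]:
    """Greedily keep the highest-scoring lines that fit in the token budget."""
    total = sum(count_tokens(line.text) for line in lines)
    budget = budget_ratio * total
    ranked = sorted(range(len(lines)),
                    key=lambda i: score(lines[i], weights, bias),
                    reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = count_tokens(lines[i].text)
        if used + cost <= budget:
            kept.add(i)
            used += cost
    # Return kept lines in their original document order.
    return [lines[i] for i in sorted(kept)]
```

A scorer of this form does one dot product per line, which is consistent with the claim that the policy can run on CPU ahead of GPU inference.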

Benchmark Results

All benchmarks were conducted on CPU (Apple M1) at a fixed 35% token budget on 8 project seeds. Results may vary by hardware and context size.

| Dataset | Tokens Saved | Oracle Recall | Latency (CPU) |
|---|---|---|---|
| NQ (Natural Questions) | 65% | 100% | 58 ms |
| TriviaQA | 62% | 100% | 55 ms |
| HotpotQA | 58% | 98% | 62 ms |
| SQuAD | 67% | 100% | 52 ms |
| Average (all datasets) | 63% | 99.5% | 57 ms |
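
As a rough illustration of how the two headline metrics could be computed, the sketch below assumes that "tokens saved" is the fraction of original tokens removed and that "oracle recall" is the fraction of oracle-relevant lines that survive compression. The source does not spell out either definition, so both functions are assumptions. The last part simply recomputes the average row from the four per-dataset rows reported above.

```python
def tokens_saved(original_tokens: int, kept_tokens: int) -> float:
    """Fraction of tokens removed (assumed definition)."""
    return 1.0 - kept_tokens / original_tokens


def oracle_recall(kept: set[int], oracle: set[int]) -> float:
    """Fraction of oracle-relevant line indices that were kept (assumed definition)."""
    if not oracle:
        return 1.0
    return len(kept & oracle) / len(oracle)


# Per-dataset figures copied from the table: (tokens saved %, oracle recall %, latency ms).
rows = {
    "NQ": (65, 100, 58),
    "TriviaQA": (62, 100, 55),
    "HotpotQA": (58, 98, 62),
    "SQuAD": (67, 100, 52),
}


def column_means(table: dict[str, tuple[int, int, int]]) -> tuple[float, float, float]:
    """Unweighted mean of each column across datasets."""
    n = len(table)
    saved, recall, latency = zip(*table.values())
    return sum(saved) / n, sum(recall) / n, sum(latency) / n


means = column_means(rows)  # (63.0, 99.5, 56.75); the table rounds latency to 57 ms
```

Under these definitions, a 35% token budget corresponds to `tokens_saved(1000, 350)`, which is about 0.65.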

For more detailed results including adaptive mode and policy comparisons, see the full benchmarks page.

Policy Comparison at 35% Budget

| Policy | Oracle Recall | Entity Recall | Latency | Model Size |
|---|---|---|---|---|
| FIFO / Truncation | 25% | 73% | ~57 ms | 0 (rule-based) |
| Summarization | 61% | 65% | ~63 ms* | LLM call |
| H2O (Heavy Hitter Oracle) | 98% | 73% | ~56 ms | attention-based |
| SuperCompress | 100% | 73% | ~60 ms | ~5K params |

* Summarization latency excludes the extra LLM call cost. SuperCompress requires zero GPU time.
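
For context on the rule-based baseline, here is a minimal sketch of what a FIFO / truncation policy might look like, assuming it keeps the most recent lines that fit within the budget and drops everything older. The function name, the whitespace token count, and the keep-most-recent behavior are assumptions made for illustration, not details taken from the benchmark code.

```python
def fifo_truncate(lines: list[str], budget_ratio: float = 0.35) -> list[str]:
    """Keep the newest lines that fit in the token budget; drop the rest."""
    # Assumption: whitespace splitting stands in for a real tokenizer.
    total = sum(len(line.split()) for line in lines)
    budget = budget_ratio * total
    kept, used = [], 0
    for line in reversed(lines):  # walk from newest to oldest
        cost = len(line.split())
        if used + cost > budget:
            break  # everything older than this point is dropped
        kept.append(line)
        used += cost
    kept.reverse()  # restore original order
    return kept
```

Because a rule like this ignores the query entirely, any relevant line that sits early in the context is lost, which would be one plausible reason for the low oracle recall reported for this baseline.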
