LLM Context Compression Benchmarks
Compiler-mode savings for real API-call behavior, plus fixed-ratio baselines for research comparison.
Policy Comparison at 35% Token Budget
| Policy | Oracle Recall | Entity Recall | Latency | KV Savings | Model Size |
|---|---|---|---|---|---|
| FIFO / Truncation | 25% | 73% | ~57 ms | ~65% | 0 (rule-based) |
| Summarization | 61% | 65% | ~63 ms | ~65% | LLM call |
| H2O (Heavy Hitter Oracle) | 98% | 73% | ~56 ms | ~65% | attention-based |
| SuperCompress Best | 100% | 73% | ~60 ms | ~65% | ~5K params |
Compiler Mode — Real API-Call Behavior
Compiler mode does not ask users for a budget. It removes as many tokens as it can while preserving query-critical evidence, and it returns verifier metadata with each result: important context kept, risk, kept blocks, and dropped blocks.
| Context Type | Original Tokens | After Compression | Tokens Removed | Savings |
|---|---|---|---|---|
| To Kill a Mockingbird study context | 1,454 | 611 | 843 | 58.0% |
| Long coding session log | 1,020 | 76 | 944 | 92.5% |
| Markdown documentation | 1,195 | 85 | 1,110 | 92.9% |
| Agent incident log | 1,074 | 56 | 1,018 | 94.8% |
| Average (savings pooled over all tokens) | 1,186 | 207 | 979 | 82.5% |
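The Savings column is tokens removed divided by original tokens. The short Python sketch below recomputes it from the table's token counts. It also shows that the 82.5% in the Average row matches the pooled ratio (total removed over total original), while the unweighted mean of the four row percentages comes out slightly higher, at roughly 84.5%. The helper name `savings_pct` is illustrative and not part of SuperCompress.

```python
# Token counts copied from the table above: (original, after compression).
rows = {
    "To Kill a Mockingbird study context": (1454, 611),
    "Long coding session log": (1020, 76),
    "Markdown documentation": (1195, 85),
    "Agent incident log": (1074, 56),
}


def savings_pct(original: int, compressed: int) -> float:
    """Percentage of tokens removed from one context."""
    return 100.0 * (original - compressed) / original


per_row = {name: savings_pct(o, c) for name, (o, c) in rows.items()}

# Pooled savings: weight every context by its size.
total_original = sum(o for o, _ in rows.values())
total_compressed = sum(c for _, c in rows.values())
pooled = savings_pct(total_original, total_compressed)

# Unweighted mean: every context counts equally, whatever its size.
mean_of_rows = sum(per_row.values()) / len(per_row)

for name, pct in per_row.items():
    print(f"{name}: {pct:.1f}%")
print(f"Pooled savings: {pooled:.1f}%")
print(f"Unweighted mean of row savings: {mean_of_rows:.2f}%")
```

The pooled figure is lower because the largest context (the study context) is also the one that compresses least.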
Environmental Impact at Scale
Estimates are based on the documented SuperCompress assumptions: 2,500 tokens per GPU-second, a 150 W GPU, a 55% KV share, and 0.417 kg CO₂/kWh.
| Scale | Tokens Avoided | kWh Saved | CO₂ Avoided | Water Saved (est.) |
|---|---|---|---|---|
| 1 model call | ~800 | ~0.00003 | ~0.00001 kg | ~0.0001 L |
| 1,000 calls | ~800K | ~0.03 | ~0.01 kg | ~0.1 L |
| 1M calls | ~800M | ~29 | ~12 kg | ~100 L |
| 10M calls | ~8B | ~290 | ~120 kg | ~1,000 L |
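The rows above scale linearly with call volume, and the CO₂ column is the kWh column multiplied by the stated grid factor of 0.417 kg CO₂/kWh. The sketch below reproduces that scaling in Python. It takes the per-call energy directly from the table (about 29 kWh per 1M calls) rather than re-deriving it from throughput and GPU power, and it leaves out water because the page does not state a water-intensity factor. The function name `impact` is illustrative.

```python
# Figures read off the table and the stated assumptions above.
TOKENS_AVOIDED_PER_CALL = 800   # table: ~800 tokens avoided per call
KWH_PER_CALL = 29 / 1_000_000   # table: ~29 kWh saved per 1M calls
KG_CO2_PER_KWH = 0.417          # stated grid carbon intensity


def impact(calls: int) -> dict:
    """Scale the per-call estimates linearly to a given number of calls."""
    kwh = calls * KWH_PER_CALL
    return {
        "tokens_avoided": calls * TOKENS_AVOIDED_PER_CALL,
        "kwh_saved": kwh,
        "kg_co2_avoided": kwh * KG_CO2_PER_KWH,
    }


for calls in (1, 1_000, 1_000_000, 10_000_000):
    r = impact(calls)
    print(
        f"{calls:>10,} calls: {r['tokens_avoided']:,} tokens, "
        f"{r['kwh_saved']:.5f} kWh, {r['kg_co2_avoided']:.5f} kg CO2"
    )
```

At 1M calls this gives about 12.1 kg CO₂, which the table rounds to ~12 kg.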
Frequently Asked Questions
What is oracle recall?
Oracle recall is the fraction of lines containing the answer to a specific question that survive compression. 100% oracle recall means every answer-critical line is kept. This is the most important quality metric for context compression.
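As a minimal illustration of this definition, the Python sketch below computes oracle recall from two sets of line indices: the lines known to contain the answer and the lines the compressor kept. The function name and the example indices are made up for illustration; this is not the benchmark harness itself.

```python
from typing import Iterable


def oracle_recall(answer_lines: Iterable[int], kept_lines: Iterable[int]) -> float:
    """Fraction of answer-critical lines that are still present after compression."""
    answers = set(answer_lines)
    if not answers:
        raise ValueError("oracle recall is undefined without answer-critical lines")
    return len(answers & set(kept_lines)) / len(answers)


# Hypothetical example: four answer-critical lines, two of which were kept.
answer_lines = [12, 40, 41, 97]
kept_lines = [0, 1, 12, 40, 98, 99]
print(oracle_recall(answer_lines, kept_lines))  # 0.5
```

Note that keeping extra, irrelevant lines does not raise or lower the score; only the answer-critical lines count.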
How is SuperCompress different from truncation?
Truncation keeps only the head and tail of the context, dropping the middle. If the answer-critical line sits in the middle (which it often does), truncation loses it. SuperCompress scores every line against the question and keeps only the most relevant ones, regardless of position.
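The contrast can be seen in a toy Python example. Here, head-and-tail truncation is compared with a selector that scores each line against the question. The scorer uses plain word overlap as a stand-in; SuperCompress uses a small learned policy, which this sketch does not reproduce. All function names and the sample log are hypothetical.

```python
import re


def words(text):
    """Lowercased alphanumeric words in a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def truncate_head_tail(lines, keep):
    """Keep the first and last lines, dropping everything in the middle."""
    head = (keep + 1) // 2
    tail = keep - head
    return lines[:head] + (lines[-tail:] if tail else [])


def keep_most_relevant(lines, question, keep):
    """Keep the `keep` lines sharing the most words with the question, in original order."""
    q = words(question)
    ranked = sorted(range(len(lines)), key=lambda i: len(q & words(lines[i])), reverse=True)
    return [lines[i] for i in sorted(ranked[:keep])]


context = [
    "Session started at 09:00.",
    "Loaded configuration from defaults.",
    "Cache warmed successfully.",
    "The deploy failed because the database password expired.",
    "Retry scheduled for later.",
    "Health check passed.",
    "Session ended at 17:00.",
]
question = "What caused the deploy failure?"
answer_line = context[3]

print(answer_line in truncate_head_tail(context, 2))            # False
print(answer_line in keep_most_relevant(context, question, 2))  # True
```

With the same two-line allowance, truncation keeps only the session start and end, while the position-independent selector retains the line in the middle that actually answers the question.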
Does SuperCompress require GPU or extra LLM calls?
No. SuperCompress runs entirely on CPU with ~5K parameters and ~60 ms latency on benchmark seeds. It requires zero GPU time and zero extra LLM calls: it is a small learned policy that runs before the language model.
Why is there no compression budget?
Compiler mode is designed for individual API calls: it removes as much as it safely can, then reports tokens saved, important context kept, and verifier risk. Fixed-ratio mode remains only for legacy comparisons.
Can I run SuperCompress with any LLM?
Yes. SuperCompress is model-agnostic. It compresses the context before sending it to the language model, so it works with OpenAI, Anthropic, open-weight models, or any LLM that accepts text input.
Is there a hosted API?
Yes. The SuperCompress hosted API is available at supercompress.dev/api/v1/compress. Get a free API key from the dashboard to get started. The Python client library wraps both local and API modes.