LLM Context Compression Benchmarks

Compiler-mode savings for real API-call behavior, plus fixed-ratio baselines for research comparison.

Policy Comparison at 35% Token Budget

| Policy | Oracle Recall | Entity Recall | Latency | KV Savings | Model Size |
|---|---|---|---|---|---|
| FIFO / Truncation | 25% | 73% | ~57 ms | ~65% | 0 (rule-based) |
| Summarization | 61% | 65% | ~63 ms | ~65% | LLM call |
| H2O (Heavy Hitter Oracle) | 98% | 73% | ~56 ms | ~65% | attention-based |
| SuperCompress (best) | 100% | 73% | ~60 ms | ~65% | ~5K params |

Legacy baseline on 8 project seeds. Oracle recall = answer-critical lines preserved. Entity recall = named entities retained.

Compiler Mode — Real API-Call Behavior

Compiler mode does not ask users for a budget. It removes the most tokens it can while preserving query-critical evidence and returning verifier metadata: important context kept, risk, kept blocks, and dropped blocks.

| Context Type | Original Tokens | After Compression | Tokens Removed | Savings |
|---|---|---|---|---|
| To Kill a Mockingbird study context | 1,454 | 611 | 843 | 58.0% |
| Long coding session log | 1,020 | 76 | 944 | 92.5% |
| Markdown documentation | 1,195 | 85 | 1,110 | 92.9% |
| Agent incident log | 1,074 | 56 | 1,018 | 94.8% |
| Average | 1,186 | 207 | 979 | 82.5% |
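The savings column can be reproduced from the token counts alone. The short sketch below recomputes each row and shows that the 82.5% average matches the pooled figure (total tokens removed over total original tokens), not the mean of the four row percentages, which would be roughly 84.5%. The function and variable names are illustrative and not part of SuperCompress.

```python
# Token counts (original, after compression) copied from the table above.
contexts = {
    "To Kill a Mockingbird study context": (1454, 611),
    "Long coding session log": (1020, 76),
    "Markdown documentation": (1195, 85),
    "Agent incident log": (1074, 56),
}


def savings_pct(original: int, compressed: int) -> float:
    """Percentage of tokens removed relative to the original count."""
    return 100.0 * (original - compressed) / original


# Per-row savings, e.g. 58.0% for the first context.
per_context = {name: savings_pct(o, c) for name, (o, c) in contexts.items()}

# Pooled savings: sum the tokens first, then take the ratio.
total_original = sum(o for o, _ in contexts.values())
total_compressed = sum(c for _, c in contexts.values())
pooled = savings_pct(total_original, total_compressed)

# Unweighted mean of the row percentages, shown for comparison only.
mean_of_rows = sum(per_context.values()) / len(per_context)

for name, pct in per_context.items():
    print(f"{name}: {pct:.1f}%")
print(f"Pooled savings: {pooled:.1f}%")
```

The pooled figure weights longer contexts more heavily, which is why the 1,454-token study context pulls the average below the mean of the row percentages.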

Visual Benchmarks

Figure: Legacy fixed-ratio oracle recall baseline. SuperCompress 100%, H2O about 98%, FIFO and truncation about 25%.
Figure: Compiler mode token savings on real-world contexts (long-context presets).

Environmental Impact at Scale

Estimates are based on the documented SuperCompress assumptions: 2,500 tokens per GPU-second, a 150 W GPU, a 55% KV share, and 0.417 kg CO₂ per kWh.

| Scale | Tokens Avoided | kWh Saved | CO₂ Avoided | Water Saved (est.) |
|---|---|---|---|---|
| 1 model call | ~800 | ~0.00003 | ~0.00001 kg | ~0.0001 L |
| 1,000 calls | ~800K | ~0.03 | ~0.01 kg | ~0.1 L |
| 1M calls | ~800M | ~29 | ~12 kg | ~100 L |
| 10M calls | ~8B | ~290 | ~120 kg | ~1,000 L |
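As a quick check on the CO₂ column, the sketch below applies the stated grid intensity of 0.417 kg CO₂ per kWh to the kWh figures in the table. The kWh values are taken from the table as given; this sketch does not attempt to re-derive them from the throughput and power assumptions, and the helper name is illustrative.

```python
CO2_KG_PER_KWH = 0.417  # grid intensity stated in the assumptions above


def co2_avoided_kg(kwh_saved: float) -> float:
    """Convert energy saved (kWh) into CO2 avoided (kg)."""
    return kwh_saved * CO2_KG_PER_KWH


# Approximate kWh figures copied from the table, not re-derived here.
kwh_by_scale = {
    "1,000 calls": 0.03,
    "1M calls": 29.0,
    "10M calls": 290.0,
}

co2_by_scale = {scale: co2_avoided_kg(kwh) for scale, kwh in kwh_by_scale.items()}

for scale, kg in co2_by_scale.items():
    print(f"{scale}: ~{kg:.2f} kg CO2 avoided")
```

The results land close to the rounded table entries of about 0.01 kg, 12 kg, and 120 kg.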

Full methodology: Environment guide.

Frequently Asked Questions

What is oracle recall?

Oracle recall measures how many of the lines that contain the answer to a specific question are preserved after compression. 100% oracle recall means every answer-critical line is kept. These benchmarks treat it as the primary quality metric, because the model cannot recover an answer that compression has already dropped.
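A minimal sketch of the metric, assuming answer-critical lines and kept lines are both identified by line index. The function name and the convention of returning 1.0 when no line is marked critical are assumptions made for illustration, not the benchmark's actual implementation.

```python
from __future__ import annotations


def oracle_recall(critical: set[int], kept: set[int]) -> float:
    """Fraction of answer-critical line indices that survive compression."""
    if not critical:
        # Assumed convention: nothing to lose, so recall is perfect.
        return 1.0
    return len(critical & kept) / len(critical)


# Lines 3 and 7 contain the answer; the compressor kept lines 0, 1, 3 and 9.
partial = oracle_recall({3, 7}, {0, 1, 3, 9})

# Both answer-critical lines kept.
full = oracle_recall({3, 7}, {3, 7})

print(partial, full)
```

Entity recall can be computed the same way by replacing line indices with the set of named entities found before and after compression.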

How is SuperCompress different from truncation?

Truncation keeps only the head and tail of the context, dropping the middle. If the answer-critical line sits in the middle (which it often does), truncation loses it. SuperCompress scores every line against the question and keeps only the most relevant ones, regardless of position.
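The difference can be illustrated with a toy comparison. The sketch below uses simple keyword overlap as a stand-in for relevance scoring; SuperCompress uses a small learned policy, so this shows only the position-independent selection idea, not the actual scoring method. All function names and the sample context are hypothetical.

```python
import re


def head_tail_truncate(lines, keep):
    """Return indices of the first and last lines, dropping the middle."""
    if keep >= len(lines):
        return list(range(len(lines)))
    head = (keep + 1) // 2
    tail = keep - head
    return list(range(head)) + list(range(len(lines) - tail, len(lines)))


def words(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_select(lines, question, keep):
    """Return indices of the lines sharing the most words with the question."""
    q = words(question)
    # Sort by descending overlap; ties go to the earlier line.
    ranked = sorted(range(len(lines)), key=lambda i: (-len(words(lines[i]) & q), i))
    return sorted(ranked[:keep])


context = [
    "Session started at 09:00.",
    "User opened the billing dashboard.",
    "The deploy failed because the API key expired.",
    "User refreshed the page twice.",
    "Session ended at 09:45.",
]
question = "Why did the deploy fail?"

truncated = head_tail_truncate(context, keep=2)
selected = overlap_select(context, question, keep=2)

print("truncation keeps:", truncated)
print("scoring keeps:   ", selected)
```

In this example the answer sits at index 2, in the middle of the context. Head-and-tail truncation drops it, while the scoring approach keeps it because selection depends on content rather than position.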

Does SuperCompress require GPU or extra LLM calls?

No. SuperCompress runs entirely on CPU with ~5K parameters and ~60 ms latency on the benchmark seeds. It needs no GPU time and no extra LLM calls: it is a small learned policy that runs before the language model.

Why is there no compression budget?

Compiler mode is designed for individual API calls: it removes as much as it safely can, then reports tokens saved, important context kept, and verifier risk. Fixed-ratio mode remains only for legacy comparisons.

Can I run SuperCompress with any LLM?

Yes. SuperCompress is model-agnostic. It compresses the context before sending it to the language model, so it works with OpenAI, Anthropic, open-weight models, or any LLM that accepts text input.
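Because compression happens before the model call, it can be wired in as a plain preprocessing step. The sketch below shows that pattern with placeholder callables; `compress` and `call_llm` are hypothetical stand-ins, not the SuperCompress client or any provider SDK, and would be swapped for real implementations.

```python
from typing import Callable


def answer_with_compression(
    context: str,
    question: str,
    compress: Callable[[str, str], str],
    call_llm: Callable[[str], str],
) -> str:
    """Compress the context first, then send the shorter prompt to any LLM."""
    compressed = compress(context, question)
    prompt = f"{compressed}\n\nQuestion: {question}"
    return call_llm(prompt)


def demo_compress(context: str, question: str) -> str:
    # Placeholder compressor: keep only lines that mention "deploy".
    return "\n".join(line for line in context.splitlines() if "deploy" in line)


def echo_llm(prompt: str) -> str:
    # Placeholder model: returns the prompt it received, for inspection.
    return prompt


result = answer_with_compression(
    "a\ndeploy failed\nb", "Why?", demo_compress, echo_llm
)
print(result)
```

Only the `call_llm` callable knows about the provider, so the same compression step can sit in front of any text-in, text-out model.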

Is there a hosted API?

Yes. The SuperCompress hosted API is available at supercompress.dev/api/v1/compress. Get a free API key from the dashboard to get started. The Python client library wraps both local and API modes.

Try SuperCompress on your own context

Paste a long prompt into the Playground to see exactly how much can be removed while keeping what matters.
