A side-by-side comparison of every major prompt compression method — from lightweight scoring to full-model approaches — with benchmarks, trade-offs, and deployment guidance.
| Method | Approach | Query-Aware | Model Size | Latency | Savings |
|---|---|---|---|---|---|
| SuperCompress | Learned policy (5K params) | Yes | ~200KB | ~15ms | ~65% |
| Headroom | ModernBERT (165M+ params) | No (type-aware only) | ~500MB | ~100ms + warmup | ~50% |
| Sliding Window | Keep last N tokens | No | 0 (rule-based) | ~0ms | Fixed 50% |
| Top-K Retrieval | Embedding similarity | Partial | Embedding model | ~50ms | Variable |
| HyDE | Hypothetical doc + retrieval | Yes | LLM + embedding | ~500ms+ | Variable |
| MMR | Diversity sampling | No | 0 (algorithmic) | ~20ms | Variable |
| Self-RAG | On-demand LLM retrieval decisions | Yes | LLM | ~1s+ | Variable |
Headroom uses a 165M+ parameter ModernBERT model with ONNX Runtime. It classifies prompt content by type (JSON, code, text) but does not score against the user's query. SuperCompress is 25,000x smaller and query-aware.
5K params, query-aware 165M+ params, type-onlySliding window keeps the last N tokens. It's fast and simple but drops older context that may be critical. SuperCompress keeps relevant context regardless of position.
Position-independent Always drops tailTop-K uses embedding similarity to find relevant passages. It works well for RAG but misses non-vector context like instructions and conversation history.
Full prompt analysis Embedding-dependentHyDE generates a hypothetical answer before retrieving. It adds an entire LLM call to the critical path, making it expensive for high-throughput systems.
~15ms, no extra LLM call 2x+ LLM costMMR trades relevance for diversity. It prevents redundancy but may drop the single most relevant passage. SuperCompress optimizes for the user's specific question.
Query-optimized Diversity-optimizedSelf-RAG lets the LLM decide when to retrieve. It's the most flexible but most expensive approach, requiring the LLM to participate in retrieval decisions.
Deterministic, fast LLM-participating, slowpip install supercompress is fully self-contained. No downloading ModernBERT weights, ONNX Runtime, or ~500MB of dependencies.from supercompress import compressNot always. Headroom's proxy mode and MCP integration make it easier to add to existing applications without code changes. SuperCompress requires adding a library import or API call. However, in terms of compression quality, deployment footprint, and latency at scale, SuperCompress has clear architectural advantages — see the full comparison for details.
Yes. SuperCompress's query-aware scoring complements other methods. For example, you can apply SuperCompress before a sliding window as a safety net, or use MMR for candidate selection and SuperCompress for final compression. See the guide for advanced patterns.
Sliding window has zero latency (simple array slice). SuperCompress adds ~15ms. Top-K/MMR add ~20-50ms. Headroom adds ~100ms + cold-start warmup. HyDE and Self-RAG add 500ms+ because they involve additional LLM calls.