How a ~5,000-parameter query-aware policy running on CPU removes ~65% of prompt tokens before inference while preserving answer-critical evidence — a technical deep dive.
SuperCompress is a query-aware context compression engine. It sits between your application and the LLM, analyzing every line of the prompt against the user's question and removing content that scores low on relevance.
The key design choices:
The system is designed for serverless deployment — cold start is ~60ms, making it suitable for AWS Lambda, Cloudflare Workers, and Vercel Edge Functions.
The input context is split into semantic segments. Each segment is a coherent unit — a paragraph, a code block, a list item, or a section of a log file. The segmenter uses heuristics (blank lines, indentation, markdown headers, code fence boundaries) to identify natural breaks.
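A minimal sketch of this kind of heuristic segmenter, assuming only three of the listed signals (blank lines, markdown headers, and code fence boundaries). The function name `split_segments` and its rules are illustrative assumptions, not SuperCompress's actual implementation, and indentation handling is left out for brevity.

```python
import re

# Illustrative patterns (assumptions): a fence is three backticks or tildes,
# a header is one to six '#' characters followed by whitespace.
FENCE = re.compile(r"^\s*(`{3}|~{3})")
HEADER = re.compile(r"^#{1,6}\s")


def split_segments(text: str) -> list[str]:
    """Split text at blank lines, markdown headers, and code fences."""
    segments: list[str] = []
    current: list[str] = []
    in_fence = False

    def flush() -> None:
        if current:
            segments.append("\n".join(current))
            current.clear()

    for line in text.splitlines():
        if FENCE.match(line):
            if in_fence:
                current.append(line)  # closing fence ends the code segment
                flush()
                in_fence = False
            else:
                flush()               # opening fence starts a new segment
                current.append(line)
                in_fence = True
        elif in_fence:
            current.append(line)      # keep code intact, blank lines included
        elif not line.strip():
            flush()                   # a blank line closes the current segment
        elif HEADER.match(line):
            flush()                   # a header starts a new segment
            current.append(line)
        else:
            current.append(line)
    flush()
    return segments
```

Treating a fenced block as one indivisible segment means a code sample is kept or dropped as a whole, which avoids handing the LLM half of a function.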
Each segment is converted into a feature vector that captures:
The learned policy takes the feature vector and outputs a relevance score (0–1) for each segment. A segment is kept only if its score clears a threshold, and that threshold is adaptive: the policy adjusts its cutoff based on the distribution of scores and the target compression ratio.
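One way to picture an adaptive cutoff is as a quantile of the observed scores, capped so that confident segments are never dropped. The sketch below assumes the target is expressed as a fraction of segments to drop and that a segment is kept when its score is at or above the cutoff. The function name `adaptive_threshold` and the `keep_floor` parameter are hypothetical and are not taken from SuperCompress.

```python
def adaptive_threshold(
    scores: list[float],
    target_drop: float,
    keep_floor: float = 0.5,
) -> float:
    """Pick a cutoff so roughly `target_drop` of the segments fall below it.

    The cutoff is never raised above `keep_floor` (an assumed safety cap),
    so a segment scoring at or above that value is always kept.
    """
    if not scores:
        return 0.0
    ranked = sorted(scores)
    k = int(len(ranked) * target_drop)  # number of segments we would like to drop
    if k == 0:
        return 0.0                      # nothing to drop, keep everything
    quantile_cutoff = ranked[min(k, len(ranked) - 1)]
    return min(quantile_cutoff, keep_floor)


# Mixed scores: the two weakest segments fall below the cutoff.
print(adaptive_threshold([0.1, 0.2, 0.3, 0.9], target_drop=0.5))  # 0.3
# Uniformly relevant context: the cap wins and every segment is kept.
print(adaptive_threshold([0.6, 0.7, 0.8, 0.9], target_drop=0.5))  # 0.5
```

The second call shows why the cutoff depends on the score distribution: when everything looks relevant, the target ratio is missed rather than forcing useful segments out.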
Segments with relevance scores above the threshold are assembled back into compressed context, preserving their original order. The assembly step also handles edge cases: keeping context boundaries legible, re-inserting necessary transition text, and ensuring the output is valid for the downstream LLM.
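The assembly step can be sketched as an order-preserving filter. The version below assumes that a gap marker is enough to keep boundaries legible where segments were removed. The `gap_marker` convention and the function name `assemble` are illustrative assumptions; the actual transition text SuperCompress inserts is not described here.

```python
def assemble(
    segments: list[str],
    scores: list[float],
    threshold: float,
    gap_marker: str = "[...]",
) -> str:
    """Join the kept segments in their original order.

    A gap marker (hypothetical convention) is placed wherever one or more
    segments were dropped between two kept segments.
    """
    if len(segments) != len(scores):
        raise ValueError("segments and scores must have the same length")

    kept: list[str] = []
    previous_kept = True
    for segment, score in zip(segments, scores):
        if score >= threshold:
            if not previous_kept and kept:
                kept.append(gap_marker)  # signal that content was removed here
            kept.append(segment)
            previous_kept = True
        else:
            previous_kept = False
    return "\n\n".join(kept)


print(assemble(["a", "b", "c", "d"], [0.9, 0.1, 0.8, 0.7], threshold=0.5))
```

Here segment `b` is dropped, so the output contains `a`, the marker, `c`, and `d`, each separated by a blank line.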
SuperCompress targets 97%+ oracle recall, meaning that, measured against the full original context, the compressed prompt still contains virtually all the information needed to answer the query. This is achieved by keeping the scorer conservative: when uncertain, it keeps the segment rather than dropping it.
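To make the metric concrete, the sketch below assumes oracle recall is computed per example as the share of answer-critical segments that survive compression. That definition, the `margin` used to model the conservative behaviour, and both function names are assumptions for illustration only.

```python
def oracle_recall(kept_ids: set[int], oracle_ids: set[int]) -> float:
    """Fraction of answer-critical (oracle) segments that survived compression."""
    if not oracle_ids:
        return 1.0  # nothing was required, so nothing could be lost
    return len(kept_ids & oracle_ids) / len(oracle_ids)


def keep_segment(score: float, threshold: float, margin: float = 0.1) -> bool:
    """Conservative decision: scores within `margin` below the cutoff are kept."""
    return score >= threshold - margin


# Both oracle segments (2 and 3) were kept, so recall is 1.0.
print(oracle_recall(kept_ids={0, 2, 3}, oracle_ids={2, 3}))  # 1.0
# A borderline score just under the cutoff is still kept.
print(keep_segment(score=0.45, threshold=0.5))  # True
```

Biasing borderline decisions toward keeping trades a little compression for recall, which matches the stated priority of preserving answer-critical evidence.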
The policy is a small feed-forward neural network with approximately 5,000 parameters. To put that in context:
| Model | Parameters | Relative size |
|---|---|---|
| SuperCompress policy | ~5K | 1× |
| Headroom (ModernBERT) | 165M+ | 33,000× |
| GPT-2 Small | 124M | 24,800× |
| BERT-Base | 110M | 22,000× |
| MiniLM-L6 | 22.7M | 4,540× |
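For a sense of how small ~5K parameters is, the snippet below counts weights and biases for a fully connected network. The layer sizes are hypothetical, chosen only because they land near 5K; the real architecture is not specified here. The last line reproduces the relative-size arithmetic used in the table.

```python
def mlp_param_count(layer_sizes: list[int]) -> int:
    """Weights plus biases for a fully connected feed-forward network."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical shape: 48 input features, two hidden layers, one output score.
hypothetical_layers = [48, 64, 32, 1]
print(mlp_param_count(hypothetical_layers))  # 5249

# Relative size as in the table: 165M parameters versus ~5K.
print(165_000_000 // 5_000)  # 33000
```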
The small size is intentional. It means:
The policy was trained on a dataset of ~100K (context, query, keep/drop) examples generated from real production prompts across customer support, code generation, data extraction, and Q&A tasks. Training used a contrastive loss that penalizes dropping segments that contain information needed to answer the query.
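The exact loss is not given above. As a rough illustration of the idea, the sketch below uses a pairwise margin loss, one common contrastive-style formulation: within a single (context, query) example, every segment labelled keep should outscore every segment labelled drop by at least a margin. The function name, the `margin` value, and the choice of this particular loss are assumptions, not a description of how SuperCompress was trained.

```python
def pairwise_margin_loss(
    keep_scores: list[float],
    drop_scores: list[float],
    margin: float = 0.2,
) -> float:
    """Mean hinge penalty over all (keep, drop) score pairs.

    A pair contributes zero once the keep segment outscores the drop
    segment by at least `margin`. Otherwise the shortfall is penalized.
    """
    penalties = [
        max(0.0, margin - (keep - drop))
        for keep in keep_scores
        for drop in drop_scores
    ]
    return sum(penalties) / len(penalties) if penalties else 0.0


# Well separated: the needed segment scores far above the filler, loss is 0.
print(pairwise_margin_loss([0.9], [0.1]))  # 0.0
# Inverted: the needed segment scores below the filler, so the loss is positive.
print(pairwise_margin_loss([0.4], [0.5]) > 0)  # True
```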
In addition to the basic scoring policy, SuperCompress offers a compiler mode that applies a second-pass optimization. The compiler:
On the bundled long-context presets, compiler mode achieves 82.5% average token savings with low-risk verification — compared to ~65% for the basic scoring policy alone.
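The gap between the two figures is larger than it looks, because what matters for cost is the number of tokens that remain. A quick calculation for a hypothetical 10,000-token prompt (the prompt size is an assumption chosen for round numbers):

```python
def tokens_after(original_tokens: int, savings: float) -> int:
    """Tokens remaining after removing a `savings` fraction of the prompt."""
    return round(original_tokens * (1 - savings))


prompt_tokens = 10_000  # hypothetical prompt size
print(tokens_after(prompt_tokens, 0.65))   # 3500 with the basic scoring policy
print(tokens_after(prompt_tokens, 0.825))  # 1750 with compiler mode
```

At these averages, compiler mode leaves half as many tokens as the basic policy.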
The verifier is a lightweight post-processing step that checks the compressed output for potential information loss. It answers two questions:
The verifier produces a risk score (low, medium, high) for each compression. Applications can use this score to decide whether to use the compressed output, fall back to the original, or route to a more expensive model for verification.
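A sketch of how an application might act on the risk score. The mapping used here (low uses the compressed prompt, medium uses it but escalates to a stronger model, high falls back to the original) is one possible policy, and the `Risk` enum and `choose_prompt` function are hypothetical names rather than part of the SuperCompress API.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def choose_prompt(original: str, compressed: str, risk: Risk) -> tuple[str, bool]:
    """Return (prompt to send, whether to escalate to a more expensive model)."""
    if risk is Risk.LOW:
        return compressed, False  # trust the compression
    if risk is Risk.MEDIUM:
        return compressed, True   # keep the savings, but verify with a stronger model
    return original, False        # high risk: fall back to the full context


prompt, escalate = choose_prompt("full context", "short context", Risk.HIGH)
print(prompt, escalate)  # full context False
```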
In practice, more than 95% of compressions across the benchmark suite receive a low-risk score. The high-risk cases are typically very short prompts, where compression has little room to operate.
SuperCompress ships in three forms:
| Form | Use case | Latency | Setup |
|---|---|---|---|
| Python library | Local in-process compression | ~15ms | `pip install supercompress` |
| Hosted API | Zero-infrastructure compression | ~5ms + network | Get a key at /dashboard |
| GitHub Action | CI/CD pipeline compression | n/a | `uses: arjunkshah/supercompress/.github/actions/...` |
All three forms share the same core policy and produce identical compression results for the same inputs. The hosted API adds automatic scaling, rate limiting, and usage analytics through the dashboard.
For framework-specific setup, see the integration guides for FastAPI, Flask, Django, Express.js, and Spring Boot.
SuperCompress's architecture differs fundamentally from other prompt compression approaches. Here's how:
| Method | Approach | Query-aware | Model size | Latency |
|---|---|---|---|---|
| SuperCompress | Learned scoring policy | Yes | ~5K params | ~15ms |
| Headroom | Transformer classifier | Type-only | 165M+ | ~100ms + warmup |
| Sliding window | Fixed-position truncation | No | 0 | ~0ms |
| Top-K retrieval | Embedding similarity | Partial | Embedding model | ~50ms |
| HyDE | Hypothetical doc + retrieval | Yes | LLM + embedding | ~500ms+ |
For a comprehensive comparison, see the alternatives hub and the full guide.