Comparison guide
SuperCompress vs Headroom
Both tools compress context before LLM inference, but their architecture, query-awareness, and deployment story differ sharply. This comparison focuses on the architecture-level differences that matter for your stack rather than on marketing claims.
The fundamental difference: architecture
The most important difference between SuperCompress and Headroom is at the architecture level. SuperCompress uses a learned policy of approximately 5,000 parameters, small enough that loading it adds negligible overhead. Headroom uses a ModernBERT-based model (165M+ parameters) with ONNX Runtime for efficient local execution.
This difference in architecture cascades into every aspect of the two tools: installation complexity, deployment options, latency, query-awareness, and cold start behavior.
| Factor | SuperCompress | Headroom | Winner |
|---|---|---|---|
| Policy/model size | ~5K parameters | ~165M+ (ModernBERT) | SuperCompress (roughly 33,000x smaller) |
| Model download required | None — pip install is all | Yes — downloads ModernBERT weights from HuggingFace | SuperCompress |
| Runtime dependency | None — pure Python/numpy | ONNX Runtime required for acceptable performance | SuperCompress |
| Warmup required | No — ready instantly | Yes — recommended at provision time | SuperCompress |
| Query-awareness | True query-awareness — scores every line against the actual question | Content-type detection (JSON, code, text) — no per-query scoring | SuperCompress |
| Oracle recall | ~100% (the answer line is kept in nearly all cases) | Not guaranteed; content-type compression doesn't know what the question is | SuperCompress |
| Latency (cold start) | ~60ms | ~200ms+ (model load + warmup) | SuperCompress |
| Latency (warm) | ~60ms | ~100-200ms (ONNX Runtime inference) | SuperCompress |
| GPU required | No — CPU only | No (ONNX on CPU), but benefits from GPU | SuperCompress |
| Serverless compatible | Yes — Lambda, Cloudflare Workers, Edge | Difficult — 165M+ model too large for serverless | SuperCompress |
| Install size | ~200KB | ~500MB+ (model weights + ONNX runtime) | SuperCompress |
| Hosted API | Yes — free tier included | No — local-only | SuperCompress |
| License | MIT | Apache 2.0 | Tie — both permissive |
| GitHub Action / CI | Yes — CI pipeline integration | No | SuperCompress |
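As a rough sanity check on the size gap, the parameter counts quoted above can be turned into a ratio and a raw weight footprint. This is only a back-of-envelope sketch: the 4 bytes per parameter assumes unquantized float32 weights, which is an assumption for illustration, and either tool may store its weights differently.

```python
# Back-of-envelope size comparison using the parameter counts quoted above.
# Assumption: weights are stored as float32 (4 bytes each). Real packages may
# quantize weights or ship extra tokenizer/runtime files.

SUPERCOMPRESS_PARAMS = 5_000       # "~5K parameters"
HEADROOM_PARAMS = 165_000_000      # "165M+ parameters" (lower bound)
BYTES_PER_PARAM = 4                # float32, assumed


def weight_bytes(n_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Raw size of the weights alone, ignoring any file-format overhead."""
    return n_params * bytes_per_param


ratio = HEADROOM_PARAMS // SUPERCOMPRESS_PARAMS
small_kb = weight_bytes(SUPERCOMPRESS_PARAMS) / 1_000
large_mb = weight_bytes(HEADROOM_PARAMS) / 1_000_000

print(f"parameter ratio: {ratio:,}x")
print(f"SuperCompress weights: ~{small_kb:.0f} KB")
print(f"Headroom weights: ~{large_mb:.0f} MB")
```

Under these assumptions the ratio is about 33,000x, with roughly 20 KB of weights on one side and several hundred megabytes on the other.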
Query-awareness: the architectural advantage
This difference has the most direct effect on what ends up in the prompt, so it deserves its own section.
Headroom detects the type of content — JSON, code (AST), or plain text — and applies a compressor optimized for that type. It does not consider the user's question when deciding what to keep. A JSON compressor compresses JSON regardless of whether the JSON fields are relevant to the current query.
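To make the content-type approach concrete, here is a minimal sketch of type detection. It is not Headroom's code: `detect_content_type` and its heuristics are hypothetical and use only the Python standard library. The point to notice is that the user's question never enters the decision.

```python
import ast
import json


def detect_content_type(blob: str) -> str:
    """Classify a context blob as 'json', 'code', or 'text' (toy heuristic)."""
    try:
        parsed = json.loads(blob)
        if isinstance(parsed, (dict, list)):
            return "json"
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    try:
        tree = ast.parse(blob)
        # Require at least one statement that is not a bare expression, so a
        # single word such as "hello" is not mistaken for Python code.
        if any(not isinstance(node, ast.Expr) for node in tree.body):
            return "code"
    except (SyntaxError, ValueError):
        pass
    return "text"


samples = [
    '{"plan": "pro", "seats": 5}',
    "def f(x):\n    return x + 1\n",
    "The invoice is due on the first of each month.",
]
for sample in samples:
    # A real system would now dispatch to a type-specific compressor.
    print(detect_content_type(sample))
```

A dispatcher built on this would pick a compressor per type, which is why the same JSON blob is compressed the same way for every question.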
SuperCompress scores every line of context against the actual question the user is asking. A line about account settings is kept when the question is about billing, because the account type determines billing options. The same line is removed when the question is about feature requests, because account type is irrelevant.
This matters because, in practice, 60-80% of context tokens are irrelevant to any specific question. Query-aware compression targets that irrelevant 60-80%. Content-type compression removes content based on structure alone, so it may keep irrelevant content and drop relevant content.
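The idea of scoring each line against the question can be illustrated with a toy example. This sketch uses plain word overlap as a stand-in for the learned policy, so it is not how SuperCompress scores lines; `score_line`, `compress_lines`, and the threshold are hypothetical. Word overlap also cannot capture indirect relevance (such as account type affecting billing options), which is where a learned policy would be needed.

```python
import re

# Small illustrative stopword list; a real system would not rely on this.
STOPWORDS = {"a", "an", "the", "is", "are", "my", "of", "to", "for",
             "on", "in", "what", "how", "do", "i"}


def tokens(text: str) -> set:
    """Lowercased alphanumeric words with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def score_line(line: str, query: str) -> float:
    """Fraction of query terms that appear in the line (0.0 to 1.0)."""
    q = tokens(query)
    if not q:
        return 0.0
    return len(q & tokens(line)) / len(q)


def compress_lines(context: str, query: str, threshold: float = 0.3) -> str:
    """Keep only the lines whose score reaches the threshold."""
    kept = [ln for ln in context.splitlines()
            if score_line(ln, query) >= threshold]
    return "\n".join(kept)


context = "\n".join([
    "Account type: business, billed annually.",
    "Feature request #12: dark mode for the dashboard.",
    "Support hours are 9am to 5pm on weekdays.",
])

print(compress_lines(context, "How is my account billed?"))
print(compress_lines(context, "Which feature request mentions dark mode?"))
```

The same three-line context yields a different kept line for each question, which is the behavior a content-type compressor cannot reproduce.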
The cold start problem
Headroom's ModernBERT model needs to be loaded into memory before it can process prompts. The recommended setup warms up the model and tokenizer at provision time. This adds cold start latency — problematic in serverless environments where instances are created on demand.
SuperCompress has zero warmup. The ~5K-parameter policy is embedded in the Python package. The first call to Compressor.compress() runs at full speed (~60ms), with no model loading, no tokenizer initialization, and no ONNX Runtime setup.
This makes SuperCompress ideal for serverless functions (AWS Lambda, Cloudflare Workers, Vercel Edge Functions) where cold starts are frequent and the execution environment is recycled between invocations.
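One way to check cold start behavior for any compressor is to time the first call separately from later calls. The harness below is a sketch: `first_vs_warm_ms` and `LazyCompressor` are hypothetical names, and the 50 ms sleep only simulates a one-time model load. It is not a measurement of either tool.

```python
import statistics
import time
from typing import Callable


def first_vs_warm_ms(fn: Callable[[], object], warm_runs: int = 5) -> tuple:
    """Return (first_call_ms, median_warm_ms) for a zero-argument callable."""
    start = time.perf_counter()
    fn()
    first_ms = (time.perf_counter() - start) * 1000

    warm = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        fn()
        warm.append((time.perf_counter() - start) * 1000)
    return first_ms, statistics.median(warm)


class LazyCompressor:
    """Stand-in for a compressor that loads weights on first use."""

    def __init__(self, load_seconds: float = 0.05):
        self._load_seconds = load_seconds
        self._loaded = False

    def compress(self, context: str, query: str) -> str:
        if not self._loaded:
            time.sleep(self._load_seconds)  # simulated one-time model load
            self._loaded = True
        return context  # placeholder: no real compression happens here


comp = LazyCompressor()
first_ms, warm_ms = first_vs_warm_ms(
    lambda: comp.compress("some context", "some query")
)
print(f"first call: {first_ms:.1f} ms, warm median: {warm_ms:.3f} ms")
```

To compare real tools, replace the lambda with a call into the library under test and run the script in a fresh process each time, since a warm process hides the load cost.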
Deployment comparison
Here is how each tool integrates into different deployment environments:
| Deployment | SuperCompress | Headroom |
|---|---|---|
| AWS Lambda | ✓ ~200KB package, ~60ms execution | ✗ ~500MB+ install exceeds Lambda's 250 MB unzipped package limit; needs a container image and pays the model load on every cold start |
| Cloudflare Workers | ✓ Pure Python, minimal memory | ✗ ONNX Runtime not available in Workers |
| Vercel Edge | ✓ Fits within edge runtime limits | ✗ Too large for edge |
| Docker / K8s | ✓ ~150MB image | ✓ ~1GB+ image with ONNX and model weights |
| PyPI install | ✓ pip install supercompress | ✓ pip install headroom-ai[all] (~500MB download) |
| Hosted API | ✓ supercompress.dev with free tier | ✗ Local-only |
| CI/CD pipelines | ✓ GitHub Action built-in | ✗ Manual setup required |
When to choose each
Choose SuperCompress when:
- You want a simple, zero-dependency install — pip install and 3 lines of code. No model downloads, no ONNX runtime, no tokenizer.
- You need query-aware compression — the answer line must be kept for the specific question being asked, not just the content type.
- You deploy to serverless / edge — Lambda, Cloudflare Workers, Vercel Edge Functions. SuperCompress fits where Headroom's 165M+ model does not.
- You want a hosted API option — use the API when you don't want to run anything locally, or the local library when you do.
- You need CI/CD integration — the GitHub Action checks prompt costs on every PR.
- You want MIT license — slightly more permissive for commercial embedding.
Choose Headroom when:
- You need content-type-specific compression — JSON compression, AST-aware code compression, text compression handled by different specialized models.
- You want proxy mode — zero code changes by routing traffic through a local proxy.
- You need agent wrapping — auto-wrap Claude Code, Cursor, Aider, Cline without manual configuration.
- You want reversible compression (CCR) — originals cached locally, retrievable via tool call.
- You need cross-agent memory — shared context store across different AI agents.
- You need MCP server integration — Model Context Protocol support.
Integration complexity comparison
# SuperCompress: install, then use in 3 lines of Python
# Shell: pip install supercompress
from supercompress import Compressor
comp = Compressor()
result = comp.compress(context, query)
# No model download. No ONNX. No tokenizer. No warmup.
# First call: ~60ms. Every call: ~60ms.

# Headroom: requires model download
# Shell: pip install "headroom-ai[all]"  (~500MB download including model weights)
from headroom import compress
result = compress(messages, model="auto")
# Downloads ModernBERT weights on first use
# Requires ONNX Runtime for acceptable performance
# Recommended to warm up at provision time
Cost comparison at scale
Both tools reduce LLM token costs, but SuperCompress also reduces infrastructure costs:
| Cost Factor | SuperCompress | Headroom |
|---|---|---|
| Compute per compression | ~60ms CPU (negligible) | ~100-200ms CPU with ONNX |
| Memory per instance | ~50MB | ~500MB+ (model + runtime) |
| Storage for install | ~200KB | ~500MB+ |
| Serverless cost per 1K compressions | ~$0.0001 | Not feasible — model too large |
| Cold start penalty | None (~60ms always) | ~200ms+ model load time |
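The serverless figure in the table can be roughly reproduced with a short calculation. The prices below are assumptions for illustration (an on-demand compute price per GB-second and a per-request fee in the style of AWS Lambda); check current pricing for your provider, region, and architecture. The 128 MB memory setting is also an assumption, and `lambda_cost_usd` is a hypothetical helper.

```python
# Assumed prices in USD; verify against your provider's current price list.
PRICE_PER_GB_SECOND = 0.0000166667     # assumed compute price
PRICE_PER_REQUEST = 0.20 / 1_000_000   # assumed per-request fee


def lambda_cost_usd(invocations: int, duration_ms: float, memory_mb: int,
                    include_requests: bool = True) -> float:
    """Estimate serverless cost from duration, memory, and invocation count."""
    gb_seconds = invocations * (duration_ms / 1000) * (memory_mb / 1024)
    cost = gb_seconds * PRICE_PER_GB_SECOND
    if include_requests:
        cost += invocations * PRICE_PER_REQUEST
    return cost


# 1,000 compressions at ~60 ms each on an assumed 128 MB function.
compute_only = lambda_cost_usd(1_000, 60, 128, include_requests=False)
with_requests = lambda_cost_usd(1_000, 60, 128)

print(f"compute only:  ${compute_only:.6f} per 1K compressions")
print(f"with requests: ${with_requests:.6f} per 1K compressions")
```

Under these assumptions the compute portion comes to about $0.000125 per 1K compressions, in line with the table's ~$0.0001, while the per-request fee adds roughly another $0.0002.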
Try SuperCompress in your browser
No install. No model download. Open the playground, paste a long context, and see what SuperCompress keeps and removes. Runs entirely in your browser.