SuperCompress Architecture

Last updated July 3, 2026 · 10 min read

How a ~5,000-parameter query-aware policy running on CPU removes ~65% of prompt tokens before inference while preserving answer-critical evidence — a technical deep dive.

1. System Overview
2. The Compression Pipeline
3. The Learned Policy
4. Compiler Mode
5. The Verifier
6. Deployment Architecture
7. Comparison with Alternatives

1. System Overview

SuperCompress is a query-aware context compression engine. It sits between your application and the LLM, scoring each segment of the prompt against the user's question and removing segments that score low on relevance.

Context + Query
↓
Segmenter → Scorer (~5K params) → Compiler → Verifier
↓
Compressed Context → LLM

The key design choices:

~5K parameters — a tiny feed-forward network that runs on CPU in ~15ms. No GPU needed.
Query-aware — scores each passage against the user's specific question, not just the content type.
No model download — the policy is bundled in the pip package (~200KB total).
Deterministic given inputs — same context + query always produces the same compression. Predictable and auditable.

The system is designed for serverless deployment — cold start is ~60ms, making it suitable for AWS Lambda, Cloudflare Workers, and Vercel Edge Functions.

2. The Compression Pipeline

Step 1: Segmentation

The input context is split into semantic segments. Each segment is a coherent unit — a paragraph, a code block, a list item, or a section of a log file. The segmenter uses heuristics (blank lines, indentation, markdown headers, code fence boundaries) to identify natural breaks.
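The heuristics above can be illustrated with a minimal sketch. It is not the library's segmenter: the function name `segment` is hypothetical, and the sketch covers only blank lines, markdown headers, and code fences (indentation is left out for brevity).

```python
import re

FENCE = "`" * 3  # markdown code-fence marker


def segment(context: str) -> list[str]:
    """Split text into segments on blank lines, markdown headers, and code fences.

    Hypothetical helper for illustration; not the SuperCompress API.
    """
    segments: list[str] = []
    current: list[str] = []
    in_fence = False

    def flush() -> None:
        # Close the segment being built, if any.
        if current:
            segments.append("\n".join(current))
            current.clear()

    for line in context.splitlines():
        if line.strip().startswith(FENCE):
            if not in_fence:
                flush()               # an opening fence starts a new segment
                current.append(line)
                in_fence = True
            else:
                current.append(line)  # a closing fence ends the code segment
                flush()
                in_fence = False
        elif in_fence:
            current.append(line)      # keep code blocks intact, blank lines included
        elif not line.strip():
            flush()                   # a blank line closes the current segment
        elif re.match(r"#{1,6}\s", line):
            flush()                   # a header starts a new segment
            current.append(line)
        else:
            current.append(line)
    flush()
    return segments
```

Keeping a fenced block as one segment means a code sample is either kept whole or dropped whole, which avoids handing the LLM half a function.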

Step 2: Feature Extraction

Each segment is converted into a feature vector that captures:

Lexical overlap: n-gram similarity between the segment and the query
Positional features: where the segment appears in the context (beginning, middle, end)
Structural features: whether it's code, prose, JSON, markdown, etc.
Novelty score: how much of the segment's information is not already covered by previously seen segments
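A rough sketch of how such a feature vector might be computed is shown below. The exact features, their order, and the names `ngrams`, `jaccard`, and `extract_features` are assumptions made for illustration; the real feature set is not specified here.

```python
FENCE = "`" * 3  # markdown code-fence marker


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lower-cased whitespace-token n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extract_features(segments: list[str], query: str) -> list[list[float]]:
    """Return [overlap, position, is_code, novelty] per segment (illustrative only)."""
    query_grams = ngrams(query, 1) | ngrams(query, 2)
    seen: set = set()  # n-grams from all earlier segments
    features = []
    for i, seg in enumerate(segments):
        grams = ngrams(seg, 1) | ngrams(seg, 2)
        overlap = jaccard(grams, query_grams)           # lexical overlap with the query
        position = i / max(len(segments) - 1, 1)        # 0.0 = first, 1.0 = last
        is_code = 1.0 if seg.lstrip().startswith(FENCE) else 0.0
        repeated = len(grams & seen) / len(grams) if grams else 0.0
        novelty = 1.0 - repeated                        # 1.0 = nothing seen before
        seen |= grams
        features.append([overlap, position, is_code, novelty])
    return features
```

In this sketch a segment that repeats an earlier one verbatim gets a novelty of 0.0, which gives the scorer a signal to drop the repeat.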

Step 3: Scoring

The learned policy takes the feature vector and outputs a relevance score (0–1) for each segment. The threshold is adaptive — the policy adjusts its cutoff based on the distribution of scores and the target compression ratio.
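One way an adaptive cutoff could work is sketched below, as an assumption rather than the shipped logic: drop the lowest-scoring segments until the token budget implied by the target ratio is used up, and never drop anything at or above a safety ceiling. The function name, the `ceiling` parameter, and the "keep if score >= cutoff" convention are all hypothetical.

```python
def adaptive_threshold(
    scores: list[float],
    token_counts: list[int],
    target_ratio: float = 0.65,
    ceiling: float = 0.5,
) -> float:
    """Return a cutoff; segments with score >= cutoff are kept.

    Illustrative sketch only. Segments are dropped from the lowest score upward
    until dropping more would exceed target_ratio of all tokens, and segments
    scoring at or above `ceiling` are never dropped.
    """
    budget = target_ratio * sum(token_counts)  # tokens we may remove

    # Group token counts by score so tied segments are kept or dropped together.
    tokens_at: dict[float, int] = {}
    for score, count in zip(scores, token_counts):
        tokens_at[score] = tokens_at.get(score, 0) + count

    dropped = 0
    for score in sorted(tokens_at):
        if score >= ceiling or dropped + tokens_at[score] > budget:
            return score  # keep this score level and everything above it
        dropped += tokens_at[score]
    return ceiling  # only reached when every segment fits within the budget
```

Because the cutoff depends on the score distribution, a context where most segments look relevant is compressed less than the target ratio instead of being forced down to it.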

Step 4: Assembly

Segments with relevance scores above the threshold are assembled back into compressed context, preserving their original order. The assembly step also handles edge cases: keeping context boundaries legible, re-inserting necessary transition text, and ensuring the output is valid for the downstream LLM.
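A minimal sketch of order-preserving assembly follows. The `[...]` elision marker is an assumption standing in for the "transition text" mentioned above, and the function name `assemble` is hypothetical.

```python
def assemble(
    segments: list[str],
    scores: list[float],
    cutoff: float,
    marker: str = "[...]",
) -> str:
    """Join kept segments in original order, marking interior gaps.

    Illustrative only: a segment is kept when its score is >= cutoff, and a
    marker is inserted wherever dropped segments sat between two kept ones.
    """
    out: list[str] = []
    gap = False  # True while we are skipping dropped segments
    for seg, score in zip(segments, scores):
        if score >= cutoff:
            if gap and out:
                out.append(marker)  # make the omission visible to the LLM
            out.append(seg)
            gap = False
        else:
            gap = True
    return "\n\n".join(out)
```

Marking gaps is one way to keep boundaries legible: the downstream model can tell that two adjacent passages were not adjacent in the source.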

Key property: oracle recall

SuperCompress targets 97%+ oracle recall: measured against the full original context, the compressed prompt retains virtually all of the information needed to answer the query. The scorer achieves this by being conservative. When uncertain, it keeps the segment rather than dropping it.
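For concreteness, here is one plausible way to compute such a metric, assuming each evaluation example labels which segments an oracle needs to answer the query. The segment-level definition and the function names are assumptions; the benchmark's exact definition is not given here.

```python
def oracle_recall(kept: set[int], oracle: set[int]) -> float:
    """Fraction of answer-critical segment indices that survived compression."""
    if not oracle:
        return 1.0  # nothing was required, so nothing could be lost
    return len(kept & oracle) / len(oracle)


def mean_oracle_recall(examples: list[tuple[set[int], set[int]]]) -> float:
    """Average recall over (kept, oracle) pairs; assumes a non-empty list."""
    return sum(oracle_recall(kept, oracle) for kept, oracle in examples) / len(examples)
```

Under this definition, keeping a borderline segment can only raise recall, which is why a conservative scorer trades some compression for safety.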

3. The Learned Policy

The policy is a small feed-forward neural network with approximately 5,000 parameters. To put that in context:

Model	Parameters	Relative size
SuperCompress policy	~5K	1×
Headroom (ModernBERT)	165M+	33,000×
GPT-2 Small	124M	24,800×
BERT-Base	110M	22,000×
MiniLM-L6	22.7M	4,540×

The small size is intentional. It means:

The policy fits in L1 cache on modern CPUs — no memory bottleneck
Inference completes in microseconds per segment
The entire package (policy + code) is ~200KB
No GPU, no ONNX Runtime, no model download
Edge deployment is straightforward — the policy compiles to WASM for Cloudflare Workers
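The arithmetic behind the size claims can be checked with a short sketch. The layer sizes below are hypothetical, chosen only to land near 5,000 parameters; the actual architecture is not specified here.

```python
def mlp_param_count(layer_sizes: list[int]) -> int:
    """Weights plus biases for a fully connected feed-forward network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical shape: 24 input features, two hidden layers, one relevance score.
layers = [24, 64, 48, 1]
params = mlp_param_count(layers)   # 1600 + 3120 + 49 = 4769
size_bytes = params * 4            # 19076 bytes at 32-bit floats
print(params, size_bytes)
```

At roughly 19 KB in float32, a network of this shape sits within the tens of kilobytes of L1 data cache that current CPU cores typically provide, which is consistent with the cache claim above.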

The policy was trained on a dataset of ~100K (context, query, keep/drop) examples generated from real production prompts across customer support, code generation, data extraction, and Q&A tasks. Training used a contrastive loss that penalizes dropping segments that contain information needed to answer the query.

4. Compiler Mode

In addition to the basic scoring policy, SuperCompress offers a compiler mode that applies a second-pass optimization. The compiler:

Deduplicates — removes repeated information across segments, keeping the most complete version
Strips tool noise — removes boilerplate from tool outputs, stack traces, and logging statements that don't contribute to the answer
Preserves dependencies — if a later segment references an earlier definition, both are kept even if one scores low individually
Reports verification risk — the compiler flags segments where dropping them carries uncertainty, producing a risk score alongside the compressed output
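Two of these passes, dependency preservation and deduplication, can be sketched as follows. This is an illustration under simplifying assumptions, not the compiler's implementation: definitions are detected with a crude regex for Python-style `def`, `class`, and assignments, duplicates are detected by normalized substring containment, and the name `compile_pass` is hypothetical. Tool-noise stripping and risk reporting are omitted.

```python
import re

# Crude stand-in for definition detection: "def name", "class name", or "name =".
DEF_PATTERN = re.compile(r"^\s*(?:def|class)\s+(\w+)|^\s*(\w+)\s*=(?!=)", re.MULTILINE)


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def compile_pass(segments: list[str], keep: list[bool]) -> list[bool]:
    """Return an updated keep mask after dependency and dedup passes (illustrative)."""
    keep = list(keep)

    # Map each defined name to the first segment that defines it.
    defined_in: dict[str, int] = {}
    for idx, seg in enumerate(segments):
        for match in DEF_PATTERN.finditer(seg):
            defined_in.setdefault(match.group(1) or match.group(2), idx)

    # Dependencies: walk backwards so a re-kept definition also pulls in
    # the earlier definitions it relies on.
    for idx in range(len(segments) - 1, -1, -1):
        if not keep[idx]:
            continue
        for name in set(re.findall(r"\w+", segments[idx])):
            src = defined_in.get(name)
            if src is not None and src < idx:
                keep[src] = True

    # Dedup: drop a kept segment contained in another kept segment,
    # keeping the longer one (or the earlier one when identical).
    norm = [normalize(s) for s in segments]
    for i in range(len(segments)):
        for j in range(len(segments)):
            if i == j or not (keep[i] and keep[j]) or not norm[i]:
                continue
            if norm[i] in norm[j] and (norm[i] != norm[j] or i > j):
                keep[i] = False
                break
    return keep
```

Running the dependency pass before deduplication means a definition is only dropped as a duplicate when a kept segment still contains it in full.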

On the bundled long-context presets, compiler mode achieves 82.5% average token savings with low-risk verification — compared to ~65% for the basic scoring policy alone.

5. The Verifier

The verifier is a lightweight post-processing step that checks the compressed output for potential information loss. It answers two questions:

Entity coverage: Are all named entities from the original context that are relevant to the query preserved in the compressed output?
Dependency integrity: If the context defines a term or concept that the query references, is that definition still present?

The verifier produces a risk score (low, medium, high) for each compression. Applications can use this score to decide whether to use the compressed output, fall back to the original, or route to a more expensive model for verification.
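The entity-coverage check and the fallback decision might look like the sketch below. Everything in it is an assumption for illustration: entities are approximated by capitalized words (a real verifier would likely use something stronger), "relevant" means the entity also appears in the query, and the 50% boundary between medium and high risk is arbitrary. Dependency integrity is not covered.

```python
import re

# Crude stand-in for named-entity detection: capitalized alphanumeric words.
ENTITY = re.compile(r"\b[A-Z][A-Za-z0-9]+\b")


def entities(text: str) -> set[str]:
    return set(ENTITY.findall(text))


def risk_level(original: str, compressed: str, query: str) -> str:
    """Return "low", "medium", or "high" based on missing query-relevant entities."""
    relevant = entities(original) & entities(query)
    if not relevant:
        return "low"
    missing = relevant - entities(compressed)
    fraction_missing = len(missing) / len(relevant)
    if fraction_missing == 0:
        return "low"
    return "medium" if fraction_missing < 0.5 else "high"


def choose_context(original: str, compressed: str, query: str) -> str:
    """Use the compressed context only when the risk is low; otherwise fall back."""
    if risk_level(original, compressed, query) == "low":
        return compressed
    return original
```

Falling back to the original on anything other than low risk is the most cautious policy; an application could instead route medium-risk cases to a second check.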

In practice, >95% of compressions across the benchmark suite receive a low-risk score, and the high-risk cases are typically very short prompts where compression has little room to operate.

6. Deployment Architecture

SuperCompress ships in three forms:

Form	Use case	Latency	Setup
Python library	Local in-process compression	~15ms	`pip install supercompress`
Hosted API	Zero-infrastructure compression	~5ms + network	Get a key at /dashboard
GitHub Action	CI/CD pipeline compression	—	`uses: arjunkshah/supercompress/.github/actions/...`

All three forms share the same core policy and produce identical compression results for the same inputs. The hosted API adds automatic scaling, rate limiting, and usage analytics through the dashboard.

For framework-specific setup, see the integration guides for FastAPI, Flask, Django, Express.js, and Spring Boot.

7. Comparison with Alternatives

SuperCompress's architecture differs fundamentally from other prompt compression approaches. Here's how:

Method	Approach	Query-aware	Model size	Latency
SuperCompress	Learned scoring policy	Yes	~5K params	~15ms
Headroom	Transformer classifier	Type-only	165M+	~100ms + warmup
Sliding window	Fixed-position truncation	No	0	~0ms
Top-K retrieval	Embedding similarity	Partial	Embedding model	~50ms
HyDE	Hypothetical doc + retrieval	Yes	LLM + embedding	~500ms+

For a comprehensive comparison, see the alternatives hub and the full guide.
