SuperCompress Architecture

Last updated July 3, 2026 · 10 min read

How a ~5,000-parameter query-aware policy running on CPU removes ~65% of prompt tokens before inference while preserving answer-critical evidence — a technical deep dive.

Contents

  1. System Overview
  2. The Compression Pipeline
  3. The Learned Policy
  4. Compiler Mode
  5. The Verifier
  6. Deployment Architecture
  7. Comparison with Alternatives

1. System Overview

SuperCompress is a query-aware context compression engine. It sits between your application and the LLM, scores each segment of the prompt against the user's question, and removes segments that score low on relevance.

Context + Query → Segmenter → Scorer (~5K params) → Compiler → Verifier → Compressed Context → LLM

The key design choices:

The system is designed for serverless deployment — cold start is ~60ms, making it suitable for AWS Lambda, Cloudflare Workers, and Vercel Edge Functions.

2. The Compression Pipeline

Step 1: Segmentation

The input context is split into semantic segments. Each segment is a coherent unit — a paragraph, a code block, a list item, or a section of a log file. The segmenter uses heuristics (blank lines, indentation, markdown headers, code fence boundaries) to identify natural breaks.
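The boundary heuristics described above can be sketched roughly as follows. This is a minimal illustration, not SuperCompress's actual segmenter: the function name `segment` and both regular expressions are assumptions, and the indentation heuristic is left out for brevity.

```python
import re
from typing import List

# Assumed patterns: a code fence is three backticks or three tildes,
# a header is one to six '#' characters followed by whitespace.
FENCE = re.compile(r"^\s*(`{3}|~{3})")
HEADER = re.compile(r"^#{1,6}\s")


def segment(text: str) -> List[str]:
    """Split text at blank lines, markdown headers, and code fence boundaries."""
    segments: List[str] = []
    current: List[str] = []
    in_fence = False

    def flush() -> None:
        if current:
            segments.append("\n".join(current))
            current.clear()

    for line in text.splitlines():
        if FENCE.match(line):
            if in_fence:
                current.append(line)  # closing fence ends the code segment
                flush()
                in_fence = False
            else:
                flush()               # opening fence starts a new segment
                current.append(line)
                in_fence = True
        elif in_fence:
            current.append(line)      # keep code blocks whole, blank lines included
        elif not line.strip():
            flush()                   # a blank line closes the current segment
        elif HEADER.match(line):
            flush()                   # a header starts a new segment
            current.append(line)
        else:
            current.append(line)
    flush()
    return segments
```

Keeping a fenced block as a single segment means a code sample is either kept whole or dropped whole, which avoids handing the LLM half a function.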

Step 2: Feature Extraction

Each segment is converted into a feature vector that captures:

Step 3: Scoring

The learned policy takes the feature vector and outputs a relevance score (0–1) for each segment. The threshold is adaptive — the policy adjusts its cutoff based on the distribution of scores and the target compression ratio.
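One way to picture an adaptive cutoff is as a token-weighted quantile of the score distribution. The sketch below is an assumption about how such a rule could work, not the policy's real logic; `adaptive_cutoff`, `token_counts`, and `target_ratio` are illustrative names. It drops the lowest-scoring segments only while the dropped tokens stay within the target, so it never removes more than the target ratio.

```python
from typing import Sequence


def adaptive_cutoff(scores: Sequence[float],
                    token_counts: Sequence[int],
                    target_ratio: float) -> float:
    """Return a cutoff; segments with score >= cutoff are kept.

    target_ratio is the fraction of tokens we would like to remove (0 <= r < 1).
    """
    if len(scores) != len(token_counts):
        raise ValueError("scores and token_counts must have the same length")
    if not 0.0 <= target_ratio < 1.0:
        raise ValueError("target_ratio must be in [0, 1)")

    drop_budget = target_ratio * sum(token_counts)
    # Visit segments from least to most relevant.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = 0
    for i in order:
        if dropped + token_counts[i] > drop_budget:
            return scores[i]  # first segment we cannot afford to drop
        dropped += token_counts[i]
    return 0.0  # only reached when there are no tokens at all: keep everything


scores = [0.9, 0.1, 0.5, 0.2]
tokens = [10, 40, 20, 30]
cutoff = adaptive_cutoff(scores, tokens, target_ratio=0.75)
keep_mask = [s >= cutoff for s in scores]
```

Because the cutoff is the score of the first segment that does not fit the drop budget, ties are resolved by keeping, which matches the keep-when-uncertain behaviour the document describes.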

Step 4: Assembly

Segments with relevance scores above the threshold are assembled back into compressed context, preserving their original order. The assembly step also handles edge cases: keeping context boundaries legible, re-inserting necessary transition text, and ensuring the output is valid for the downstream LLM.
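A minimal sketch of order-preserving assembly, under stated assumptions: the `gap_marker` placeholder stands in for whatever transition text the real system re-inserts, and a score equal to the cutoff is treated as kept. Neither detail is confirmed by the description above.

```python
from typing import List, Sequence


def assemble(segments: Sequence[str],
             scores: Sequence[float],
             cutoff: float,
             gap_marker: str = "[...]") -> str:
    """Join kept segments in original order, marking each run of dropped ones once."""
    out: List[str] = []
    gap_open = False
    for seg, score in zip(segments, scores):
        if score >= cutoff:
            out.append(seg)
            gap_open = False
        elif not gap_open:
            out.append(gap_marker)  # hypothetical marker so the reader sees an elision
            gap_open = True
    return "\n\n".join(out)
```

Collapsing consecutive dropped segments into a single marker keeps the boundaries legible without spending many tokens on placeholders.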

Key property: oracle recall

SuperCompress targets 97%+ oracle recall — meaning the compressed prompt contains virtually all the information needed to answer the query, even when evaluated against the full original context. This is achieved by keeping the scorer conservative: when uncertain, it keeps the segment rather than dropping it.
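The document does not give a formula for oracle recall. One plausible reading, assumed here, is segment-level recall: of the segments an oracle marks as necessary for the answer, what fraction survives compression? The function and argument names are illustrative.

```python
from typing import Iterable, Set


def oracle_recall(kept_ids: Iterable[int], oracle_ids: Iterable[int]) -> float:
    """Fraction of oracle-required segment ids that are present after compression."""
    oracle: Set[int] = set(oracle_ids)
    if not oracle:
        return 1.0  # nothing was required, so nothing was lost
    return len(oracle & set(kept_ids)) / len(oracle)
```

Under this reading, a 97% target means that for every 100 answer-critical segments across an evaluation set, at most 3 are dropped.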

3. The Learned Policy

The policy is a small feed-forward neural network with approximately 5,000 parameters. To put that in context:

Model                 | Parameters | Relative size
SuperCompress policy  | ~5K        | 1×
Headroom (ModernBERT) | 165M+      | 33,000×
GPT-2 Small           | 124M       | 24,800×
BERT-Base             | 110M       | 22,000×
MiniLM-L6             | 22.7M      | 4,540×

The small size is intentional. It means:

The policy was trained on a dataset of ~100K (context, query, keep/drop) examples generated from real production prompts across customer support, code generation, data extraction, and Q&A tasks. Training used a contrastive loss that penalizes dropping segments that contain information needed to answer the query.
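To get a feel for how small a ~5,000-parameter feed-forward network is, the sketch below counts the weights and biases of one hypothetical layout and runs a forward pass in pure Python. The layer sizes (32 input features, hidden layers of 72 and 36 units, one output), the ReLU and sigmoid activations, and the random initialisation are all assumptions chosen to land near 5K; the real architecture is not described above.

```python
import math
import random
from typing import List, Sequence, Tuple

# Hypothetical layout: 32 features -> 72 -> 36 -> 1 relevance score.
LAYER_SIZES = (32, 72, 36, 1)

Layer = Tuple[List[List[float]], List[float]]  # (weights[n_out][n_in], biases[n_out])


def count_parameters(sizes: Sequence[int]) -> int:
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def init_layers(sizes: Sequence[int], seed: int = 0) -> List[Layer]:
    """Random untrained weights, only to make the forward pass runnable."""
    rng = random.Random(seed)
    layers: List[Layer] = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def score(features: Sequence[float], layers: Sequence[Layer]) -> float:
    """Forward pass: ReLU on hidden layers, sigmoid on the output, result in (0, 1)."""
    x = list(features)
    for idx, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return 1.0 / (1.0 + math.exp(-x[0]))


print(count_parameters(LAYER_SIZES))  # 2376 + 2628 + 37 = 5041
```

At this size a forward pass is a few thousand multiply-adds per segment, which is why inference on a CPU is plausible.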

4. Compiler Mode

In addition to the basic scoring policy, SuperCompress offers a compiler mode that applies a second-pass optimization. The compiler:

  1. Deduplicates — removes repeated information across segments, keeping the most complete version
  2. Strips tool noise — removes boilerplate from tool outputs, stack traces, and logging statements that don't contribute to the answer
  3. Preserves dependencies — if a later segment references an earlier definition, both are kept even if one scores low individually
  4. Reports verification risk — the compiler flags segments where dropping them carries uncertainty, producing a risk score alongside the compressed output
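Two of the compiler steps above, deduplication and dependency preservation, can be sketched as below. This is an illustration under assumptions: redundancy is approximated by normalized substring containment, and the dependency map `deps` (segment index to the earlier indices it references) is taken as given rather than detected. Tool-noise stripping and risk reporting are not covered.

```python
from typing import Dict, Sequence, Set


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical segments compare equal."""
    return " ".join(text.lower().split())


def drop_redundant(segments: Sequence[str]) -> Set[int]:
    """Indices to keep: a segment is redundant if its normalized text is contained
    in a longer segment, or in an identical segment that appears earlier."""
    normalized = [_norm(s) for s in segments]
    keep: Set[int] = set()
    for i, text in enumerate(normalized):
        redundant = any(
            j != i and text in other and (len(other) > len(text) or j < i)
            for j, other in enumerate(normalized)
        )
        if not redundant:
            keep.add(i)
    return keep


def close_over_dependencies(keep: Set[int], deps: Dict[int, Set[int]]) -> Set[int]:
    """Add every segment that a kept segment references, transitively."""
    result = set(keep)
    stack = list(keep)
    while stack:
        for dep in deps.get(stack.pop(), ()):
            if dep not in result:
                result.add(dep)
                stack.append(dep)
    return result
```

For example, if segment 3 calls a function defined in segment 0, then `close_over_dependencies({3}, {3: {0}})` returns `{0, 3}`, so the definition survives even with a low relevance score of its own.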

On the bundled long-context presets, compiler mode achieves 82.5% average token savings with low-risk verification — compared to ~65% for the basic scoring policy alone.
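As a worked example of what those two percentages mean in tokens, the snippet below applies each savings figure to the same prompt. The 10,000-token prompt is an illustrative assumption, and the two averages may not have been measured on the same inputs.

```python
def tokens_remaining(prompt_tokens: int, savings: float) -> int:
    """Tokens left after removing the given fraction of the prompt."""
    return round(prompt_tokens * (1.0 - savings))


basic = tokens_remaining(10_000, 0.65)      # basic scoring policy
compiled = tokens_remaining(10_000, 0.825)  # compiler mode
print(basic, compiled, basic / compiled)
```

Going from ~65% to 82.5% savings halves the tokens that reach the model: 3,500 versus 1,750 for this prompt.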

5. The Verifier

The verifier is a lightweight post-processing step that checks the compressed output for potential information loss. It answers two questions:

The verifier produces a risk score (low, medium, high) for each compression. Applications can use this score to decide whether to use the compressed output, fall back to the original, or route to a more expensive model for verification.
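An application-side routing rule built on the three risk levels might look like the sketch below. The mapping from risk to action is one possible policy chosen for illustration; the `Risk` and `Action` enums and the `route` function are hypothetical names, not part of a documented SuperCompress API.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Action(Enum):
    USE_COMPRESSED = "use_compressed"
    VERIFY_WITH_STRONGER_MODEL = "verify_with_stronger_model"
    FALL_BACK_TO_ORIGINAL = "fall_back_to_original"


# Assumed policy: trust low risk, double-check medium, bypass compression on high.
POLICY = {
    Risk.LOW: Action.USE_COMPRESSED,
    Risk.MEDIUM: Action.VERIFY_WITH_STRONGER_MODEL,
    Risk.HIGH: Action.FALL_BACK_TO_ORIGINAL,
}


def route(risk_label: str) -> Action:
    """Map a verifier risk label ('low', 'medium', 'high') to an action."""
    return POLICY[Risk(risk_label)]
```

Keeping the policy in a table makes it easy to tighten per endpoint, for instance by sending medium-risk compressions straight to the original prompt for high-stakes queries.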

In practice, >95% of compressions across the benchmark suite receive a low-risk score, and the high-risk cases are typically very short prompts where compression has little room to operate.

6. Deployment Architecture

SuperCompress ships in three forms:

Form           | Use case                        | Latency        | Setup
Python library | Local in-process compression    | ~15ms          | pip install supercompress
Hosted API     | Zero-infrastructure compression | ~5ms + network | Get a key at /dashboard
GitHub Action  | CI/CD pipeline compression      | -              | uses: arjunkshah/supercompress/.github/actions/...

All three forms share the same core policy and produce identical compression results for the same inputs. The hosted API adds automatic scaling, rate limiting, and usage analytics through the dashboard.

For framework-specific setup, see the integration guides for FastAPI, Flask, Django, Express.js, and Spring Boot.

7. Comparison with Alternatives

SuperCompress's architecture differs fundamentally from other prompt compression approaches. Here's how:

Method          | Approach                     | Query-aware | Model size      | Latency
SuperCompress   | Learned scoring policy       | Yes         | ~5K params      | ~15ms
Headroom        | Transformer classifier       | Type-only   | 165M+           | ~100ms + warmup
Sliding window  | Fixed-position truncation    | No          | 0               | ~0ms
Top-K retrieval | Embedding similarity         | Partial     | Embedding model | ~50ms
HyDE            | Hypothetical doc + retrieval | Yes         | LLM + embedding | ~500ms+

For a comprehensive comparison, see the alternatives hub and the full guide.
