Technical guide

How prompt compression works

SuperCompress uses a learned policy with approximately 5,000 parameters, which is tiny compared to the LLMs whose prompts it compresses. Here is how it works under the hood.

By Arjun Shah - Creator of SuperCompress - Updated 2026-07-03

The scoring policy

The compressor breaks your context into blocks and scores each block against the query. Blocks containing information likely to help answer the question get high scores. Blocks containing irrelevant information get low scores and are removed.
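To make the idea of scoring a block against a query concrete, here is a minimal sketch. It is not the SuperCompress policy: the real scorer is learned, while this stand-in simply measures word overlap. The names `tokenize` and `score_block` are hypothetical and chosen only for illustration.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase alphanumeric word set; a crude stand-in for real tokenization."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_block(block: str, query: str) -> float:
    """Return the fraction of query terms that also appear in the block.

    Assumption: lexical overlap replaces the learned policy here, purely
    to show the shape of a block-versus-query scoring function.
    """
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(block)) / len(query_terms)


blocks = [
    "The invoice total for March was 4,200 dollars.",
    "Our office is closed on public holidays.",
]
query = "What was the invoice total for March?"

# One score per block; higher means more likely to help answer the query.
scores = [score_block(block, query) for block in blocks]
```

In this toy example the first block shares most of its words with the query and scores high, while the second shares none and would be a candidate for removal. A learned policy can pick up relevance signals that plain overlap misses, which is the reason to train one.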

The policy was trained on thousands of question-answering examples to learn what types of content matter for answering different kinds of questions.

Runtime behavior

When you call compressor.compress(context, query):

  1. Context is split into ~100-token blocks
  2. Each block is scored against the query
  3. Low-scoring blocks are removed
  4. The remaining blocks are reassembled in order
  5. Token counts and compression metrics are returned

The entire process takes about 60 ms on a single CPU core and requires no GPU.
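The five steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the SuperCompress implementation: tokens are approximated by whitespace-separated words, the scorer is a simple word-overlap placeholder, and the `threshold` parameter, the `CompressionResult` type, and the standalone `compress` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CompressionResult:
    text: str               # reassembled context after filtering
    original_tokens: int    # token count before compression
    compressed_tokens: int  # token count after compression

    @property
    def kept_fraction(self) -> float:
        """Share of the original tokens that survived compression."""
        if self.original_tokens == 0:
            return 1.0
        return self.compressed_tokens / self.original_tokens


def overlap_score(block: str, query: str) -> float:
    """Placeholder scorer (assumption): fraction of query words found in the block."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(block.lower().split())) / len(query_words)


def compress(
    context: str,
    query: str,
    block_size: int = 100,
    threshold: float = 0.5,
    scorer: Callable[[str, str], float] = overlap_score,
) -> CompressionResult:
    # Whitespace words stand in for model tokens (assumption).
    tokens = context.split()
    # 1. Split the context into blocks of roughly block_size tokens.
    blocks = [
        " ".join(tokens[i:i + block_size])
        for i in range(0, len(tokens), block_size)
    ]
    # 2 and 3. Score each block against the query and drop low scorers.
    # Filtering a list in place of sorting keeps the original block order.
    kept = [block for block in blocks if scorer(block, query) >= threshold]
    # 4. Reassemble the surviving blocks in order.
    text = " ".join(kept)
    # 5. Return token counts so the caller can compute compression metrics.
    return CompressionResult(text, len(tokens), len(text.split()))


context = (
    "refund policy lasts thirty days weather is sunny "
    "today shipping takes five days"
)
result = compress(context, "refund policy", block_size=4)
print(result.text, result.original_tokens, result.compressed_tokens)
```

The small `block_size` in the usage example only keeps the demonstration readable. Because blocks are filtered rather than re-ranked, the surviving text preserves the order of the original context, matching step 4.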

Frequently asked questions

How was the policy trained?

The policy was trained on a diverse dataset of question-answering pairs, learning to predict which context lines are necessary for correct answers.

Can I retrain the policy on my data?

The current release uses a pre-trained policy. Custom training is on the roadmap.

Try it yourself

Paste your long prompt into the playground, ask a question, and see what SuperCompress keeps and removes. Free, no signup needed.
