Technical guide
How prompt compression works
SuperCompress uses a learned policy of approximately 5,000 parameters, tiny compared with the LLMs whose prompts it compresses. Here is how it works under the hood.
The scoring policy
The compressor breaks your context into blocks and scores each block against the query. Blocks containing information likely to help answer the question get high scores. Blocks containing irrelevant information get low scores and are removed.
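To make the idea of scoring blocks against a query concrete, here is a minimal sketch. The real scorer is SuperCompress's learned policy, whose internals are not described here, so this example swaps in a hypothetical lexical-overlap score: the fraction of query terms that appear in a block. The names `score_block`, `filter_blocks`, and the `threshold` parameter are illustrative assumptions, not part of the SuperCompress API.

```python
from __future__ import annotations

import re


def tokenize(text: str) -> set[str]:
    # Lowercase alphanumeric word tokens; a crude stand-in for a real tokenizer.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_block(block: str, query: str) -> float:
    # Hypothetical relevance score: fraction of query terms present in the block.
    # SuperCompress uses a learned policy instead; this is only a placeholder.
    q = tokenize(query)
    if not q:
        return 0.0
    return len(q & tokenize(block)) / len(q)


def filter_blocks(blocks: list[str], query: str, threshold: float = 0.5) -> list[str]:
    # Keep blocks whose score meets the (hypothetical) threshold; drop the rest.
    return [b for b in blocks if score_block(b, query) >= threshold]


blocks = [
    "The Eiffel Tower is in Paris.",
    "Bananas are yellow.",
    "Paris is the capital of France.",
]
query = "Where is the Eiffel Tower?"
kept = filter_blocks(blocks, query)  # ["The Eiffel Tower is in Paris."]
```

Here the first block scores 0.8, the third 0.4, and the second 0.0, so only the first survives a 0.5 threshold. A learned policy can pick up signals that plain word overlap misses, such as paraphrases or which kinds of content matter for a given question type.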
The policy was trained on thousands of question-answering examples to learn what types of content matter for answering different kinds of questions.
Runtime behavior
When you call compressor.compress(context, query):
- Context is split into ~100-token blocks
- Each block is scored against the query
- Low-scoring blocks are removed
- The remaining blocks are reassembled in order
- Token counts and compression metrics are returned
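The five steps above can be sketched end to end as follows. This is not SuperCompress's implementation. It assumes whitespace-separated words approximate tokens, uses the same hypothetical overlap score as a stand-in for the learned policy, and decides what counts as "low-scoring" with a hypothetical `keep_ratio` (keep the top fraction of blocks). The returned dictionary keys are illustrative, not the library's actual metric names.

```python
from __future__ import annotations

import math
import re


def split_blocks(text: str, block_size: int = 100) -> list[list[str]]:
    # Assumption: whitespace words approximate model tokens; real token counts differ.
    words = text.split()
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]


def score(block: list[str], query: str) -> float:
    # Placeholder for the learned policy: fraction of query terms found in the block.
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    b = set(re.findall(r"[a-z0-9]+", " ".join(block).lower()))
    return len(q & b) / len(q) if q else 0.0


def compress(context: str, query: str, keep_ratio: float = 0.5, block_size: int = 100) -> dict:
    blocks = split_blocks(context, block_size)            # 1. split into blocks
    scores = [score(b, query) for b in blocks]            # 2. score each block
    n_keep = max(1, math.ceil(len(blocks) * keep_ratio)) if blocks else 0
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])                        # 3. drop low scorers
    text = " ".join(" ".join(blocks[i]) for i in kept)    # 4. reassemble in order
    original = sum(len(b) for b in blocks)
    compressed = sum(len(blocks[i]) for i in kept)
    return {                                              # 5. report metrics
        "text": text,
        "original_tokens": original,
        "compressed_tokens": compressed,
        "ratio": compressed / original if original else 1.0,
    }


context = "alpha beta gamma delta cats purr loudly here dogs bark at night"
result = compress(context, "why do cats purr", keep_ratio=0.3, block_size=4)
```

With `block_size=4`, the context splits into three blocks and only "cats purr loudly here" matches the query, so it is the one kept. Sorting the surviving indices before joining is what preserves the original block order. Note that rejoining on single spaces discards the original whitespace, which a production implementation would likely preserve.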
The entire process takes ~60ms on a single CPU core, with no GPU required.
Frequently asked questions
How was the policy trained?
The policy was trained on a diverse dataset of question-answering pairs, learning to predict which context blocks are necessary for correct answers.
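One way to frame this training objective is per-block binary classification: label each block 1 if it was needed for the correct answer and 0 otherwise, then fit a model that outputs a keep probability. The sketch below uses plain logistic regression trained with stochastic gradient descent. It is an assumption for illustration only: the actual policy's architecture, features, and training procedure are not described here, and the single-feature inputs (for example, a query-overlap score) are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_policy(features, labels, lr=0.5, epochs=500):
    # features: one hypothetical feature vector per block
    # labels: 1 if the block was needed for the correct answer, else 0
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    # Probability that a block with features x should be kept.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


w, b = train_policy([[0.9], [0.8], [0.1], [0.0]], [1, 1, 0, 0])
keep_prob = predict(w, b, [0.85])  # high: resembles the "needed" blocks
drop_prob = predict(w, b, [0.05])  # low: resembles the "not needed" blocks
```

This toy model has two parameters. A policy of roughly 5,000 parameters would use richer block and query features, but the learning signal (was this block needed for a correct answer?) can take the same shape.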
Can I retrain the policy on my data?
The current release uses a pre-trained policy. Custom training is on the roadmap.
Try it yourself
Paste your long prompt into the playground, ask a question, and see what SuperCompress keeps and removes. Free, no signup needed.