RAG token optimization with compression
Retrieval-augmented generation (RAG) pipelines typically retrieve 10-20 chunks per query, totaling 5,000-10,000 tokens. Compression removes irrelevant chunks before the LLM call.
The RAG token problem
In a typical RAG pipeline, the retriever returns 10-15 document chunks. Only 2-3 of those chunks contain information relevant to the specific query. The rest are noise that costs tokens and dilutes the model's focus.
SuperCompress sits between retrieval and generation: it scores each retrieved chunk against the query and keeps only the chunks most likely to contain the answer.
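
To make the score-and-keep idea concrete, here is a minimal sketch of query-aware chunk filtering. It is not SuperCompress's actual scoring method, which this page does not describe. The helpers `tokenize`, `score_chunk`, and `filter_chunks` are hypothetical names, and the relevance score is a deliberately crude word-overlap measure chosen only for illustration.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_chunk(query, chunk):
    """Toy relevance score: fraction of distinct query terms that appear in the chunk."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    chunk_terms = set(tokenize(chunk))
    return len(query_terms & chunk_terms) / len(query_terms)

def filter_chunks(query, chunks, keep=3):
    """Keep the `keep` highest-scoring chunks, preserving their original order."""
    scored = [(score_chunk(query, c), i) for i, c in enumerate(chunks)]
    # Sort by descending score; ties go to the chunk retrieved first.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:keep]
    keep_idx = sorted(i for _, i in top)
    return [chunks[i] for i in keep_idx]

# Hypothetical retrieved chunks for illustration only.
chunks = [
    "The refund window is 30 days from the delivery date.",
    "Our offices are closed on public holidays.",
    "Refund requests are processed within 5 business days.",
    "The company was founded in 2012.",
]
kept = filter_chunks("How long is the refund window?", chunks, keep=2)
print(kept)
```

With these inputs the two refund-related chunks survive and the other two are dropped. A production scorer would likely use something stronger than word overlap, but the shape of the step is the same: score every chunk against the query, then keep a small subset.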
RAG with compression
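    from supercompress import Compressor

    comp = Compressor()

    def optimized_rag(query, retriever, llm):
        # Retrieve a generous set of chunks; compression filters them below.
        chunks = retriever.retrieve(query, k=15)
        # Join the chunk texts into a single context string.
        full_context = "\n".join(c.text for c in chunks)
        # Keep only the parts of the context relevant to the query.
        result = comp.compress(full_context, query)
        return llm.generate(query, result.compressed_text)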
Cost comparison
| Chunks retrieved | Tokens before compression | Tokens after compression | Savings |
|---|---|---|---|
| 10 | 5,000 | 750 | 85% |
| 15 | 7,500 | 1,125 | 85% |
| 20 | 10,000 | 1,500 | 85% |
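
The figures in the table follow from simple arithmetic, sketched below. The two constants are not stated on this page; they are inferred from the table itself (5,000 tokens across 10 chunks gives 500 tokens per chunk, and 750 of 5,000 tokens kept gives 15%). The function name `compression_savings` is a hypothetical helper, and real savings will vary with the documents and the query.

```python
def compression_savings(original_tokens, compressed_tokens):
    """Return the fraction of context tokens removed by compression."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1 - compressed_tokens / original_tokens

TOKENS_PER_CHUNK = 500  # assumption inferred from the table: 5,000 tokens / 10 chunks
KEEP_PERCENT = 15       # assumption inferred from the table: 750 / 5,000 tokens kept

rows = []
for n_chunks in (10, 15, 20):
    original = n_chunks * TOKENS_PER_CHUNK
    compressed = original * KEEP_PERCENT // 100  # integer math avoids float rounding
    rows.append((n_chunks, original, compressed,
                 compression_savings(original, compressed)))

for n_chunks, original, compressed, savings in rows:
    print(f"{n_chunks} chunks: {original:,} -> {compressed:,} tokens ({savings:.0%} saved)")
# 10 chunks: 5,000 -> 750 tokens (85% saved)
# 15 chunks: 7,500 -> 1,125 tokens (85% saved)
# 20 chunks: 10,000 -> 1,500 tokens (85% saved)
```

Because the kept fraction is held constant, the percentage saved is the same in every row, while the absolute number of tokens saved grows with the number of chunks retrieved.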
Frequently asked questions
Does compression improve RAG quality?
Often yes. Removing noise helps the model focus on relevant evidence.
What embedding models work best?
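Any embedding model works. Compression runs after retrieval, on the retrieved text and the query, so it does not depend on which embedding model the retriever uses.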
Try it yourself
Paste your long prompt into the playground, ask a question, and see what SuperCompress keeps and removes. Free, no signup needed.