RAG guide

RAG token optimization with compression

Retrieval-augmented generation (RAG) typically retrieves 10-20 chunks totaling 5,000-10,000 tokens. Compression removes the irrelevant chunks before the LLM call.

By Arjun Shah - Creator of SuperCompress - Updated 2026-07-03

The RAG token problem

In a typical RAG pipeline, the retriever returns 10-20 document chunks. Often only 2-3 of those chunks contain information relevant to the specific query. The rest is noise that costs tokens and dilutes the model's focus.

SuperCompress sits between retrieval and generation: it scores each retrieved chunk against the query and keeps only the chunks most likely to contain the answer.
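To make the idea of scoring chunks against the query concrete, here is a minimal sketch in Python. It is not SuperCompress's actual scoring method, which this guide does not describe. It assumes a simple lexical overlap score, and the names tokenize, score_chunk, and keep_top_chunks are hypothetical helpers invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score_chunk(query, chunk):
    """Assumed scoring rule: fraction of query terms that appear in the chunk."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(chunk)) / len(query_terms)

def keep_top_chunks(query, chunks, keep=3):
    """Return the `keep` highest-scoring chunks, in their original retrieval order."""
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: score_chunk(query, chunks[i]),
        reverse=True,
    )
    return [chunks[i] for i in sorted(ranked[:keep])]

# Toy chunks for illustration only.
chunks = [
    "Our office is closed on public holidays.",
    "Refunds are issued within 14 days of a return request.",
    "The cafeteria menu changes weekly.",
    "To request a refund, open a return request in your account.",
]
kept = keep_top_chunks("How do I request a refund?", chunks, keep=2)
for text in kept:
    print(text)
```

A production scorer would likely use something stronger than word overlap, but the shape stays the same: score every chunk, keep the top few, and pass only those on to the model.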

RAG with compression

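from supercompress import Compressor

comp = Compressor()

def optimized_rag(query, retriever, llm):
    # Retrieve a generous candidate set; compression trims it afterwards.
    chunks = retriever.retrieve(query, k=15)
    # Join chunk texts with blank lines so chunk boundaries stay visible.
    full_context = "\n\n".join(c.text for c in chunks)
    # Keep only the content relevant to the query.
    result = comp.compress(full_context, query)
    return llm.generate(query, result.compressed_text)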
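One way to check the wiring of a pipeline like this without a live retriever, model, or the supercompress package is to pass in stand-ins. The sketch below makes several assumptions: every class in it (Chunk, CompressionResult, StubRetriever, StubCompressor, StubLLM) is hypothetical, the stub compressor only mimics the compress(text, query) call and the compressed_text attribute used above, and the compressor is passed as a parameter so it can be swapped out.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str

@dataclass
class CompressionResult:
    compressed_text: str

class StubRetriever:
    """Hypothetical retriever that returns a fixed list of chunks."""
    def __init__(self, texts):
        self.texts = texts

    def retrieve(self, query, k=15):
        return [Chunk(t) for t in self.texts[:k]]

class StubCompressor:
    """Hypothetical stand-in: keeps paragraphs that share a word with the query."""
    def compress(self, text, query):
        query_words = set(query.lower().split())
        kept = [
            paragraph
            for paragraph in text.split("\n\n")
            if query_words & set(paragraph.lower().split())
        ]
        return CompressionResult("\n\n".join(kept))

class StubLLM:
    """Hypothetical model that echoes the context it receives."""
    def generate(self, query, context):
        return context

def optimized_rag(query, retriever, llm, compressor):
    # Same steps as the pipeline above, with the compressor injected.
    chunks = retriever.retrieve(query, k=15)
    full_context = "\n\n".join(c.text for c in chunks)
    result = compressor.compress(full_context, query)
    return llm.generate(query, result.compressed_text)

retriever = StubRetriever([
    "refund policy lasts 14 days",
    "cafeteria menu changes weekly",
])
output = optimized_rag("refund window", retriever, StubLLM(), StubCompressor())
print(output)
```

Because the stub model echoes its context, the printed output shows exactly what would reach the LLM after compression.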

Cost comparison

Chunks Retrieved    Tokens    With Compression    Savings
10                  5,000     750                 85%
15                  7,500     1,125               85%
20                  10,000    1,500               85%
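The arithmetic behind the table can be reproduced with a short sketch. Two constants are inferred from the table rather than stated in the guide: about 500 tokens per chunk (5,000 tokens / 10 chunks) and about 15% of tokens retained after compression (750 / 5,000). The function name compression_savings is hypothetical.

```python
TOKENS_PER_CHUNK = 500    # assumption inferred from the table: 5,000 tokens / 10 chunks
RETAINED_FRACTION = 0.15  # assumption inferred from the table: 750 / 5,000

def compression_savings(num_chunks,
                        tokens_per_chunk=TOKENS_PER_CHUNK,
                        retained=RETAINED_FRACTION):
    """Return (original_tokens, compressed_tokens, savings_percent)."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    original = num_chunks * tokens_per_chunk
    compressed = round(original * retained)
    savings_pct = round(100 * (1 - compressed / original))
    return original, compressed, savings_pct

for k in (10, 15, 20):
    original, compressed, pct = compression_savings(k)
    print(f"{k} chunks: {original:,} -> {compressed:,} tokens ({pct}% saved)")
```

Real savings depend on chunk size and on how much of the retrieved text is relevant to each query, so measuring token counts on your own traffic is the more reliable check.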

Frequently asked questions

Does compression improve RAG quality?

Often yes. Removing noise helps the model focus on relevant evidence.

What embedding models work best?

Any embedding model works. Compression is embedding-agnostic: it runs on the retrieved text after the retriever has returned its chunks.

Try it yourself

Paste your long prompt into the playground, ask a question, and see what SuperCompress keeps and removes. Free, no signup needed.
