Context compression guide
Context compression for agents and RAG
Context compression turns oversized agent memory, retrieved documents, and logs into a smaller context that still answers the user's question.
Why agents need context compression
AI agents accumulate context with every turn: conversation history, tool call results, code outputs. After 5-10 turns, a typical agent prompt can exceed 10,000 tokens.
Without context compression, cost balloons, latency increases, and quality degrades as the model sifts through noise.
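To make that growth concrete, here is a rough back-of-the-envelope sketch. The per-turn token counts are invented for illustration and are not measurements. The point is only that each turn resends the whole history, so the prompt grows every turn and the total number of input tokens processed grows much faster than the history itself.

```python
from itertools import accumulate

# Hypothetical tokens added to the history on each turn
# (user message + tool output + model reply). Illustrative numbers only.
tokens_added_per_turn = [800, 1500, 1200, 2000, 900, 1800, 1100, 1600]

# The prompt at turn n contains everything accumulated up to that turn.
prompt_sizes = list(accumulate(tokens_added_per_turn))

# Every turn resends the full history, so the session total is the sum
# of all per-turn prompt sizes.
total_input_tokens = sum(prompt_sizes)

for turn, size in enumerate(prompt_sizes, start=1):
    print(f"turn {turn}: prompt is about {size:,} tokens")
print(f"total input tokens processed: {total_input_tokens:,}")
```

With these made-up numbers the prompt passes 10,000 tokens on turn 8, while the session as a whole has processed 46,900 input tokens.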
RAG pipeline integration
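from supercompress import Compressor

comp = Compressor()

def rag_with_compression(query, retriever, llm):
    # Retrieve candidate documents for the query.
    docs = retriever.retrieve(query)
    # Join the document texts into one context string.
    context = "\n".join(d.text for d in docs)
    # Compress the context around the current query.
    result = comp.compress(context, query)
    # Generate the answer from the compressed context only.
    return llm.generate(query, result.compressed_text)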
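The function above can be exercised without a real retriever, model, or compressor. The sketch below is one way to do that, assuming only the call shapes shown in the snippet: `retrieve(query)`, `compress(context, query)` returning an object with `compressed_text`, and `generate(query, context)`. `StubRetriever`, `StubCompressor`, `StubLLM`, `Doc`, and `CompressionResult` are hypothetical stand-ins written for this example. The stub compressor's word-overlap filter is not how SuperCompress works, and the compressor is passed as a parameter here only to make the flow easy to test.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str


@dataclass
class CompressionResult:
    compressed_text: str


class StubRetriever:
    """Hypothetical stand-in that returns a fixed set of documents."""

    def __init__(self, docs):
        self._docs = docs

    def retrieve(self, query):
        return self._docs


class StubCompressor:
    """Hypothetical stand-in: keeps lines that share a word with the query.

    This only mimics the call shape; it is not the real compression logic.
    """

    def compress(self, context, query):
        query_words = set(query.lower().split())
        kept = [
            line
            for line in context.split("\n")
            if query_words & set(line.lower().split())
        ]
        return CompressionResult(compressed_text="\n".join(kept))


class StubLLM:
    """Hypothetical stand-in that echoes the context it was given."""

    def generate(self, query, context):
        return f"Q: {query}\nContext used:\n{context}"


def rag_with_compression(query, retriever, llm, compressor):
    docs = retriever.retrieve(query)
    context = "\n".join(d.text for d in docs)
    result = compressor.compress(context, query)
    return llm.generate(query, result.compressed_text)


docs = [
    Doc("The cache timeout defaults to 30 seconds."),
    Doc("Release notes for version 2 were published in spring."),
    Doc("Set the timeout with the --timeout flag."),
]
answer = rag_with_compression(
    "how do I change the timeout", StubRetriever(docs), StubLLM(), StubCompressor()
)
print(answer)
```

Only the two timeout-related documents reach the stub model; the release-notes line is dropped before generation.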
Frequently asked questions
Should I compress before or after retrieval?
After retrieval. Retrieve broadly, then compress around the current question.
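The ordering can be sketched as two steps: a loose retrieval pass that keeps several candidate documents, followed by a query-focused pass that keeps only the most relevant sentences within a budget. Everything here is an assumption made for illustration: `retrieve_broadly`, `compress_for_query`, the word-overlap scoring, and the word budget are hypothetical and much cruder than a real retriever or compressor.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve_broadly(query, corpus, k=3):
    """Step 1: cast a wide net. Rank documents by word overlap, keep the top k."""
    q = set(tokenize(query))
    ranked = sorted(corpus, key=lambda doc: len(q & set(tokenize(doc))), reverse=True)
    return ranked[:k]


def compress_for_query(query, docs, max_words=12):
    """Step 2: keep the sentences closest to the query, within a word budget."""
    q = set(tokenize(query))
    sentences = [
        s.strip()
        for d in docs
        for s in re.split(r"(?<=[.!?])\s+", d)
        if s.strip()
    ]
    # Visit sentences from most to least overlap with the query.
    scored = sorted(
        enumerate(sentences),
        key=lambda pair: len(q & set(tokenize(pair[1]))),
        reverse=True,
    )
    kept, used = [], 0
    for index, sentence in scored:
        n_words = len(tokenize(sentence))
        overlap = len(q & set(tokenize(sentence)))
        # Skip sentences with no overlap or that would exceed the budget.
        if overlap == 0 or used + n_words > max_words:
            continue
        kept.append((index, sentence))
        used += n_words
    # Restore the original order so the compressed context reads naturally.
    return " ".join(sentence for _, sentence in sorted(kept))


corpus = [
    "Rate limits reset every minute. Contact support to raise your quota.",
    "The dashboard shows usage charts. Rate limits are listed under Settings.",
    "Our office is closed on public holidays.",
    "Billing runs on the first of the month. Invoices are emailed as PDFs.",
]
query = "where are rate limits listed"

broad = retrieve_broadly(query, corpus)
compressed = compress_for_query(query, broad)
print(compressed)
```

Retrieval deliberately lets a weakly related billing document through, and the compression step is what removes it. Compressing before retrieval would instead risk discarding sentences before the question is known.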
Can context compression improve latency?
Yes. Prefill work scales with the number of prompt tokens, so a shorter prompt reduces GPU prefill time and, with it, the time to first token.
Try it yourself
Paste your long prompt into the playground, ask a question, and see what SuperCompress keeps and removes. Free, no signup needed.