Advanced guide
Streaming compression strategies
Most compression happens synchronously before the LLM call, so its runtime is added to every request. Streaming compression overlaps compression with LLM streaming so that this cost is hidden behind the response stream.
The overlap technique
While the LLM streams its response to the current query, compress the conversation history for the next turn in parallel. By the time the user asks their next question, the compressed history is normally ready, so the next turn pays no extra compression latency.
Implementation
import asyncio

from supercompress import Compressor

comp = Compressor()

async def chat_with_overlap(messages, user_query):
    # `llm` is assumed to be your application's async LLM client; it is not defined here.
    # Start compressing the current history in a background thread
    history = "\n".join(m["content"] for m in messages[:-1])
    compressed_future = asyncio.create_task(
        asyncio.to_thread(comp.compress, history, user_query)
    )
    # Stream the LLM response while compression runs
    response = await llm.stream(messages, user_query)
    # Collect the compressed history for the next turn
    # (this waits only if compression is still running)
    compressed = await compressed_future
    return response, compressed.compressed_text
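To see the overlap without an API key, the following is a minimal timing sketch of the same pattern. It does not use SuperCompress or a real LLM: fake_compress and fake_llm_stream are hypothetical stand-ins that only sleep, and the durations (0.3 s and 0.4 s) are made-up numbers chosen for illustration. It requires Python 3.9 or later for asyncio.to_thread.

```python
import asyncio
import time


def fake_compress(history: str, query: str) -> str:
    """Hypothetical stand-in for a blocking compressor (not SuperCompress)."""
    time.sleep(0.3)  # simulate about 0.3 s of blocking work
    return history[: len(history) // 2]


async def fake_llm_stream(query: str) -> str:
    """Hypothetical stand-in for an LLM that streams tokens over about 0.4 s."""
    tokens = []
    for word in ["streamed", "answer", "to", query]:
        await asyncio.sleep(0.1)  # simulate waiting for the next token
        tokens.append(word)
    return " ".join(tokens)


async def timed_turn(history: str, query: str):
    start = time.perf_counter()
    # Run the blocking compressor in a worker thread while the stream is awaited
    task = asyncio.create_task(asyncio.to_thread(fake_compress, history, query))
    response = await fake_llm_stream(query)
    compressed = await task  # already finished if compression was the shorter job
    elapsed = time.perf_counter() - start
    return response, compressed, elapsed


if __name__ == "__main__":
    response, compressed, elapsed = asyncio.run(timed_turn("a" * 100, "hello"))
    print(f"turn took {elapsed:.2f}s; sequential would take about 0.70s")
```

With these stand-ins the turn should take roughly as long as the slower of the two jobs rather than their sum. A real compressor that is CPU-bound in pure Python may overlap less cleanly, because the worker thread competes with the event loop for the interpreter.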
Frequently asked questions
Does this add any latency?
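Not in the typical case. The compression runs in parallel with the LLM response, so it adds latency only when it takes longer than the response stream; in that case the turn waits for the remainder of the compression.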
What if the user asks multiple questions rapidly?
Compress the accumulated history once, after the last question in the burst, rather than once per question. Compression results for the intermediate questions are stale by the time they finish and can be discarded.
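One way to get this behavior is to keep only the most recent compression job and cancel any job it supersedes. The sketch below is an illustration under assumptions, not part of SuperCompress: LatestOnlyCompressor and fake_compress are hypothetical names, and fake_compress just sleeps briefly. Note that cancelling an asyncio task that wraps asyncio.to_thread discards the result but does not stop the worker thread, which still runs to completion.

```python
import asyncio
import time


def fake_compress(history: str, query: str) -> str:
    """Hypothetical stand-in for a blocking compressor (not SuperCompress)."""
    time.sleep(0.05)
    return f"[{len(history)} chars compressed for: {query}]"


class LatestOnlyCompressor:
    """Keeps only the compression job for the most recent question."""

    def __init__(self, compress_fn):
        self._compress_fn = compress_fn
        self._task = None

    def submit(self, history: str, query: str) -> None:
        # Must be called from inside a running event loop.
        if self._task is not None and not self._task.done():
            self._task.cancel()  # drop the stale job's result
        self._task = asyncio.create_task(
            asyncio.to_thread(self._compress_fn, history, query)
        )

    async def result(self) -> str:
        # Result of the latest submitted job
        return await self._task


async def demo() -> str:
    jobs = LatestOnlyCompressor(fake_compress)
    history = ""
    for question in ["q1", "q2", "q3"]:
        history += question + "\n"
        jobs.submit(history, question)
        await asyncio.sleep(0)  # questions arrive back to back
    return await jobs.result()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Only the job submitted for the last question contributes a result; the earlier jobs are cancelled as soon as a newer question arrives.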
Try it yourself
Paste your long prompt into the playground, ask a question, and see what SuperCompress keeps and removes. Free, no signup needed.