Advanced guide

Streaming compression strategies

Most compression happens synchronously before the LLM call. Streaming compression overlaps compression with LLM streaming to hide latency entirely.

By Arjun Shah - Creator of SuperCompress - Updated 2026-07-03

The overlap technique

While the LLM streams its response to the current query, compress the conversation history for the next turn in parallel. By the time the user asks their next question, the compressed history is ready — zero additional latency.

Implementation

from supercompress import Compressor
comp = Compressor()
import asyncio

async def chat_with_overlap(messages, user_query):
    # Start compression of current history in background
    history = "\n".join(m["content"] for m in messages[:-1])
    compressed_future = asyncio.create_task(
        asyncio.to_thread(comp.compress, history, user_query)
    )
    # Stream the LLM response
    response = await llm.stream(messages, user_query)
    # Compressed result is ready for next turn
    compressed = await compressed_future
    return response, compressed.compressed_text

Frequently asked questions

Does this add any latency?

No. The compression happens in parallel with the LLM response. Zero additional latency.

What if the user asks multiple questions rapidly?

Compress the accumulated history after the last question. The overlap technique handles rapid-fire questions gracefully.

Try it yourself

Paste your long prompt into the playground, ask a question, and see what SuperCompress keeps and removes. Free, no signup needed.

Open the Playground Embed the badge