OpenAI integration guide

OpenAI prompt compression integration

SuperCompress integrates with the OpenAI Python SDK by wrapping the client. The wrapper compresses the context of each chat completion request before it is sent, reducing input-token costs without changing your application logic.

By Arjun Shah - Creator of SuperCompress - Updated 2026-07-03

Why compress OpenAI prompts

GPT-4o costs $2.50 per million input tokens. A typical agent making 1,000 calls per day with 4,000-token prompts sends 4 million input tokens a day, which costs $10/day on input alone. If SuperCompress removes 65% of those tokens, the same agent sends 1.4 million tokens and spends $3.50/day. That saves $6.50/day, or about $2,372 per year for a single agent deployment.
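The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: `daily_input_cost` is a hypothetical helper (not part of SuperCompress), it counts input tokens only, and `compression_ratio` is assumed to mean the fraction of tokens removed.

```python
# Hypothetical helper for estimating GPT-4o input spend; not part of SuperCompress.
GPT4O_INPUT_PRICE_PER_M = 2.50  # USD per million input tokens, as quoted above


def daily_input_cost(calls_per_day: int, tokens_per_call: int,
                     compression_ratio: float = 0.0,
                     price_per_million: float = GPT4O_INPUT_PRICE_PER_M) -> float:
    """Return daily input-token spend in USD.

    compression_ratio is the fraction of tokens removed (0.65 means 65% removed).
    """
    tokens_sent = calls_per_day * tokens_per_call * (1 - compression_ratio)
    return tokens_sent / 1_000_000 * price_per_million


baseline = daily_input_cost(1_000, 4_000)
compressed = daily_input_cost(1_000, 4_000, compression_ratio=0.65)
annual_savings = (baseline - compressed) * 365
print(f"${baseline:.2f}/day -> ${compressed:.2f}/day, "
      f"saving ${annual_savings:,.2f}/year")
```

Swap in your own call volume, prompt size, and measured compression ratio to estimate savings for a specific deployment.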

Beyond cost, compressed prompts reduce prefill latency: GPT-4o takes longer to produce its first token for a 4,000-token prompt than for a 1,400-token one. Compression therefore lowers both cost and user-perceived latency.
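To see the latency effect on your own prompts, you can time how long the first streamed chunk takes to arrive. This is a minimal sketch: `time_to_first_chunk` and `fake_stream` are hypothetical helpers, not part of SuperCompress or the OpenAI SDK, and the fake stream only simulates a prefill delay so the example runs offline.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_to_first_chunk(make_stream: Callable[[], Iterable]) -> Tuple[float, List]:
    """Return (seconds until the first chunk arrived, all chunks received).

    The timer starts before make_stream() is called, so request setup and
    prefill are both included in the measurement.
    """
    start = time.perf_counter()
    ttft = float("nan")  # stays NaN if the stream yields nothing
    chunks = []
    for chunk in make_stream():
        if not chunks:  # this is the first chunk
            ttft = time.perf_counter() - start
        chunks.append(chunk)
    return ttft, chunks


def fake_stream():
    """Simulated streaming response: a 50 ms 'prefill' delay, then two chunks."""
    time.sleep(0.05)
    yield "Hello"
    yield ", world"


ttft, chunks = time_to_first_chunk(fake_stream)
print(f"first chunk after {ttft * 1000:.0f} ms; text = {''.join(chunks)!r}")
```

With the OpenAI SDK you would pass something like `lambda: client.chat.completions.create(model="gpt-4o", messages=messages, stream=True)`, once with the original messages and once with the compressed ones, and compare the timings over several runs, since single measurements are noisy.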

Drop-in wrapper integration

The cleanest integration pattern is a wrapper class that inherits from openai.OpenAI and intercepts chat completion requests. Your existing calls then get compression automatically: the wrapper compresses the earlier messages into one block and sends the latest message unchanged.

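from openai import OpenAI
from supercompress import Compressor

def _is_plain_text(message):
    # Leave tool calls, tool results, and multimodal content untouched
    return (
        isinstance(message, dict)
        and message.get("role") in ("system", "user", "assistant")
        and isinstance(message.get("content"), str)
        and not message.get("tool_calls")
    )

class SuperCompressOpenAI(OpenAI):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._compressor = Compressor()
        # In the v1 SDK, `chat` is a resource namespace, not a method, so wrap
        # the bound chat.completions.create method instead of overriding chat()
        original_create = self.chat.completions.create

        def create(*c_args, **c_kwargs):
            messages = c_kwargs.get("messages")
            if (isinstance(messages, list) and len(messages) > 1
                    and all(_is_plain_text(m) for m in messages)):
                # Compress the conversation history, keep the latest message intact
                history = "\n".join(m["content"] for m in messages[:-1])
                query = messages[-1]["content"]
                if history and query:
                    result = self._compressor.compress(history, query)
                    # Build a new list so the caller's messages are not mutated
                    c_kwargs["messages"] = [
                        {"role": "user", "content": result.compressed_text},
                        messages[-1],
                    ]
            return original_create(*c_args, **c_kwargs)

        self.chat.completions.create = create

client = SuperCompressOpenAI(api_key="sk-...")
response = client.chat.completions.create(model="gpt-4o", messages=[{...}])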
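To unit-test the message rewriting without network calls, the same idea can be pulled out into a pure function. This sketch is a variant, not the wrapper itself: `compress_messages` is a hypothetical name, the compressor is passed in as any callable `(history, query) -> str`, and system messages are kept verbatim (moved to the front) instead of being folded into the compressed block.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def compress_messages(messages: List[Message],
                      compress: Callable[[str, str], str]) -> List[Message]:
    """Return a new message list whose history is compressed into one message.

    System messages and the latest message are kept verbatim; the remaining
    earlier messages are joined with newlines and passed to `compress` along
    with the latest message as the query. The input list is never mutated.
    """
    if len(messages) < 2:
        return list(messages)
    *earlier, latest = messages
    system = [m for m in earlier if m.get("role") == "system"]
    history = "\n".join(
        m["content"] for m in earlier
        if m.get("role") != "system" and m.get("content")
    )
    query = latest.get("content", "")
    if not history or not query:
        return list(messages)
    summary = {"role": "user", "content": compress(history, query)}
    return system + [summary, latest]


# Stub compressor for illustration: upper-cases the history instead of compressing it.
msgs = [
    {"role": "system", "content": "Be terse."},
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "follow-up"},
]
out = compress_messages(msgs, lambda history, query: history.upper())
for m in out:
    print(m["role"], "|", m["content"])
```

With SuperCompress, the callable would be something like `lambda h, q: compressor.compress(h, q).compressed_text`, matching the `compress(history, query)` call used in the wrapper.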

Streaming support

Compression happens before the request is sent, so it works with streaming unchanged. Compress the context first, then pass the compressed text into a standard streaming call. You still receive the response token by token; you are simply billed for fewer input tokens.

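from openai import OpenAI
from supercompress import Compressor

client = OpenAI()  # reads OPENAI_API_KEY from the environment
comp = Compressor()

# long_context and query are your own document text and user question
compressed = comp.compress(long_context, query)
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer based on the context."},
        # Send the question along with the compressed context
        {"role": "user", "content": f"Context:\n{compressed.compressed_text}\n\nQuestion: {query}"},
    ],
    stream=True,
)
for chunk in stream:
    # Some chunks (for example a final usage chunk) carry no choices
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")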

Cost impact at scale

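Estimates cover input tokens only, at $2.50 per million GPT-4o input tokens, over a 30-day month, with 65% of tokens removed.

Daily calls	Tokens/call	Monthly input cost (GPT-4o)	With compression	Monthly savings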
1,000	4,000	$300	$105	$195
10,000	4,000	$3,000	$1,050	$1,950
100,000	8,000	$60,000	$21,000	$39,000
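The table follows from three assumptions: GPT-4o input pricing of $2.50 per million tokens, 30-day months, and 65% of tokens removed (35% kept). This minimal sketch regenerates the rows; `monthly_costs` is a hypothetical helper, not part of SuperCompress.

```python
PRICE_PER_M = 2.50    # USD per million GPT-4o input tokens
KEPT_FRACTION = 0.35  # 65% of tokens removed by compression
DAYS_PER_MONTH = 30


def monthly_costs(daily_calls: int, tokens_per_call: int):
    """Return (baseline, compressed, savings) monthly input cost in USD."""
    monthly_tokens = daily_calls * tokens_per_call * DAYS_PER_MONTH
    baseline = monthly_tokens / 1_000_000 * PRICE_PER_M
    compressed = baseline * KEPT_FRACTION
    return baseline, compressed, baseline - compressed


rows = [(1_000, 4_000), (10_000, 4_000), (100_000, 8_000)]
for calls, tokens in rows:
    base, comp, saved = monthly_costs(calls, tokens)
    print(f"{calls:>7,} calls/day x {tokens:,} tokens: "
          f"${base:,.0f} -> ${comp:,.0f} (saves ${saved:,.0f})")
```

Because savings scale linearly with token volume, the compression ratio you actually measure on your prompts matters more than any single row of the table.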

Frequently asked questions

Does the wrapper work with GPT-4, GPT-4 Turbo, and GPT-4o mini?

Yes. The wrapper is model-agnostic: it compresses the messages before the API call regardless of which model you pass, so it works with OpenAI chat models generally.

Will compression break function calling?

No. The wrapper only compresses the message content. Function definitions and tool schemas are passed through unchanged.

Can I use it with the async OpenAI client?

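Yes. Create a similar wrapper around AsyncOpenAI using the same pattern. The compressor itself is synchronous: at roughly 60 ms per call it briefly blocks the event loop, which is usually acceptable at low concurrency. In busy async services, run it in a worker thread with asyncio.to_thread so other requests keep being served.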
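A minimal sketch of offloading compression in async code with `asyncio.to_thread` (Python 3.9+). `StubCompressor` is a hypothetical stand-in for `supercompress.Compressor`, assuming only that it exposes `compress(history, query)` returning an object with `.compressed_text`, as in the examples above; `compress_async` is likewise a hypothetical helper name.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class Result:
    compressed_text: str


class StubCompressor:
    """Hypothetical stand-in for supercompress.Compressor (blocks ~60 ms per call)."""

    def compress(self, history: str, query: str) -> Result:
        time.sleep(0.06)  # simulate synchronous compression work
        return Result(history[: len(history) // 2])


async def compress_async(compressor, history: str, query: str) -> str:
    """Run the blocking compress() in a worker thread so the event loop stays free."""
    result = await asyncio.to_thread(compressor.compress, history, query)
    return result.compressed_text


async def main() -> list:
    compressor = StubCompressor()
    start = time.perf_counter()
    # Eight compressions share the default thread pool instead of
    # blocking the event loop one after another
    texts = await asyncio.gather(
        *(compress_async(compressor, "x" * 100, "q") for _ in range(8))
    )
    print(f"{len(texts)} compressions in {time.perf_counter() - start:.2f}s")
    return texts


texts = asyncio.run(main())
```

In an AsyncOpenAI wrapper, you would `await compress_async(...)` before `await client.chat.completions.create(...)`. Threads keep the event loop responsive, but CPU-bound pure-Python work still contends for the GIL, so throughput gains depend on how the compressor is implemented.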

Try it yourself

Paste your long prompt into the playground, ask a question, and see what SuperCompress keeps and removes. Free, no signup needed.
