OpenAI integration guide
OpenAI prompt compression integration
SuperCompress integrates with the OpenAI Python SDK by wrapping the client. The wrapper compresses the context of each chat completion call before it is sent, which lowers input-token cost without changes to your application logic.
Why compress OpenAI prompts
GPT-4o costs $2.50 per million input tokens. A typical agent making 1,000 calls per day with 4,000-token prompts sends 4 million input tokens per day, which is $10/day on input alone. If SuperCompress removes 65% of those tokens, the same agent costs $3.50/day. Over a year, that is about $2,372 saved for a single agent deployment.
Beyond cost, compressed prompts reduce prefill latency: GPT-4o takes longer to produce its first token for a 4,000-token prompt than for a 1,400-token one. Compression therefore improves both cost and responsiveness.
Drop-in wrapper integration
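The cleanest integration pattern is a wrapper class that inherits from openai.OpenAI and intercepts chat completion calls. It compresses every message before the latest one into a single block and leaves the latest message untouched, so every call made through your existing client gets compression automatically.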
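
    from openai import OpenAI
    from supercompress import Compressor


    class SuperCompressOpenAI(OpenAI):
        """OpenAI client that compresses conversation history before each chat call."""

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._compressor = Compressor()
            # Assumes openai>=1.0, where `chat` is a cached resource object rather
            # than a method, so wrap `chat.completions.create` instead of `chat`.
            self._uncompressed_create = self.chat.completions.create
            self.chat.completions.create = self._compressed_create

        def _compressed_create(self, *args, **kwargs):
            messages = list(kwargs.get("messages") or [])  # do not mutate the caller's list
            if len(messages) > 1:
                # Compress the conversation history, keep the latest message intact.
                # Only plain-text content is joined; None or list content is skipped.
                history = "\n".join(
                    m["content"] for m in messages[:-1] if isinstance(m.get("content"), str)
                )
                query = messages[-1].get("content", "")
                if history and isinstance(query, str) and query:
                    result = self._compressor.compress(history, query)
                    # Replace history with compressed version
                    messages[:-1] = [{"role": "user", "content": result.compressed_text}]
                    kwargs["messages"] = messages
            return self._uncompressed_create(*args, **kwargs)


    client = SuperCompressOpenAI(api_key="sk-...")
    # `messages` is your existing list of chat messages (history plus latest turn)
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
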
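The wrapper above folds every earlier message, including any system prompt, into one user message. If the system prompt has to reach the model verbatim, one possible variation is sketched below as a pure function. This is not part of SuperCompress: `compress_history` and its `compress` argument are hypothetical names, and `compress` stands for any callable that takes the history and the query and returns compressed text.

```python
from typing import Callable

Message = dict  # a chat message such as {"role": "user", "content": "..."}


def compress_history(
    messages: list[Message],
    compress: Callable[[str, str], str],
) -> list[Message]:
    """Return a new message list with the earlier turns compressed.

    System messages and the latest message are kept verbatim. `compress` is a
    hypothetical callable taking (history, query) and returning compressed text.
    """
    if len(messages) < 2:
        return list(messages)

    *earlier, latest = messages
    # System messages are kept as-is and moved to the front of the new list.
    system = [m for m in earlier if m.get("role") == "system"]
    # Only plain-text turns are joined; turns with other content (for example
    # content=None on a tool call) are left out of the text that is compressed.
    turns = [
        f"{m.get('role', 'user')}: {m['content']}"
        for m in earlier
        if m.get("role") != "system" and isinstance(m.get("content"), str)
    ]
    query = latest.get("content")
    if not turns or not isinstance(query, str) or not query:
        return list(messages)  # nothing to compress, or no text query to guide it

    summary = compress("\n".join(turns), query)
    return system + [{"role": "user", "content": summary}, latest]
```

With the API shown in this guide, the callable could be `lambda history, query: Compressor().compress(history, query).compressed_text`. Keeping the function free of SDK calls makes it easy to unit-test with a fake compressor before wiring it into a client.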
Streaming support
Compression runs before the request is sent, so it does not interfere with streaming. Compress the context first, then pass the compressed text into a standard streaming call. The streaming behavior is unchanged: you get the same token-by-token response, just with fewer input tokens billed.
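
    from openai import OpenAI
    from supercompress import Compressor

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    comp = Compressor()

    # long_context and query are your own strings: the material to compress
    # and the question being asked about it.
    compressed = comp.compress(long_context, query)

    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based on the context."},
            # Send the question together with the compressed context
            {"role": "user", "content": f"{compressed.compressed_text}\n\nQuestion: {query}"},
        ],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or no text delta
        if chunk.choices:
            print(chunk.choices[0].delta.content or "", end="")
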
Cost impact at scale
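The figures assume a 30-day month, the $2.50 per million input token price, and a 65% token reduction.

| Daily Calls | Input Tokens/Call | Monthly Input Cost (GPT-4o) | With Compression | Savings |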
|---|---|---|---|---|
| 1,000 | 4,000 | $300 | $105 | $195 |
| 10,000 | 4,000 | $3,000 | $1,050 | $1,950 |
| 100,000 | 8,000 | $60,000 | $21,000 | $39,000 |
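The table's figures can be reproduced with a few lines of arithmetic. The sketch below is a hedged illustration, not an official calculator: the function name and constants are chosen here, and the 30-day month is an assumption that happens to match the rows above.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 2.50  # USD, the GPT-4o input price quoted in this guide
REDUCTION = 0.65                       # fraction of input tokens removed by compression
DAYS_PER_MONTH = 30                    # assumption that reproduces the table


def input_cost(
    calls_per_day: int,
    tokens_per_call: int,
    days: int = DAYS_PER_MONTH,
    reduction: float = 0.0,
) -> float:
    """Input-token cost in USD over `days`, after removing `reduction` of the tokens."""
    tokens = calls_per_day * tokens_per_call * days * (1.0 - reduction)
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


for calls, tokens in [(1_000, 4_000), (10_000, 4_000), (100_000, 8_000)]:
    before = input_cost(calls, tokens)
    after = input_cost(calls, tokens, reduction=REDUCTION)
    print(f"{calls:>7,} calls/day: ${before:,.0f} -> ${after:,.0f} (saves ${before - after:,.0f})")
```

Changing `REDUCTION` or the price constant shows how sensitive the savings are to the compression ratio actually achieved on your own prompts, which may differ from 65%.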
Frequently asked questions
Does the wrapper work with GPT-4, GPT-4 Turbo, and GPT-4o mini?
Yes. The wrapper is model-agnostic: compression happens before the API call, so it works with any OpenAI chat model.
Will compression break function calling?
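No. The wrapper only rewrites message content; function definitions and tool schemas are passed through unchanged. Note that the wrapper shown above merges earlier turns into a single user message, so tool-call and tool-result messages in the history are not preserved as separate structured messages. Test agent loops that rely on them.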
Can I use it with the async OpenAI client?
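Yes. Create a similar wrapper around AsyncOpenAI using the same pattern. The compressor is synchronous and fast (~60ms), so calling it directly inside an async function stalls the event loop only briefly. Under high concurrency, consider running it in a worker thread so other tasks keep making progress.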
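One way to keep a synchronous compressor off the event loop is `asyncio.to_thread` from the standard library (Python 3.9+). The sketch below is a minimal illustration under stated assumptions: `FakeCompressor` and `FakeResult` are hypothetical stand-ins that only mirror the `compress(context, query)` call and `compressed_text` attribute used in this guide, and the 60 ms sleep simulates the compression time.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class FakeResult:
    compressed_text: str


class FakeCompressor:
    """Hypothetical stand-in mirroring the compress(context, query) call in this guide."""

    def compress(self, context: str, query: str) -> FakeResult:
        time.sleep(0.06)  # simulate roughly 60 ms of synchronous work
        return FakeResult(compressed_text=context[:20])


async def compress_async(compressor, context: str, query: str) -> str:
    """Run the synchronous compressor in a worker thread so the event loop stays free."""
    result = await asyncio.to_thread(compressor.compress, context, query)
    return result.compressed_text


async def main() -> list:
    comp = FakeCompressor()
    # The three compressions overlap instead of running back to back.
    return await asyncio.gather(
        *(compress_async(comp, f"context number {i} " * 5, "question") for i in range(3))
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In an application, the returned text would then be passed to `await client.chat.completions.create(...)` on an `AsyncOpenAI` client. Whether the real compressor is safe to call from several threads at once is not stated in this guide, so verify that before sharing one instance.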
Try it yourself
Paste your long prompt into the playground, ask a question, and see what SuperCompress keeps and removes. Free, no signup needed.