Model optimization
Llama 3 compression
Self-hosting Llama 3 removes per-token API fees, but the compute cost of processing long prompts remains. Compression cuts prompt size by 65%, which reduces both GPU memory use and inference time.
Self-hosted economics
Assume Llama 3 70B on an A100 GPU processes ~30 tokens/second. At that rate, a 4,000-token prompt takes ~133 seconds for prefill alone. Compressing it to 1,400 tokens cuts prefill to ~47 seconds. That frees about 87 seconds of GPU time per query, which can go to generation instead.
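The arithmetic above can be checked with a short sketch. It assumes prefill time scales linearly with prompt length at the ~30 tokens/second rate quoted in this section; the function and constant names are illustrative and not part of any SuperCompress API.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimate prefill time, assuming it scales linearly with prompt length."""
    return prompt_tokens / tokens_per_second


# Figures taken from the text above; the rate is an assumption, not a benchmark.
RATE_TOKENS_PER_SECOND = 30.0
ORIGINAL_TOKENS = 4_000
COMPRESSED_TOKENS = 1_400

before = prefill_seconds(ORIGINAL_TOKENS, RATE_TOKENS_PER_SECOND)
after = prefill_seconds(COMPRESSED_TOKENS, RATE_TOKENS_PER_SECOND)
saved = before - after
reduction = 1 - COMPRESSED_TOKENS / ORIGINAL_TOKENS

print(f"prefill before: {before:.1f} s")
print(f"prefill after:  {after:.1f} s")
print(f"saved per query: {saved:.1f} s ({reduction:.0%} less prefill)")
```

Swap in a measured throughput for your own hardware to get a realistic estimate; the relative saving stays at 65% as long as the linear assumption holds.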
Frequently asked questions
Does compression increase throughput?
Yes. Cutting prefill time by ~65% lets each GPU serve more queries.
Does it reduce GPU memory usage?
Yes. Smaller prompts require less KV-cache memory, allowing larger batch sizes.
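As a rough illustration of the memory effect, the sketch below estimates KV-cache size per sequence. It assumes the published Llama 3 70B configuration (80 layers, 8 key/value heads, head dimension 128) and 16-bit cache values. The 10 GiB cache budget is a hypothetical number chosen for the example, and tokens generated after the prompt are ignored.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int = 80,          # Llama 3 70B transformer layers
    kv_heads: int = 8,         # grouped-query attention key/value heads
    head_dim: int = 128,       # dimension per attention head
    bytes_per_value: int = 2,  # assumes an fp16/bf16 cache
) -> int:
    """Bytes of KV cache for one sequence of the given length."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 2**30
CACHE_BUDGET_BYTES = 10 * GIB  # hypothetical memory set aside for the KV cache

for label, tokens in [("original", 4_000), ("compressed", 1_400)]:
    per_sequence = kv_cache_bytes(tokens)
    max_batch = CACHE_BUDGET_BYTES // per_sequence
    print(f"{label}: {per_sequence / GIB:.2f} GiB per sequence, "
          f"up to {max_batch} sequences in the budget")
```

Because the cache grows linearly with token count, a 65% shorter prompt needs 65% less cache per sequence, which is where the room for larger batches comes from.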
Try it yourself
Paste your long prompt into the playground, ask a question, and see what SuperCompress keeps and removes. Free, no signup needed.