A curated, community-driven list of the best LLM token compression tools, libraries, research papers, benchmarks, and techniques. Contributions welcome.
Open-source learned token compression engine. A ~5K-parameter CPU policy scores every line of context against the current question and keeps only what matters for the answer. Cuts ~65% of tokens with 100% oracle recall. Runs in ~60 ms on CPU. Install with pip install supercompress.
Research-backed prompt compression framework from Microsoft. Uses small language models to identify and remove redundant tokens. Supports dynamic compression ratios and has integrations with Prompt flow.
Built-in prompt compression in LiteLLM's SDK. Compresses long conversation histories before calling completion APIs. Supports caching replaced content and retrieval tools for restoring context.
Online tool for prompt compression featured in Google's Gemini API ecosystem. Demonstrates the growing industry focus on token optimization.
Introduces a ~5K-parameter learned policy for query-aware context compression. Achieves 100% oracle recall at 65% token reduction on benchmark seeds. Nine line-level features including recency, position, and query overlap.
Proposes using small language models to identify and remove redundant tokens from prompts. Demonstrates up to 20x compression while maintaining task performance on multiple benchmarks.
Attention-based KV cache eviction policy that identifies "heavy hitter" tokens critical for generation. Achieves ~98% oracle recall but requires access to model attention weights.
Explores sparse attention mechanisms to reduce the effective context length during inference. Demonstrates that many tokens can be safely ignored without degrading output quality.
Recurrent Memory Transformer approach that extends effective context length. Relevant as context compression complements memory architectures by reducing the memory footprint per token.
Influential study showing that LLMs perform significantly worse on information in the middle of the context. Provides the empirical motivation for query-aware token compression over position-based truncation.
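The query-aware, line-level scoring idea that recurs in the papers above can be illustrated with a small sketch. This is not the SuperCompress policy or any library's API. It uses three hand-picked features (query overlap, recency, edge position) with made-up weights, whereas a learned policy would fit its weights from data. The names `score_line`, `compress_lines`, and `DEFAULT_WEIGHTS` are hypothetical, and the budget is counted in lines rather than tokens to keep the example short.

```python
import re

# Hypothetical, hand-set weights; a learned policy would fit these from data.
DEFAULT_WEIGHTS = {"overlap": 3.0, "recency": 0.5, "position": 0.25}


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_line(line, index, total, query_terms, weights):
    """Score one context line against the query with three simple features."""
    terms = tokenize(line)
    # Fraction of query terms that also appear in this line.
    overlap = len(terms & query_terms) / len(query_terms) if query_terms else 0.0
    # Later lines get a higher recency value, in (0, 1].
    recency = (index + 1) / total
    # Small bonus for the first and last line of the context.
    position = 1.0 if index in (0, total - 1) else 0.0
    return (
        weights["overlap"] * overlap
        + weights["recency"] * recency
        + weights["position"] * position
    )


def compress_lines(lines, query, budget=0.35, weights=None):
    """Keep the top-scoring fraction of lines, preserving original order."""
    if not lines:
        return []
    weights = weights or DEFAULT_WEIGHTS
    query_terms = tokenize(query)
    total = len(lines)
    keep_n = max(1, round(total * budget))
    ranked = sorted(
        range(total),
        key=lambda i: score_line(lines[i], i, total, query_terms, weights),
        reverse=True,
    )
    kept = sorted(ranked[:keep_n])  # restore document order
    return [lines[i] for i in kept]


if __name__ == "__main__":
    context = [
        "The deploy script lives in tools/deploy.sh.",
        "Lunch is at noon.",
        "The database password rotates every 30 days.",
        "Weather was nice yesterday.",
        "Remember to water the plants.",
        "The office closes at six.",
    ]
    for kept_line in compress_lines(context, "How often does the database password rotate?"):
        print(kept_line)
```

Unlike position-based truncation, a line in the middle of the context survives here whenever it overlaps strongly with the query.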
Python library for learned token compression. Supports adaptive mode, fixed-budget mode, and hosted API mode. Integrates with any LLM pipeline. Install with pip install supercompress.
Microsoft's prompt compression library. Uses a smaller LM to compress prompts while preserving semantic content. Supports iterative compression and dynamic ratios.
LangChain's built-in prompt compression module. Provides a base class for custom compressors and integrates with LLMLingua and other backends.
Comprehensive comparison of SuperCompress against FIFO/truncation, summarization, and H2O across oracle recall, entity recall, latency, and KV savings at a 35% token budget. Eight seeds, fixed methodology.
Adaptive compression on long contexts: 85–95% token removal on book chapters, coding sessions, and documentation while preserving answer-critical lines.
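The two headline metrics in these benchmarks can be computed along the following lines. This sketch assumes that oracle recall means the fraction of answer-critical ("oracle") lines that survive compression. The benchmarks listed here may define it differently, so check their methodology. The function names are hypothetical, and tokens are counted by whitespace splitting as a stand-in for a real tokenizer.

```python
def oracle_recall(kept_indices, oracle_indices):
    """Fraction of answer-critical line indices that were kept.

    Assumed definition; returns 1.0 when there are no oracle lines to recall.
    """
    oracle = set(oracle_indices)
    if not oracle:
        return 1.0
    return len(oracle & set(kept_indices)) / len(oracle)


def token_reduction(original_lines, kept_lines, count_tokens=None):
    """Fraction of tokens removed by compression, in [0, 1]."""
    # Whitespace splitting is a crude placeholder for a model tokenizer.
    count_tokens = count_tokens or (lambda text: len(text.split()))
    original = sum(count_tokens(line) for line in original_lines)
    kept = sum(count_tokens(line) for line in kept_lines)
    return 1.0 - kept / original if original else 0.0


if __name__ == "__main__":
    lines = ["alpha beta gamma delta", "epsilon zeta", "eta theta iota kappa"]
    kept = [1]          # indices retained by some compressor
    oracle = [1]        # indices a human marked as answer-critical
    print(oracle_recall(kept, oracle))
    print(token_reduction(lines, [lines[i] for i in kept]))
```

Reporting both numbers together matters: a compressor can trivially reach full recall by removing nothing, or high reduction by dropping the answer.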
Comprehensive guide covering what token compression is, why it matters, methods comparison, benchmarks, cost savings analysis, and implementation examples with Python, cURL, and LangChain.
Tutorial from IBM covering extractive compression, abstractive summarization, and embedding-based filtering techniques for RAG pipelines.
Hands-on playground to test token compression on your own text. Compare original vs compressed output side by side with real-time token savings, latency, and quality metrics.
⭐ Know a tool or paper that should be here? Submit a pull request on GitHub or email us.