⭐ Curated 📅 Updated 2026 🐍 Python & more 🔬 Research + Tools

Awesome Token Compression ✨

A curated, community-driven list of the best LLM token compression tools, libraries, research papers, benchmarks, and techniques. Contributions welcome.

🗂️ 4 Tools 📄 6 Papers & Research 📚 3 Libraries & SDKs ⚡ 2 Benchmarks 📖 3 Guides & Tutorials

🔧 Tools (4)

SuperCompress MIT

github.com/arjunkshah/supercompress · supercompress.dev

Open-source learned token compression engine. A ~5K-parameter policy scores every line of context against the current question and keeps only the lines that matter for the answer. Cuts ~65% of tokens with 100% oracle recall on its benchmark seeds, and runs in ~60ms on CPU. Install with pip install supercompress.

⭐ 100% oracle recall ⚡ ~60ms CPU 🐍 Python 🌐 Hosted API

LLMLingua MIT

github.com/microsoft/LLMLingua · Microsoft Research

Research-backed prompt compression framework from Microsoft. Uses small language models to identify and remove redundant tokens. Supports dynamic compression ratios and has integrations with Prompt flow.

⭐ Top-rated on GitHub 📦 pip install llmlingua 🔬 Research-backed

LiteLLM Prompt Compression Apache 2.0

docs.litellm.ai

Built-in prompt compression in LiteLLM's SDK. Compresses long conversation histories before calling completion APIs. Supports caching replaced content and retrieval tools for restoring context.

🔄 Built-in SDK 🔗 LangChain integration 📦 20K+ monthly downloads

Prompt Compress (Google AI) Research

Featured in Gemini API Developer Competition

Online tool for prompt compression featured in Google's Gemini API ecosystem. Demonstrates the growing industry focus on token optimization.

📄 Research Papers (6)

SuperCompress: Efficient LLM Context Compression

Shah, Arjun (2025) · supercompress.dev/research

Introduces a ~5K-parameter learned policy for query-aware context compression. Achieves 100% oracle recall at 65% token reduction on benchmark seeds. Nine line-level features including recency, position, and query overlap.
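
To make the idea of query-aware line scoring concrete, here is a minimal sketch. It is not the SuperCompress policy: it uses only the three features named above (query overlap, recency, position), the weights are invented for illustration rather than learned, and every function name is hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def score_line(line, index, total, query_terms):
    """Score one line from three hand-set features (weights are made up)."""
    terms = set(tokenize(line))
    overlap = len(terms & query_terms) / max(len(query_terms), 1)
    recency = (index + 1) / total            # later lines score higher
    position = 1.0 if index == 0 else 0.0    # small bonus for the opening line
    return 3.0 * overlap + 0.5 * recency + 0.5 * position


def compress(context, query, budget=0.35):
    """Greedily keep the best-scoring lines within ~budget of the tokens."""
    lines = [ln for ln in context.splitlines() if ln.strip()]
    query_terms = set(tokenize(query))
    costs = [len(tokenize(ln)) for ln in lines]
    limit = budget * sum(costs)
    ranked = sorted(
        range(len(lines)),
        key=lambda i: score_line(lines[i], i, len(lines), query_terms),
        reverse=True,
    )
    kept, used = [], 0
    for i in ranked:
        # Skip lines that do not fit, but always keep at least one line.
        if used + costs[i] <= limit or not kept:
            kept.append(i)
            used += costs[i]
    # Emit the survivors in their original order.
    return "\n".join(lines[i] for i in sorted(kept))
```

Calling compress(context, query) returns the retained lines in their original order. A learned policy would replace the fixed weights in score_line with trained parameters.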

LLMLingua: Compressing Prompts for Accelerated Inference

Microsoft Research (2023) · arxiv.org/abs/2310.05736

Proposes using small language models to identify and remove redundant tokens from prompts. Demonstrates up to 20x compression while maintaining task performance on multiple benchmarks.
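
The general idea of scoring tokens and dropping the least informative ones can be illustrated with a toy. This is not LLMLingua's algorithm or API: a unigram frequency table stands in for the small language model, and the function names and the keep_ratio parameter are hypothetical.

```python
import math
from collections import Counter


def self_information(tokens, reference_counts):
    """Surprisal of each token under an add-one-smoothed unigram model."""
    total = sum(reference_counts.values())
    vocab = len(reference_counts) + 1
    return [
        -math.log((reference_counts.get(tok, 0) + 1) / (total + vocab))
        for tok in tokens
    ]


def drop_redundant(prompt, reference_text, keep_ratio=0.5):
    """Keep the most surprising tokens, preserving their original order."""
    tokens = prompt.split()
    counts = Counter(reference_text.lower().split())
    info = self_information([tok.lower() for tok in tokens], counts)
    n_keep = max(1, math.ceil(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: info[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return " ".join(tokens[i] for i in keep)
```

Frequent function words carry little information under the reference model and are removed first. A real system would score tokens with a language model conditioned on the preceding text instead of a unigram table.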

H2O: Heavy-Hitter Oracle for Efficient Generative Inference

Zhang et al. (2023) · arxiv.org/abs/2306.14048

Attention-based KV cache eviction policy that identifies "heavy hitter" tokens critical for generation. Achieves ~98% oracle recall but requires access to model attention weights.
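
A simplified sketch of heavy-hitter selection follows. It assumes the attention weights are already available as plain lists, and it picks the keep set in one pass over a finished sequence, whereas the paper's policy evicts greedily at each decoding step. The function and parameter names are hypothetical.

```python
def h2o_keep_set(attention_rows, budget, recent):
    """Choose which cached token positions to keep.

    attention_rows[t][j] is the attention weight that step t places on
    cached token j (causal, so row t has t + 1 entries).
    """
    n = len(attention_rows)
    # Accumulate the attention each cached token has received so far.
    accumulated = [0.0] * n
    for row in attention_rows:
        for j, weight in enumerate(row):
            accumulated[j] += weight
    # Always keep the most recent tokens.
    recent_ids = set(range(max(0, n - recent), n))
    # Fill the remaining budget with the highest-scoring older tokens.
    heavy_slots = max(0, budget - len(recent_ids))
    older = [j for j in range(n) if j not in recent_ids]
    heavy_ids = sorted(older, key=lambda j: accumulated[j], reverse=True)[:heavy_slots]
    return sorted(recent_ids | set(heavy_ids))
```

The split between recent tokens and heavy hitters is the design point: recent tokens preserve local context, while the accumulated scores retain older tokens that later steps keep attending to.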

Selective Context: Leveraging Sparse Attention for Efficient Inference

Princeton & Google Research (2024) · arxiv.org/abs/2401.02669

Explores sparse attention mechanisms to reduce the effective context length during inference. Demonstrates that many tokens can be safely ignored without degrading output quality.

Scaling Transformer to 1M Tokens and Beyond with RMT

Bulatov, Kuratov & Burtsev (2023) · arxiv.org/abs/2304.11062

Recurrent Memory Transformer approach that extends effective context length. Relevant as context compression complements memory architectures by reducing the memory footprint per token.

Lost in the Middle: How Language Models Use Long Contexts

Stanford (2024) · arxiv.org/abs/2307.03172

Influential study showing that LLMs perform significantly worse when the relevant information sits in the middle of a long context rather than near the beginning or end. Provides the empirical motivation for query-aware token compression over position-based truncation.

📚 Libraries & SDKs (3)

supercompress (Python) MIT

pypi.org/project/supercompress

Python library for learned token compression. Supports adaptive mode, fixed-budget mode, and hosted API mode. Integrates with any LLM pipeline. Install with pip install supercompress.

llmlingua (Python) MIT

pypi.org/project/llmlingua

Microsoft's prompt compression library. Uses a smaller LM to compress prompts while preserving semantic content. Supports iterative compression and dynamic ratios.

LangChain Compression (Python/JS) MIT

langchain.com

LangChain's built-in prompt compression module. Provides a base class for custom compressors and integrates with LLMLingua and other backends.

⚡ Benchmarks & Comparisons (2)

SuperCompress vs Baselines

supercompress.dev/benchmarks

Comprehensive comparison of SuperCompress against FIFO/truncation, summarization, and H2O across oracle recall, entity recall, latency, and KV savings at a 35% token budget. Uses 8 seeds and a fixed methodology.

🏆 100% oracle recall 📊 8 benchmark seeds 🔬 Reproducible
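
The two headline metrics can be computed as below. The definitions are assumptions, since the list does not spell them out: oracle recall is taken to be the fraction of answer-critical lines that survive compression, and token reduction is the fraction of tokens removed. Function names are hypothetical.

```python
def oracle_recall(kept_lines, oracle_lines):
    """Fraction of answer-critical (oracle) line ids that were kept."""
    oracle = set(oracle_lines)
    if not oracle:
        return 1.0  # nothing to recall, so nothing was lost
    return len(oracle & set(kept_lines)) / len(oracle)


def token_reduction(original_tokens, compressed_tokens):
    """Fraction of tokens removed by compression."""
    return 1.0 - compressed_tokens / original_tokens
```

Under these definitions, a 35% token budget corresponds to a 65% token reduction, and 100% oracle recall means every answer-critical line is still present in the compressed context.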

Adaptive Mode Benchmarks

supercompress.dev/benchmarks · Real-world contexts

Adaptive compression on long contexts: 85–95% token removal on book chapters, coding sessions, and documentation while preserving answer-critical lines.

📖 Guides & Tutorials (3)

Token Compression for LLMs: The Complete Guide

supercompress.dev/token-compression

Comprehensive guide covering what token compression is, why it matters, methods comparison, benchmarks, cost savings analysis, and implementation examples with Python, cURL, and LangChain.

IBM Prompt Compression Tutorial

ibm.com/think/tutorials/prompt-compression

Tutorial from IBM covering extractive compression, abstractive summarization, and embedding-based filtering techniques for RAG pipelines.

SuperCompress Interactive Demo

supercompress.dev/playground · Comparison tool

Hands-on playground to test token compression on your own text. Compare original vs compressed output side by side with real-time token savings, latency, and quality metrics.

⭐ Know a tool or paper that should be here? Submit a pull request on GitHub or email us.