โญ Curated ๐Ÿ“… Updated 2026 ๐Ÿ Python & more ๐Ÿ”ฌ Research + Tools

Awesome Token Compression ✨

A curated, community-driven list of the best LLM token compression tools, libraries, research papers, benchmarks, and techniques. Contributions welcome.

🗂️ 4 Tools 📄 6 Papers & Research 📚 3 Libraries & SDKs ⚡ 2 Benchmarks 📖 3 Guides & Tutorials

🔧 Tools (4)

LLMLingua MIT

github.com/microsoft/LLMLingua · Microsoft Research

Research-backed prompt compression framework from Microsoft. Uses a small language model to identify and remove redundant tokens. Supports dynamic compression ratios and integrates with Prompt flow.

โญ Top-rated on GitHub ๐Ÿ“ฆ pip install llmlingua ๐Ÿ”ฌ Research-backed

LiteLLM Prompt Compression Apache 2.0

Built-in prompt compression in LiteLLM's SDK. Compresses long conversation histories before calling completion APIs. Supports caching replaced content and retrieval tools for restoring context.

🔄 Built-in SDK 🔗 LangChain integration 📦 20K+ monthly downloads

Prompt Compress (Google AI) Research

Featured in Gemini API Developer Competition

Online tool for prompt compression featured in Google's Gemini API ecosystem. Demonstrates the growing industry focus on token optimization.

📄 Research Papers (6)

SuperCompress: Efficient LLM Context Compression

Shah, Arjun (2025) · supercompress.dev/research

Introduces a ~5K-parameter learned policy for query-aware context compression. Achieves 100% oracle recall at 65% token reduction on benchmark seeds. The policy scores each line using nine line-level features, including recency, position, and query overlap.
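A minimal sketch of what query-aware, line-level scoring can look like. It is not the SuperCompress policy: the three features, the hand-set weights, the `compress` function name, and the word-count "tokenizer" are all assumptions made for illustration, standing in for the learned policy and its nine features.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a stand-in for a real tokenizer (assumption)."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def score_line(index, total, line, query_terms, weights):
    """Combine three illustrative line-level features into one score."""
    words = set(tokenize(line))
    overlap = len(words & query_terms) / max(len(query_terms), 1)
    recency = (index + 1) / total            # later lines score higher
    is_first = 1.0 if index == 0 else 0.0    # crude position feature
    return (weights["overlap"] * overlap
            + weights["recency"] * recency
            + weights["position"] * is_first)

def compress(lines, query, keep_ratio=0.35, weights=None):
    """Keep the highest-scoring lines that fit the token budget, in original order."""
    # Hand-set weights (hypothetical); a learned policy would fit these from data.
    weights = weights or {"overlap": 3.0, "recency": 0.5, "position": 0.25}
    query_terms = set(tokenize(query))
    budget = keep_ratio * sum(len(tokenize(l)) for l in lines)
    ranked = sorted(
        range(len(lines)),
        key=lambda i: score_line(i, len(lines), lines[i], query_terms, weights),
        reverse=True,
    )
    kept, used = [], 0
    for i in ranked:
        cost = len(tokenize(lines[i]))
        if used + cost <= budget:
            kept.append(i)
            used += cost
    return [lines[i] for i in sorted(kept)]
```

Calling `compress(history_lines, "How often does the database password rotate?")` would keep the lines that share terms with the question first, then fill any remaining budget with recent lines.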

LLMLingua: Compressing Prompts for Accelerated Inference

Microsoft Research (2023) · arxiv.org/abs/2310.05736

Proposes using a small language model to identify and remove redundant tokens from prompts. Demonstrates up to 20x compression with little loss in task performance on multiple benchmarks.
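The idea of dropping tokens that a small model finds predictable can be illustrated with a toy example. This sketch is not LLMLingua's algorithm: a smoothed unigram frequency table stands in for the small language model, and `surprisal_table` and `prune_tokens` are hypothetical names chosen for this example.

```python
import math
from collections import Counter

def surprisal_table(reference_tokens):
    """Return a function giving unigram surprisal (-log2 p) with add-one smoothing.

    The unigram table is a stand-in for a small language model (assumption).
    """
    counts = Counter(reference_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def surprisal(token):
        return -math.log2((counts.get(token, 0) + 1) / (total + vocab))

    return surprisal

def prune_tokens(tokens, surprisal, keep_ratio):
    """Keep the most surprising tokens, preserving their original order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: surprisal(tokens[i]),
                    reverse=True)
    return [tokens[i] for i in sorted(ranked[:n_keep])]
```

With a reference corpus dominated by function words, `prune_tokens("the cat is on the mat".split(), surprisal, 0.5)` tends to drop frequent words such as "the" and keep the rarer content words.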

H2O: Heavy-Hitter Oracle for Efficient Generative Inference

Zhang et al. (2023) · arxiv.org/abs/2306.14048

Attention-based KV cache eviction policy that identifies "heavy hitter" tokens critical for generation. Achieves ~98% oracle recall but requires access to model attention weights.
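A simplified simulation of a heavy-hitter eviction policy, to make the dependence on attention weights concrete. This is a sketch under stated assumptions, not the paper's implementation: the function name `h2o_style_evict` is hypothetical, the attention rows are supplied by the caller rather than computed by a model, and rows are not renormalized after an eviction.

```python
def h2o_style_evict(attention_rows, budget, recent):
    """Simulate a heavy-hitter cache policy over causal attention rows.

    attention_rows[t] holds the attention weights of step t over positions 0..t.
    budget is the maximum number of cached positions; the last `recent`
    positions are always protected. Returns the positions still cached.
    """
    if recent >= budget:
        raise ValueError("budget must leave room for heavy hitters")
    cached = []   # positions currently in the cache, in increasing order
    score = {}    # accumulated attention mass per cached position
    for t, row in enumerate(attention_rows):
        cached.append(t)
        score[t] = 0.0
        for pos in cached:
            score[pos] += row[pos]
        if len(cached) > budget:
            # Evict the non-recent position with the least accumulated attention.
            candidates = cached[:-recent] if recent else cached
            victim = min(candidates, key=lambda p: score[p])
            cached.remove(victim)
            del score[victim]
    return cached
```

The `attention_rows` argument is exactly the input that a black-box API does not expose, which is why this family of methods needs access to model internals.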

Selective Context: Leveraging Sparse Attention for Efficient Inference

Princeton & Google Research (2024) · arxiv.org/abs/2401.02669

Explores sparse attention mechanisms to reduce the effective context length during inference. Demonstrates that many tokens can be safely ignored without degrading output quality.

Scaling Transformer to 1M Tokens and Beyond with RMT

Bulatov et al. (2023) · arxiv.org/abs/2304.11062

Recurrent Memory Transformer approach that extends effective context length. Relevant here because context compression complements memory architectures by reducing the memory footprint per token.

Lost in the Middle: How Language Models Use Long Contexts

Stanford (2024) · arxiv.org/abs/2307.03172

Influential study showing that LLMs perform significantly worse when the relevant information sits in the middle of the context. Provides the empirical motivation for query-aware token compression over position-based truncation.

📚 Libraries & SDKs (3)

supercompress (Python) MIT

Python library for learned token compression. Supports adaptive mode, fixed-budget mode, and hosted API mode. Integrates with any LLM pipeline. Install with pip install supercompress.

llmlingua (Python) MIT

Microsoft's prompt compression library. Uses a smaller LM to compress prompts while preserving semantic content. Supports iterative compression and dynamic ratios.

LangChain Compression (Python/JS) MIT

LangChain's built-in prompt compression module. Provides a base class for custom compressors and integrates with LLMLingua and other backends.

⚡ Benchmarks & Comparisons (2)

SuperCompress vs Baselines

Compares SuperCompress against FIFO/truncation, summarization, and H2O on oracle recall, entity recall, latency, and KV savings at a 35% token budget, using 8 seeds and a fixed methodology.

๐Ÿ† 100% oracle recall ๐Ÿ“Š 8 benchmark seeds ๐Ÿ”ฌ Reproducible

Adaptive Mode Benchmarks

supercompress.dev/benchmarks · Real-world contexts

Adaptive compression on long contexts: 85–95% token removal on book chapters, coding sessions, and documentation while preserving answer-critical lines.

📖 Guides & Tutorials (3)

Token Compression for LLMs: The Complete Guide

Comprehensive guide covering what token compression is, why it matters, methods comparison, benchmarks, cost savings analysis, and implementation examples with Python, cURL, and LangChain.

IBM Prompt Compression Tutorial

Tutorial from IBM covering extractive compression, abstractive summarization, and embedding-based filtering techniques for RAG pipelines.
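As a rough illustration of embedding-based filtering in a RAG pipeline, the sketch below drops retrieved chunks that are dissimilar to the query. It is not taken from the tutorial: a bag-of-words count vector stands in for a real embedding model, and `embed`, `cosine`, `filter_chunks`, and the 0.15 threshold are assumptions made for this example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Bag-of-words count vector; a stand-in for a real embedding model (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_chunks(chunks, query, threshold=0.15):
    """Keep retrieved chunks whose similarity to the query meets the threshold."""
    q = embed(query)
    return [c for c in chunks if cosine(embed(c), q) >= threshold]
```

In practice the threshold would be tuned on held-out queries, since a value that is too high discards chunks that phrase the answer differently from the question.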

SuperCompress Interactive Demo

Hands-on playground to test token compression on your own text. Compare original vs compressed output side by side with real-time token savings, latency, and quality metrics.

โญ Know a tool or paper that should be here? Submit a pull request on GitHub or email us.