QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Authors: Rishabh Tiwari*, Haocheng Xi*, Aditya Tomar*, Coleman Hooper, Sehoon Kim, Maxwell Horton†, Mahyar Najibi†, Michael W. Mahoney†‡§, Kurt Keutzer†, Amir Gholami†‡
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups because their KV cache optimization strategies are inefficient and lead to low acceptance rates. To address these challenges, we propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates (>90%) and provides consistent end-to-end speedups of up to ∼2.5×, outperforming other self-speculative decoding methods that use a sparse KV cache for long-context LLM inference. QuantSpec also reduces memory requirements by ∼1.3× compared to these alternatives.
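To make the idea of a hierarchical 4-bit quantized KV cache concrete, the following is a minimal sketch of one possible two-level scheme. It is an illustration only: the abstract does not specify the quantization layout, so the coarse-plus-residual split, the per-list min/max scaling, and the names `quantize_hierarchical`, `dequantize_draft`, and `dequantize_target` are all assumptions rather than the authors' implementation. The sketch stores each value as a coarse 4-bit code plus a 4-bit residual code, so a draft pass could read only the coarse codes while a verification pass reads both.

```python
def quantize_hierarchical(values, bits=4):
    """Quantize a list of floats into coarse and residual codes (hypothetical scheme).

    Each value gets a `bits`-bit coarse code and a `bits`-bit residual code.
    This layout is an assumption for illustration, not taken from the paper.
    """
    levels = (1 << bits) - 1          # 15 for 4-bit codes
    lo, hi = min(values), max(values)
    coarse_scale = (hi - lo) / levels
    if coarse_scale == 0.0:           # all values equal: avoid division by zero
        coarse_scale = 1.0
    fine_scale = coarse_scale / levels

    coarse, fine = [], []
    for v in values:
        # Coarse code: nearest of 16 evenly spaced levels in [lo, hi].
        c = min(levels, max(0, round((v - lo) / coarse_scale)))
        # Residual lies in [-coarse_scale/2, +coarse_scale/2]; shift it to be
        # non-negative before quantizing it to a second 4-bit code.
        residual = v - (lo + c * coarse_scale)
        f = min(levels, max(0, round((residual + coarse_scale / 2) / fine_scale)))
        coarse.append(c)
        fine.append(f)

    return {
        "lo": lo,
        "coarse_scale": coarse_scale,
        "fine_scale": fine_scale,
        "coarse": coarse,
        "fine": fine,
    }


def dequantize_draft(q):
    """Reconstruct values from the coarse codes only (cheaper, less accurate)."""
    return [q["lo"] + c * q["coarse_scale"] for c in q["coarse"]]


def dequantize_target(q):
    """Reconstruct values from coarse and residual codes (more accurate)."""
    half = q["coarse_scale"] / 2
    return [
        q["lo"] + c * q["coarse_scale"] + f * q["fine_scale"] - half
        for c, f in zip(q["coarse"], q["fine"])
    ]


if __name__ == "__main__":
    sample = [-1.0, -0.37, 0.02, 0.5, 0.91, 1.0]
    q = quantize_hierarchical(sample)
    draft = dequantize_draft(q)
    target = dequantize_target(q)
    print("max draft error: ", max(abs(a - b) for a, b in zip(sample, draft)))
    print("max target error:", max(abs(a - b) for a, b in zip(sample, target)))
```

Under these assumptions, the coarse-only reconstruction has error at most half a coarse step, and adding the residual code tightens that bound by roughly a factor of 15. A real KV cache would quantize tensors per group or per channel and pack two 4-bit codes per byte; both are omitted here for brevity.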
EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments
May 19, 2026 | research area: Methods and Algorithms, Speech and Natural Language Processing | conference: ICML
Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model’s memory footprint to quickly exceed device limits. While recent KV cache compression methods attempt to reduce memory usage, most apply cache eviction after processing the entire…
CommVQ: Commutative Vector Quantization for KV Cache Compression
July 11, 2025 | research area: Speech and Natural Language Processing | conference: ICML
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context lengths grow. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long context LLM inference. First, we leverage additive quantization by introducing a lightweight encoder and codebook to compress the KV…