Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context lengths grow. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long context LLM inference. First, we leverage additive quantization by introducing a lightweight encoder and codebook to compress the KV cache, which can then be decoded with a simple matrix multiplication. Second, to tackle the high computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE), and utilize an Expectation-Maximization (EM) algorithm to learn the codebook. This enables efficient integration of decoding into the self-attention mechanism, significantly reducing computational overhead. Our approach achieves superior accuracy through additive quantization while lowering computational costs with our RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K demonstrate that our method reduces FP16 KV cache size by 87.5% for 2-bit quantization, while maintaining higher accuracy than state-of-the-art KV cache quantization methods. Remarkably, it enables 1-bit quantization of the KV cache with minimal accuracy degradation, making it possible to run a LLaMA-3.1 8B model with a maximum 128K context length on a single RTX 4090 GPU.
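The sketch below is a minimal, illustrative rendering of the two ideas in the abstract, not the paper's implementation. It shows (1) additive-quantization-style decoding, where a key vector is reconstructed from low-bit codes with a single matrix multiplication, and (2) a commutativity check: a codebook atom built from scaled 2x2 rotations commutes with the RoPE rotation matrix. All shapes, names (`rope_matrix`, `scaled_rotation_atom`), and the random codebook are hypothetical choices for illustration.

```python
# Minimal sketch (not the paper's implementation) of the two properties
# described in the abstract; shapes and names are hypothetical.
import numpy as np

d, num_books, book_size = 8, 4, 16          # head dim, codebooks, entries per book
rng = np.random.default_rng(0)

def rope_matrix(pos, dim, base=10000.0):
    """Block-diagonal RoPE rotation for one position (standard definition)."""
    R = np.zeros((dim, dim))
    for i in range(dim // 2):
        theta = pos / base ** (2 * i / dim)
        c, s = np.cos(theta), np.sin(theta)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

def scaled_rotation_atom(dim, rng):
    """A codebook atom made of scaled 2x2 rotation blocks (commutes with RoPE)."""
    A = np.zeros((dim, dim))
    for i in range(dim // 2):
        phi, scale = rng.uniform(0, 2 * np.pi), rng.normal()
        c, s = scale * np.cos(phi), scale * np.sin(phi)
        A[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return A

# 1) Additive-quantization-style decode as one matmul: one-hot codes @ codebook.
codebook = rng.normal(size=(num_books * book_size, d))      # stacked codebooks
codes = rng.integers(0, book_size, size=num_books)          # low-bit indices
onehot = np.zeros(num_books * book_size)
onehot[np.arange(num_books) * book_size + codes] = 1.0
k_hat = onehot @ codebook                                   # decoded key vector

# 2) Commutativity check: RoPE and a scaled-rotation atom commute block by block,
#    since 2x2 (scaled) rotations commute with each other.
R = rope_matrix(pos=5, dim=d)
C = scaled_rotation_atom(d, rng)
print(np.allclose(R @ C, C @ R))   # True
```

Because both the RoPE matrix and such atoms are block-diagonal with 2x2 (scaled) rotation blocks, their products commute block by block; this is the kind of property the abstract exploits when it integrates decoding into the self-attention computation.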

Related readings and updates.

Inference with transformer-based language models begins with a prompt processing step. In this step, the model generates the first output token and stores the KV cache needed for future generation steps. This prompt processing step can be computationally expensive, taking tens of seconds or more for billion-parameter models on edge devices as prompt lengths or batch sizes grow. This degrades user experience by introducing significant latency…
Large Language Model (LLM) inference has two phases: the prompt (or prefill) phase to output the first token, and the extension (or decoding) phase to generate subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, KV-Runahead…
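As a rough illustration of these two phases (and of why the KV cache makes decoding cheaper per token), here is a minimal single-head sketch; it is not KV-Runahead itself, and all names and shapes are hypothetical.

```python
# Illustrative two-phase inference for one attention head (not KV-Runahead).
import numpy as np

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())      # numerically stable softmax
    return (w / w.sum()) @ V

# Prompt (prefill) phase: process all prompt tokens at once and store K, V.
prompt = rng.normal(size=(1000, d))            # 1000 prompt-token embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv    # bulk of the work happens here

# Extension (decoding) phase: each new token computes only one q/k/v row and
# reuses the cache, which is why decoding is much cheaper per token.
x = rng.normal(size=d)                         # current token embedding
K_cache = np.vstack([K_cache, x @ Wk])
V_cache = np.vstack([V_cache, x @ Wv])
out = attend(x @ Wq, K_cache, V_cache)
```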