
Large Language Models (LLMs) have transformed natural language processing, but their high runtime cost remains a significant barrier to widespread deployment. In this paper, we introduce SeedLM, a novel post-training compression method that uses seeds of a pseudo-random generator to encode and compress model weights. Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block. SeedLM trades compute for fewer memory accesses, leveraging idle compute cycles during inference to speed up memory-bound tasks. Unlike state-of-the-art methods that rely on calibration data, our approach is data-free and generalizes well across diverse tasks. Our experiments with Llama 3 70B, a particularly challenging model to compress, show zero-shot accuracy retention at 4- and 3-bit compression that is on par with or better than state-of-the-art methods, while remaining comparable to the FP16 baseline. Additionally, FPGA-based tests demonstrate that 4-bit SeedLM approaches a 4x speed-up over an FP16 Llama 2/3 baseline as model size increases.

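The core idea, regenerating a pseudo-random basis from a stored LFSR seed and combining it with a small set of coefficients, can be sketched in a few lines. The snippet below is a minimal, illustrative NumPy version: the 16-bit tap positions, seed-search range, latent dimension, and the omission of coefficient quantization are simplifying assumptions for clarity, not the paper's actual configuration.

```python
import numpy as np

def lfsr_matrix(seed, rows, cols, n_bits=16):
    """Generate a pseudo-random matrix from a 16-bit Fibonacci LFSR seed.

    Taps (16, 14, 13, 11) give a maximal-length sequence; states are
    mapped to roughly zero-mean floats. Illustrative choice only, not
    the exact generator used in the paper.
    """
    state = seed & 0xFFFF
    vals = []
    for _ in range(rows * cols):
        # XOR the tap bits to form the feedback bit, then shift.
        bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << (n_bits - 1))
        vals.append(state)
    u = np.array(vals, dtype=np.float64).reshape(rows, cols)
    return u / (2 ** n_bits - 1) - 0.5

def compress_block(w, latent_dim=4, n_seeds=256):
    """Search for the seed + coefficients that best reconstruct block `w`."""
    best = None
    for seed in range(1, n_seeds + 1):  # seed 0 would lock the LFSR at zero
        U = lfsr_matrix(seed, len(w), latent_dim)
        # Least-squares coefficients for this candidate random basis.
        t, *_ = np.linalg.lstsq(U, w, rcond=None)
        err = np.linalg.norm(w - U @ t)
        if best is None or err < best[0]:
            best = (err, seed, t)
    return best[1], best[2]  # only the seed and coefficients need storing

def decompress_block(seed, t, block_size):
    """Regenerate the basis from the seed and reconstruct the weight block."""
    U = lfsr_matrix(seed, block_size, len(t))
    return U @ t

# Example: compress and reconstruct one 64-weight block.
rng = np.random.default_rng(0)
w = rng.standard_normal(64)
seed, t = compress_block(w)
w_hat = decompress_block(seed, t, len(w))
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

At inference time only the seed and the (quantized) coefficients are read from memory; the random matrix is regenerated on the fly, which is what lets the method trade spare compute for reduced memory traffic.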

Related readings and updates.

Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments

*Equal Contributors
To deploy machine learning models on-device, practitioners use compression algorithms to shrink and speed up models while maintaining their high-quality output. A critical aspect of compression in practice is model comparison, including tracking many compression experiments, identifying subtle changes in model behavior, and negotiating complex accuracy-efficiency trade-offs. However, existing compression tools poorly support…

Lossless Compression of Efficient Private Local Randomizers

This paper was accepted at the ICML 2021 conference, as well as the Theory and Practice of Differential Privacy workshop at ICML 2021. Locally Differentially Private (LDP) Reports are commonly used for collection of statistics and machine learning in the federated setting. In many cases the best known LDP algorithms require sending prohibitively large messages from the client device to the server (such as when constructing…