Long prompts present a significant challenge for practical LLM-based systems that need to operate with low latency and limited resources. We investigate prompt compression for zero-shot dialogue systems that learn to use unseen APIs directly in-context from their documentation, which may take up hundreds of prompt tokens per API. We start from a recently introduced approach (Mu et al., 2023) that learns to compress the prompt into a few “gist token” activations during finetuning. However, this simple idea is ineffective at compressing API documentation, resulting in accuracy well below the baseline that uses the uncompressed prompt. In this work, we introduce two major improvements. First, we specialize gist tokens for different hierarchies within an API: we use one Gist_arg token for compressing an argument and one Gist_value token for compressing an acceptable value of a categorical argument. We then dynamically reveal Gist_value tokens only when they are needed. Second, we add a reconstruction loss to predict the API documentation from the gist tokens. On multiple API-calling tasks, our proposed system keeps the simplicity, efficiency, and large compression factor (20x on SGD) of the gist token approach while achieving significantly better accuracy.
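
The hierarchy-specific gist tokens and the on-demand revealing of value gists can be pictured with a short sketch. The Python below is a minimal illustration under our own assumptions, not the authors' implementation: ApiDoc, build_gist_prompt, and reveal_values are hypothetical names, and the real method operates on learned gist-token activations inside a finetuned LLM rather than on strings.

```python
# Hypothetical sketch: how an API's documentation might be laid out as
# hierarchy-specific gist tokens, with Gist_value tokens revealed on demand.
# All names here are illustrative assumptions, not the paper's code.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ApiDoc:
    name: str
    # argument name -> list of acceptable values ([] for free-form arguments)
    args: Dict[str, List[str]] = field(default_factory=dict)


def build_gist_prompt(doc: ApiDoc) -> List[str]:
    """Stand in for the compressed prefix: one Gist_arg token per argument.

    In the actual method, attention is arranged so that later tokens can read
    the documentation only through these gist activations, which is what
    yields the compression."""
    tokens = [f"<api:{doc.name}>"]
    for arg in doc.args:
        tokens.append(f"<Gist_arg:{arg}>")
    return tokens


def reveal_values(doc: ApiDoc, needed_arg: str) -> List[str]:
    """Reveal Gist_value tokens only for the argument currently being filled,
    so categorical values do not occupy context until they are needed."""
    return [f"<Gist_value:{needed_arg}={v}>" for v in doc.args.get(needed_arg, [])]


if __name__ == "__main__":
    doc = ApiDoc(
        name="BookRestaurant",
        args={"city": [], "price_range": ["cheap", "moderate", "expensive"]},
    )
    prompt = build_gist_prompt(doc)
    # Only when the dialogue requires price_range are its value gists revealed.
    prompt += reveal_values(doc, "price_range")
    print(" ".join(prompt))
```

In the training setup described above, the reconstruction loss would additionally push the gist activations to regenerate the original API documentation, which is what this string-level sketch cannot show.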
