Prompt Compression in Large Language Models
Introduction
In the landscape of large language models like GPT-4, prompt length emerges as a critical consideration. The currency of the LLM world is 'tokens': every word or piece of information is represented as one or more tokens, and the number of tokens determines the computational load of processing a request. Just as high-resolution images are made up of many pixels and demand substantial storage and processing, lengthy prompts are composed of many tokens and drive up the compute, and therefore the cost, of every LLM call. The concept of prompt compression is thus not just a parallel to image or audio compression but a necessity: it reduces the size of the input (the prompt) while preserving its essential meaning, improving response efficiency and reducing the resources required. This is vital for the practical application of these models, keeping them fast and cost-effective in real-world scenarios.
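To make the cost dimension concrete, here is a small, illustrative Python snippet that compares the token counts of a verbose prompt and a terse one using the tiktoken library; the per-token price is a placeholder, not an actual rate.

```python
# Illustrative comparison of token counts (and therefore per-request cost)
# for a verbose prompt versus a terse one. Uses the tiktoken library; the
# per-token price below is a placeholder, not an actual rate.
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4")

long_prompt = "You are a helpful assistant. " * 50 + "Summarize the text below."
short_prompt = "Summarize the text below."

price_per_token = 0.00003  # placeholder rate, for illustration only

for name, prompt in [("long", long_prompt), ("short", short_prompt)]:
    n_tokens = len(encoder.encode(prompt))
    print(f"{name} prompt: {n_tokens} tokens, ~${n_tokens * price_per_token:.4f} per request")
```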
Methodologies in Prompt Compression
The paper LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models introduces innovative methodologies for prompt compression in language models like GPT-4. These methods make prompts shorter and cheaper for the model to process while keeping the crucial information. The approach consists of several strategies:
Budget Controller:
This technique divides the prompt into its components (instructions, examples, and the question) and decides how heavily each part should be compressed. It is like balancing quality and size in image compression, but here the goal is to keep the important parts of a prompt clear and concise: crucial sections such as the instruction and the question are compressed less than potentially redundant demonstrations. This selective compression is analogous to variable bitrate in audio compression, retaining quality where it matters most.
The Budget Controller dynamically allocates different compression ratios to the various components of a prompt (instructions, demonstrations, and questions), operating at the sentence or demonstration level. For this coarse-grained compression, a small language model such as GPT-2 or LLaMA computes the perplexity of each demonstration, which indicates how essential each part of the prompt is. Demonstrations are then selected in descending order of perplexity until the allocated token budget is met: the idea is to prioritize demonstrations that are more complex or informative (as indicated by higher perplexity) for retention in the compressed prompt, while maintaining the overall integrity and effectiveness of the prompt.
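As an illustration of this coarse-grained step, the sketch below ranks demonstrations by perplexity under a small model and keeps the highest-perplexity ones until a token budget is exhausted. It assumes GPT-2 via Hugging Face transformers as the small model; the function names and budget logic are illustrative, not the paper's implementation.

```python
# Minimal sketch of the coarse-grained step: rank demonstrations by perplexity
# under a small language model (GPT-2 here) and keep the highest-perplexity
# ones until a token budget is spent. Names and budget logic are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the small language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()

def select_demonstrations(demos: list[str], token_budget: int) -> list[str]:
    """Keep the most informative (highest-perplexity) demonstrations first."""
    ranked = sorted(demos, key=perplexity, reverse=True)
    kept, used = [], 0
    for demo in ranked:
        n_tokens = len(tokenizer(demo).input_ids)
        if used + n_tokens > token_budget:
            break
        kept.append(demo)
        used += n_tokens
    return kept
```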
Iterative Token-Level Prompt Compression:
This is a step-by-step process in which the prompt is broken into smaller segments and each segment is compressed in turn while maintaining semantic integrity. The challenge it addresses is preserving the contextual relationship between tokens, akin to ensuring that key frequencies are retained in audio compression.
The Iterative Token-level Prompt Compression (ITPC) algorithm in the paper works roughly as follows: the prompt is divided into segments; a small language model computes the perplexity of each token in a segment, conditioned on the already-compressed text that precedes it; tokens with high perplexity (those carrying more information) are kept while low-perplexity tokens are dropped; and the compressed segment is appended to the context before the next segment is processed.
This iterative process ensures that the final compressed prompt retains the essential information and structure required for the large language model to understand and respond effectively, while significantly reducing the size of the input.
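The sketch below illustrates the general idea at the token level: each segment's tokens are scored by their conditional perplexity given the text kept so far, and only tokens above a threshold survive. The fixed threshold and the caller-supplied segmentation are simplifications; in the paper they are derived from the target compression ratio.

```python
# Minimal sketch of iterative token-level compression: each segment's tokens
# are scored by their conditional perplexity given the text kept so far, and
# only high-perplexity tokens survive. The fixed threshold and the caller's
# segmentation are simplifications of the paper's ratio-driven scheme.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def compress_segment(context: str, segment: str, threshold: float) -> str:
    """Keep tokens of `segment` whose conditional perplexity exceeds `threshold`."""
    if context:
        ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    else:
        ctx_ids = torch.zeros((1, 0), dtype=torch.long)
    seg_ids = tokenizer(segment, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, seg_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # log-probability of each token given everything that precedes it
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    kept, offset = [], ctx_ids.shape[1]
    for i, tok in enumerate(seg_ids[0]):
        pos = offset + i
        if pos == 0:
            kept.append(tok.item())  # no preceding context, keep the first token
            continue
        token_ppl = torch.exp(-log_probs[pos - 1, tok]).item()
        if token_ppl > threshold:
            kept.append(tok.item())
    return tokenizer.decode(kept)

def compress_prompt(segments: list[str], threshold: float = 5.0) -> str:
    """Compress segments one by one, conditioning each on what was kept so far."""
    compressed = ""
    for seg in segments:
        compressed += compress_segment(compressed, seg, threshold)
    return compressed
```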
Distribution Alignment:
The paper introduces a concept known as "Distribution Alignment". This concept is a key step in bridging the gap between the compressed prompt and the expectations of the language model. When compressing prompts, there is a risk that the reduced version might not align well with the distribution patterns the language model is accustomed to. This misalignment can lead to inefficiencies or inaccuracies in how the model processes the compressed prompt.
To address this, the paper proposes a method to align the distribution of the compressed prompt with that of the language model. This is achieved through 'instruction tuning,' a process where a pre-trained small language model is instruction-tuned using data generated by the larger language model.
By aligning the distributions, the compressed prompts are better understood and processed by the language model. This alignment is essential for maintaining the effectiveness of the compression, ensuring that the language model continues to generate accurate and contextually relevant responses, despite the reduced prompt size.
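As a rough illustration of what such alignment could look like in practice, the sketch below fine-tunes a small model on (instruction, response) pairs whose responses were produced by the larger target LLM, using a standard causal language-modeling objective. The data and hyperparameters are placeholders, not the paper's setup.

```python
# Rough sketch of distribution alignment via instruction tuning: fine-tune the
# small model on (instruction, response) pairs whose responses came from the
# larger target LLM. Data and hyperparameters are placeholders.
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=5e-5)

# Hypothetical alignment data: responses generated by the large target LLM.
pairs = [
    ("Summarize the following article: ...", "The article argues that ..."),
]

model.train()
for epoch in range(3):
    for instruction, response in pairs:
        ids = tokenizer(instruction + "\n" + response,
                        return_tensors="pt", truncation=True).input_ids
        loss = model(ids, labels=ids).loss  # standard causal-LM objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```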
Findings and Implications of Prompt Compression Techniques
The paper presents results demonstrating the effectiveness of these prompt compression techniques with large language models like GPT-4. The reported experiments show that prompts can be compressed substantially, up to roughly 20x in some settings, with little loss in downstream performance on reasoning, summarization, and conversational benchmarks, which translates directly into lower inference latency and cost.
The paper concludes that prompt compression is not just a technical achievement but a necessary step towards making advanced language models more accessible and usable in diverse settings.
Conclusion
In conclusion, the paper is a key first step in the field of prompt compression. By effectively reducing prompt length while maintaining the integrity of the information, this research paves the way for more efficient and cost-effective use of LLMs. The methodologies proposed offer a baseline for future research and applications, highlighting the importance of data efficiency in the ever-evolving landscape of Generative AI.
Researchers: Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu