Reduce the latency and cost of LLM inference with prompt compression

January 18, 2024

A new paper from Microsoft proposes using small models to compress prompts before passing them to larger models such as GPT-4. The researchers achieved up to a 20x reduction in prompt tokens with some performance loss, or a 4x reduction with a performance increase. Performance here means producing the desired output[1].

Usage is straightforward:

import openai
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()

# `demonstrations` (the few-shot examples to compress) and `question` are
# assumed to be defined elsewhere, e.g. loaded from a dataset.
compressed_prompt = llm_lingua.compress_prompt(
    demonstrations,
    instruction="",
    question=question,
    target_token=200,
)

instruction = "Please reference the following examples to answer the math question,\n"
prompt = instruction + compressed_prompt["compressed_prompt"] + "\n\nQuestion: " + question

request_data = {
    "prompt": prompt,
    "max_tokens": 400,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
    "stop": "\r\n",
}
# The model name below is illustrative; use whichever completion model you target.
response = openai.Completion.create(
    "gpt-3.5-turbo-0301",
    **request_data,
)
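To get a feel for the cost side, here is a back-of-the-envelope sketch. The token counts, compression ratio, and per-token price below are illustrative assumptions, not figures from the paper:

```python
def prompt_cost(tokens: int, price_per_1k: float) -> float:
    """Cost in dollars of sending `tokens` prompt tokens at `price_per_1k` dollars per 1K tokens."""
    return tokens / 1000 * price_per_1k

# Hypothetical numbers: a 10,000-token few-shot prompt compressed 20x,
# at an assumed $0.03 per 1K input tokens.
original_tokens = 10_000
compressed_tokens = original_tokens // 20  # 500 tokens after compression
price_per_1k = 0.03

saving = prompt_cost(original_tokens, price_per_1k) - prompt_cost(compressed_tokens, price_per_1k)
print(f"Saved per request: ${saving}")
```

With these assumptions the prompt cost drops from $0.30 to $0.015 per request, and the shorter prompt also cuts the time the model spends ingesting input tokens, which is where the latency win comes from.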

There are four big challenges to deploying LLMs in production: performance, cost, latency, and security. This project addresses three of the four. It is even possible that this approach could help mitigate prompt injection, if a small model were trained to recognize and strip injection attempts.