Reduce the latency and cost of LLM inference with prompt compression

January 18, 2024

A new paper from Microsoft proposes using small models to compress prompts before passing them to larger models such as GPT-4. The researchers achieved up to a 20x reduction in prompt tokens with some performance loss, or a 4x reduction with a performance increase. Performance here means producing the desired output[1].

Usage is straightforward:

import openai
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()

# `demonstrations` (the few-shot examples to compress) and `question` are
# assumed to be defined elsewhere, e.g. loaded from a dataset.
compressed_prompt = llm_lingua.compress_prompt(
    demonstrations,
    instruction="",
    question=question,
    target_token=200,
)

instruction = "Please reference the following examples to answer the math question,\n"
prompt = instruction + compressed_prompt["compressed_prompt"] + "\n\nQuestion: " + question

request_data = {
    "prompt": prompt,
    "max_tokens": 400,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
    "stop": "\r\n",
}
# The model name below is illustrative; use whichever completion model you target.
response = openai.Completion.create(
    "gpt-3.5-turbo-0301",
    **request_data,
)
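To get a feel for the cost side, here is a back-of-the-envelope sketch. The token counts, compression ratio, and per-token price below are illustrative assumptions, not figures from the paper:

```python
def prompt_cost(tokens: int, price_per_1k: float) -> float:
    """Cost in dollars of sending `tokens` prompt tokens at `price_per_1k` dollars per 1K tokens."""
    return tokens / 1000 * price_per_1k

# Hypothetical numbers: a 10,000-token few-shot prompt compressed 20x,
# at an assumed $0.03 per 1K input tokens.
original_tokens = 10_000
compressed_tokens = original_tokens // 20  # 500 tokens after compression
price_per_1k = 0.03

saving = prompt_cost(original_tokens, price_per_1k) - prompt_cost(compressed_tokens, price_per_1k)
print(f"Saved per request: ${saving}")
```

With these assumptions the prompt cost drops from $0.30 to $0.015 per request, and the shorter prompt also cuts the time the model spends ingesting input tokens, which is where the latency win comes from.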

There are four big challenges to deploying LLMs in production: performance, cost, latency, and security. This project addresses three of the four. It is even possible that this approach could help mitigate prompt injection, if a small model were trained to recognize and strip injection attempts.