A new paper from Microsoft proposes using small models to compress prompts before passing them to larger models like GPT-4. The researchers achieved up to a 20x reduction in prompt tokens with some performance loss, or a 4x reduction with a performance increase. Performance here means producing the desired output[1].
Usage is straightforward:
import openai
from llmlingua import PromptCompressor

# prompt_complex (the original few-shot prompt) and question are assumed to be defined already.
llm_lingua = PromptCompressor()
compressed_prompt = llm_lingua.compress_prompt(
    prompt_complex.split("\n\n"),
    instruction="",
    question="",
    target_token=200,
    context_budget="*1.5",
    iterative_size=100,
)

instruction = "Please reference the following examples to answer the math question,\n"
prompt = instruction + compressed_prompt["compressed_prompt"] + "\n\nQuestion: " + question

request_data = {
    "prompt": prompt,
    "max_tokens": 400,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
    "stop": "\r\n",
}
response = openai.Completion.create(
    model="gpt-3.5-turbo-0301",  # pass the model as a keyword; a bare positional here is read as the API key
    **request_data,
)
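The value returned by compress_prompt is a dict; besides the compressed text it reports basic statistics about the compression. The field names below (origin_tokens, compressed_tokens, ratio) are taken from the project's README output and are an assumption rather than a guaranteed API, so treat this as a minimal sketch for checking how much was saved:

# Field names assumed from the LLMLingua README output; they may differ across versions.
print(compressed_prompt["origin_tokens"])      # tokens in the original prompt
print(compressed_prompt["compressed_tokens"])  # tokens after compression
print(compressed_prompt["ratio"])              # compression ratio, e.g. "11.2x"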
There are four big challenges to deploying LLMs in production: performance, cost, latency, and security. This project addresses three of the four. It might even help with the fourth: this approach could mitigate prompt injection if a small model were trained to recognize and strip injection attempts before compression; a rough hypothetical sketch follows.
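As an illustration of that last idea (purely hypothetical), a small classifier could score each context segment before compression and drop anything flagged as an injection attempt. The model name and label below are placeholders, not a real checkpoint, and this filtering step is not part of LLMLingua:

from transformers import pipeline

# Hypothetical: a small model fine-tuned to flag prompt-injection attempts.
# "example-org/prompt-injection-detector" and the "INJECTION" label are placeholders.
detector = pipeline("text-classification", model="example-org/prompt-injection-detector")

segments = prompt_complex.split("\n\n")
clean_segments = [
    seg for seg in segments
    if detector(seg, truncation=True)[0]["label"] != "INJECTION"
]

# Compress only the vetted segments, as in the example above.
compressed_prompt = llm_lingua.compress_prompt(
    clean_segments,
    instruction="",
    question="",
    target_token=200,
)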